WO2024053370A1 - Information processing device, information processing method, and program - Google Patents

Information processing device, information processing method, and program

Info

Publication number
WO2024053370A1
Authority
WO
WIPO (PCT)
Prior art keywords
unit
feature
session
data
time
Prior art date
Application number
PCT/JP2023/029935
Other languages
French (fr)
Japanese (ja)
Inventor
健人 中田
智佳子 浅井
慎吾 高松
Original Assignee
ソニーグループ株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ソニーグループ株式会社
Publication of WO2024053370A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • The present disclosure relates to an information processing device, an information processing method, and a program, and in particular to an information processing device, an information processing method, and a program that can efficiently search for and extract features effective in creating a machine learning model from time-series data.
  • the present disclosure has been made in view of this situation, and in particular, it is intended to enable efficient searching and extraction of feature quantities effective in creating a machine learning model from time-series data.
  • An information processing device and a program of the present disclosure include: a metadata generation unit that generates metadata of flow data including at least time-series data; an estimation unit that estimates, based on the metadata, a method for generating feature amounts from the series data constituting the flow data; and a feature amount generation unit that generates feature amounts from the series data using the generation method estimated by the estimation unit.
  • An information processing method includes the steps of: generating metadata of flow data including at least time-series data; estimating, based on the metadata, a method for generating feature amounts from the series data constituting the flow data; and generating feature amounts from the series data using the estimated generation method.
  • In the present disclosure, metadata of flow data including at least time-series data is generated, a method of generating a feature amount from the series data constituting the flow data is estimated based on the metadata, and a feature amount is generated from the series data using the estimated generation method.
  • FIG. 1 is a diagram illustrating flow data of the present disclosure.
  • FIG. 2 is a diagram illustrating examples of session units, time units, attribute data, and time series data in flow data.
  • FIG. 3 is a hardware block diagram illustrating a configuration example of an information processing device according to the present disclosure.
  • FIG. 4 is a functional block diagram illustrating functions realized by the UI control unit, data processing unit, and machine learning model generation unit in FIG. 3.
  • FIG. 5 is a diagram illustrating a configuration example of attribute data and time series data in flow data.
  • FIG. 6 is a diagram illustrating an example of a display image of a UI that prompts setting of a column for each session, a column for each time, and a prediction target column in flow data.
  • FIG. 7 is a diagram illustrating an example of a melt format as an output format.
  • FIG. 8 is a diagram illustrating an example of a pivot format as an output format.
  • FIG. 9 is a diagram illustrating an example of an output format when the ball speed column is set as a prediction target based on flow data related to a pitching log of a predetermined baseball batter.
  • FIG. 10 is a diagram illustrating an example of an output format when the result column is set as a prediction target based on flow data related to a pitching log of a predetermined baseball batter.
  • FIG. 3 is a diagram illustrating a method for generating feature amounts of time-series data.
  • FIG. 6 is a diagram illustrating an example of setting a window related to generation of feature amounts.
  • FIG. 7 is a diagram illustrating another setting example of a window related to generation of a feature amount.
  • FIG. 3 is a diagram illustrating selection of series data from which feature amounts are generated.
  • FIG. 3 is a diagram illustrating an example of generation of intra-session feature amount data.
  • FIG. 3 is a diagram illustrating an example of generation of inter-session feature amount data.
  • FIG. 6 is a diagram illustrating each example of a session ID, time unit, attribute data, time series data, intra-session feature amount, inter-session feature amount, and session set ID in the flow data of the present disclosure.
  • FIG. 3 is a diagram illustrating an example of presentation of feature amount data.
  • A flowchart explaining feature data generation processing.
  • A flowchart explaining generation source selection processing.
  • FIG. 12 is a flowchart illustrating intra-session feature amount data generation processing.
  • FIG. 7 is a diagram illustrating a modification example of clustering sessions.
  • FIG. 7 is a diagram illustrating a modification example of clustering sessions.
  • A diagram showing an example of the configuration of a general-purpose computer.
  • flow data is a data set consisting of a plurality of time-series data
  • a technique for efficiently searching and extracting feature quantities effective for creating a machine learning model from flow data will be described.
  • Flow data is a data set that requires one or more time-series data and can optionally include one or more attribute data. That is, flow data always includes at least one piece of time-series data, whereas attribute data may be absent or may be present in any number.
  • time-series data is data that changes over time
  • attribute data is data that does not change over time
  • Heartbeat, respiratory rate per unit time, and the operation log of a measuring device are time-series data, and each patient's gender, weight, etc. are attribute data.
  • flow data is constructed with one patient as a set unit.
  • the sensor data that can be obtained from the robot arm becomes time series data, and the number of failures for each individual becomes attribute data.
  • the speed of a pitched ball in the at-bat becomes time-series data
  • the information about the pitcher and batter becomes attribute data
  • flow data is constructed with one turn at bat as a set unit.
  • The flow data is composed of time-series data consisting of data Dt1, Dt2, etc., which are measured in time series at the timings indicated by circles on the time axis indicated by the arrow, and attribute data including data Da1, such as the gender and weight of the person being measured, and data Da2, such as the name of the measuring device and the setting values of the measuring device.
  • data Dt1 and Dt2 consisting of patient's vital signals are time series data
  • data Da2 of the device name and setting value of the measuring device are attribute data.
  • The time intervals between the individual time-series samples indicated by circles may be uneven, as shown by the intervals T1 and T2, or, although not shown, may be even.
  • flow data consisting of time series data and attribute data constitutes one set for each patient, each measurement device, each set value, etc.
  • this set unit is referred to as a session.
  • a collection of flow data configured under predetermined conditions is a session SS.
  • various prediction targets are predicted based on the flow data consisting of a plurality of sessions SS.
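The relationship between sessions, time-series data, and attribute data described above can be sketched as a small data structure. This is only an illustrative sketch: the class name `Session`, its field names, and the sample values are assumptions made for the example and do not come from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Session:
    """One session: a set unit of flow data (for example, one patient)."""
    session_id: str
    # Each series is a list of (time, value) samples; intervals may be uneven.
    time_series: Dict[str, List[Tuple[float, float]]]
    # Attribute data does not change over time and is optional.
    attributes: Dict[str, object] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Flow data requires at least one piece of time-series data.
        if not self.time_series:
            raise ValueError("a session needs at least one time series")


# Illustrative values only; they are not taken from the source.
patient = Session(
    session_id="patient-001",
    time_series={"heartbeat": [(0.0, 72.0), (1.0, 75.0), (2.5, 71.0)]},
    attributes={"gender": "F", "weight_kg": 58.0},
)

# Flow data is then a collection of such sessions.
flow_data = [patient]
```

Keeping the sample times explicit, rather than assuming a fixed sampling rate, is one way to accommodate the uneven intervals the text allows.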
  • FIG. 2 summarizes examples of session units, time units, attribute data, and time-series data for the cases where hospital vital logs, factory robot logs, and baseball pitching logs each constitute flow data.
  • an example of a session unit is a patient
  • an example of a time unit is a date and time
  • an example of attribute data is a patient's gender
  • an example of time series data is a heartbeat signal.
  • an example of session unit is robot
  • an example of time unit is date and time
  • an example of attribute data is the number of robot failures
  • an example of time series data is the torque sensor signal.
  • an example of the session unit is a turn at bat
  • an example of the time unit is the number of pitches in an at bat
  • an example of attribute data is a pitcher's left/right pitching
  • An example of time series data is ball speed.
  • flow data exists as various entities, and is data that can be generated in large quantities in the future as IoT becomes more widespread.
  • time series data being at equal intervals and only being able to predict future values of time series data.
  • the user can easily generate feature amounts that are effective for a wide range of tasks by inputting the minimum settings to the flow data.
  • The information processing device 31 includes a control unit 51, an input unit 52, an output unit 53, a storage unit 54, a communication unit 55, a drive 56, and a removable storage medium 57, which are connected to each other via a bus 58 and can send and receive data and programs.
  • the control unit 51 is composed of a processor and a memory, and controls the entire operation of the information processing device 31.
  • the control unit 51 also includes a UI control unit 61, a data processing unit 62, and a machine learning model generation unit 63.
  • When the UI control unit 61 receives input of flow data, it generates a UI (User Interface) that prompts the input of a column indicating time, a column indicating a session unit, and a column to be predicted as task settings, and presents the UI by controlling the display unit 71 and the audio output unit 72 of the output unit 53.
  • The UI control unit 61 receives the task settings that the user inputs in response by operating the input unit 52, and outputs them to the data processing unit 62 together with the input flow data.
  • The UI control unit 61 also controls the display unit 71 and the audio output unit 72 of the output unit 53 to present information on the feature amounts generated by the data processing unit 62 to the user.
  • The data processing unit 62 acquires the flow data and task settings supplied from the UI control unit 61, generates feature amounts that are effective in generating a machine learning model (hereinafter also referred to as effective feature amounts) as feature data, and outputs the data to the UI control unit 61 and the machine learning model generation unit 63.
  • the machine learning model generation unit 63 generates a machine learning model based on feature amount data consisting of effective feature amounts supplied from the data processing unit 62.
  • the input unit 52 is composed of input devices such as a keyboard, a mouse, and a touch panel through which the user inputs operation commands, and supplies various input signals to the control unit 51.
  • the output section 53 is controlled by the control section 51 and includes a display section and an audio output section.
  • the output unit 53 outputs and displays images of the operation screen and processing results on a display unit 71 that is a display device such as an LCD (Liquid Crystal Display) or an organic EL (Electro Luminescence).
  • the output unit 53 also controls an audio output unit 72 consisting of an audio output device to reproduce various voices, music, sound effects, and the like.
  • the storage unit 54 is composed of an HDD (Hard Disk Drive), an SSD (Solid State Drive), or a semiconductor memory, and is controlled by the control unit 51 to write or read various data and programs.
  • The communication unit 55 is controlled by the control unit 51, realizes wired or wireless communication such as LAN (Local Area Network) and Bluetooth (registered trademark), and sends and receives various data and programs to and from various devices via a network as necessary.
  • The drive 56 reads and writes data from and to a removable storage medium 57 such as a magnetic disk (including flexible disks), an optical disk (including CD-ROMs (Compact Disc-Read Only Memory) and DVDs (Digital Versatile Discs)), a magneto-optical disk (including MDs (Mini Discs)), or a semiconductor memory.
  • the UI control unit 61 includes a flow data input unit 101, a task setting unit 102, and a generated feature amount visualization unit 103.
  • The flow data input unit 101 receives operation input from the input unit 52 and input of flow data from at least one of the storage unit 54, the communication unit 55, and the removable storage medium 57 via the drive 56, and outputs the flow data to the data processing unit 62 and the generated feature amount visualization unit 103.
  • When the task setting unit 102 acquires the column estimation results of the flow data supplied from the data processing unit 62, it generates a UI that shows the column estimation results, adds to it information prompting the input of a column indicating time, a column indicating the session unit, and a column to be predicted as task settings, and displays it by controlling the display unit 71 and the audio output unit 72 of the output unit 53.
  • the task setting unit 102 may further prompt the user to input the prediction frequency and prediction time of the prediction target column as task settings.
  • When the user, prompted by this UI, operates the input unit 52 to set a column indicating time, a column indicating a session unit, a prediction target column, and, if necessary, the prediction frequency and prediction time of the prediction target column, the task setting unit 102 outputs the resulting task setting information to the data processing unit 62.
  • When the generated feature amount visualization unit 103 acquires the flow data supplied from the flow data input unit 101 and the feature data consisting of the effective features supplied from the data processing unit 62, it visualizes them as a UI and presents the UI by controlling the display unit 71 and the audio output unit 72 of the output unit 53.
  • The data processing unit 62 includes a column estimation unit 121, an output format determination unit 122, a generation source selection unit 123, an intra-session feature generation unit 124, a feature selection unit 125, an inter-session feature generation unit 126, a combining unit 127, a feature data storage 128, and a loop determination unit 129.
  • The column estimation unit 121 analyzes the data format of the flow data supplied from the UI control unit 61, estimates columns that can serve as the column indicating time, the column indicating session units, and so on, and outputs the column estimation results to the UI control unit 61.
  • The output format determination unit 122 determines the output format of the flow data based on the task settings supplied from the task setting unit 102 of the UI control unit 61, namely the column indicating time, the column indicating the session unit, and the column to be predicted, and outputs it to the generation source selection unit 123.
  • the output format determining unit 122 determines an output format that also takes into consideration the prediction frequency and prediction time of the prediction target sequence.
  • The generation source selection unit 123 selectively extracts, from the flow data, the series data from which feature quantities are to be generated, according to the output format supplied from the output format determination unit 122, and outputs the result to the intra-session feature generation unit 124.
  • The intra-session feature generation unit 124 generates intra-session feature data based on the series data required for feature generation, supplied from the generation source selection unit 123, and outputs it to the feature selection unit 125.
  • the intra-session feature generation unit 124 includes a metadata extraction unit 124a and an estimation model 124b.
  • The metadata extraction unit 124a extracts metadata consisting of the number of time series in the flow data, the sequence length (number of samples per sequence), statistical values of each variable (average, variance, etc.), and the like, and outputs it to the estimation model 124b.
  • The estimation model 124b is a model trained in advance on pairs of metadata and the feature generation method used for predicting the prediction target; based on the metadata, it estimates the feature generation method required for predicting the prediction target.
  • The intra-session feature generation unit 124 generates intra-session feature data from the series data selected as the generation source by the generation source selection unit 123, using the feature generation method that the estimation model 124b estimated from the metadata.
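As a rough illustration of the kind of metadata the metadata extraction unit 124a is described as producing (number of series, sequence length, per-variable statistics), the following sketch summarises a dictionary of series. The function name `extract_metadata`, the dictionary layout, and the choice of population variance are assumptions for the example; the trained estimation model 124b itself is not reproduced here.

```python
from statistics import mean, pvariance
from typing import Dict, List


def extract_metadata(series: Dict[str, List[float]]) -> dict:
    """Summarise series data: series count, sequence lengths, per-variable stats."""
    return {
        "num_series": len(series),
        "sequence_length": {name: len(values) for name, values in series.items()},
        "mean": {name: mean(values) for name, values in series.items()},
        # Population variance is an assumption; the source only says "variance".
        "variance": {name: pvariance(values) for name, values in series.items()},
    }


# Ball speeds of the first at-bat in the pitching-log example.
metadata = extract_metadata({"ball_speed": [140.0, 150.0, 120.0]})
```

A metadata record of this shape could then be passed to whatever model maps metadata to a feature generation method.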
  • The feature selection unit 125 calculates an effectiveness score, related to prediction of the prediction target, for each of the feature quantities constituting the intra-session feature data and the inter-session feature data, selects the feature quantities whose score is higher than a predetermined effectiveness score, and excludes the others to reconstruct the intra-session feature data and the inter-session feature data.
  • the feature selection unit 125 includes an intra-session feature selection unit 141, an inter-session feature selection unit 142, and an effectiveness score calculation unit 143.
  • the intra-session feature quantity selection unit 141 controls the effectiveness score calculation unit 143 to calculate an effectiveness score for each of the feature quantities constituting the intra-session feature data, and selects a feature quantity higher than a predetermined effectiveness score. By selecting and excluding feature values lower than a predetermined effectiveness score, the intra-session feature data is reconfigured and output to the inter-session feature generation unit 126 and the combining unit 127.
  • the inter-session feature selection unit 142 controls the effectiveness score calculation unit 143 to calculate an effectiveness score for each of the features forming the inter-session feature data supplied from the inter-session feature generation unit 126. , by selecting feature quantities higher than a predetermined effectiveness score and excluding feature quantities lower than a predetermined effectiveness score, the inter-session feature data is reconfigured and output to the combining unit 127.
  • The effectiveness score calculation unit 143 calculates, for example, the mutual information with the prediction target as the effectiveness score for each of the features forming the intra-session feature data and the inter-session feature data, and outputs it to the intra-session feature quantity selection unit 141, the inter-session feature quantity selection unit 142, and the loop determination unit 129.
  • the effectiveness score calculation unit 143 may calculate the accuracy of the machine learning model generated using the intra-session feature data and the inter-session feature data as the effectiveness score.
  • the machine learning model used is a machine learning model determined by a simpler machine learning algorithm or hyperparameter than the machine learning model generated by the machine learning model generation unit 63.
  • In this case, the intra-session feature quantity selection unit 141 and the inter-session feature quantity selection unit 142 reconstruct the intra-session feature data and the inter-session feature data by selecting a subset of the features constituting each of them such that the effectiveness score calculated from the accuracy etc. of the generated machine learning model does not fall below a predetermined value.
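The mutual-information variant of the effectiveness score, followed by threshold-based selection, can be sketched as below. This is a minimal sketch assuming discrete (or already discretised) feature values; the function names and the toy data are hypothetical, and the source does not specify how continuous features are handled.

```python
from collections import Counter
from math import log2
from typing import Dict, List, Sequence


def mutual_information(x: Sequence, y: Sequence) -> float:
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (count / n) * log2((count / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), count in pxy.items()
    )


def select_features(
    features: Dict[str, List], target: List, threshold: float
) -> Dict[str, List]:
    """Keep only the features whose effectiveness score exceeds the threshold."""
    scores = {name: mutual_information(col, target) for name, col in features.items()}
    return {name: col for name, col in features.items() if scores[name] > threshold}


# Toy data: one feature determines the target, the other is unrelated to it.
target = ["hit", "out", "hit", "out"]
features = {"informative": [1, 0, 1, 0], "noise": [0, 0, 1, 1]}
kept = select_features(features, target, threshold=0.5)
```

With real-valued features such as ball speed, the values would first need to be binned, or a different estimator used, before a score of this kind is meaningful.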
  • The inter-session feature generation unit 126 generates inter-session feature data based on the reconstructed intra-session feature data, which is output from the feature selection unit 125 and consists of features higher than a predetermined effectiveness score, and outputs it to the feature selection unit 125.
  • The combining unit 127 combines the reconstructed intra-session feature data and inter-session feature data supplied from the feature selection unit 125, each consisting of features higher than a predetermined effectiveness score, to form feature data, and stores it in the feature data storage 128.
  • the feature data storage 128 stores the feature data supplied from the combining unit 127, and also supplies the stored feature data to the loop determination unit 129 as needed.
  • The loop determination unit 129 calculates the overall effectiveness score of the feature data in predicting the prediction target, for example as the overall average value, based on the effectiveness scores of the feature amounts constituting the feature data, stored in the feature data storage 128, that combines the intra-session feature data and the inter-session feature data.
  • The loop determination unit 129 can instruct the generation source selection unit 123 to run the process again in a loop so that more feature amounts than the current number are extracted from the same flow data.
  • The loop determination unit 129 outputs the feature data stored in the feature data storage 128 at that time, together with information on the overall effectiveness score of the feature data, to the UI control unit 61 and the machine learning model generation unit 63.
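The loop behaviour can be pictured with the sketch below. The source states that the overall score is, for example, the average of the per-feature scores and that the loop requests more features than the current number; the stopping rule used here (stop once the average reaches a threshold, or after `max_rounds` rounds) is an assumption, as are all names in the code.

```python
from statistics import mean
from typing import Callable, Dict


def generate_until_effective(
    generate: Callable[[int], Dict[str, float]],
    threshold: float,
    max_rounds: int = 5,
) -> Dict[str, float]:
    """Repeat feature generation, requesting more features each round,
    until the mean effectiveness score reaches the threshold (assumed rule)."""
    num_features = 1
    scores = generate(num_features)  # maps feature name -> effectiveness score
    for _ in range(max_rounds - 1):
        if mean(list(scores.values())) >= threshold:
            break
        num_features += 1  # ask for more features than the current number
        scores = generate(num_features)
    return scores
```

Here `generate` stands in for the whole chain from generation source selection to scoring; capping the number of rounds simply keeps the sketch from looping forever when the threshold is never reached.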
  • the generated feature amount visualization unit 103 visualizes and presents the generated feature amount data and information on the overall effectiveness score of the feature amount data as a UI.
  • The machine learning model generation unit 63 may generate a machine learning model based on the supplied feature data.
  • Task settings are settings that enable tasks such as predicting the future value of time-series data, predicting whether a specific event will occur in time-series data, and predicting non-time-series data (data that does not change with time) from flow data.
  • The task settings specify a column indicating time in the flow data, a column indicating a session unit, and a column to be predicted, and, if necessary, also the prediction frequency and prediction time for the prediction target column.
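The task settings listed above map naturally onto a small configuration record. The sketch below is only one possible representation: the class name, field names, and the use of strings for the optional prediction frequency and prediction time are assumptions, since the source does not say how these values are encoded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskSettings:
    """Minimal task settings: three required columns, two optional extras."""
    time_column: str
    session_column: str
    target_column: str
    # Optional settings; their representation is an assumption.
    prediction_frequency: Optional[str] = None
    prediction_time: Optional[str] = None


# Example for the pitching-log data: predict ball speed per pitch.
settings = TaskSettings(
    time_column="pitch_id",
    session_column="at_bat_id",
    target_column="ball_speed",
)
```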
  • FIG. 5 shows an example of flow data related to a pitching log of a predetermined baseball batter.
  • Flow data FD in FIG. 5 is composed of attribute data AD and time series data TD.
  • the attribute data AD is composed of three data columns, which from the left in the figure are a pitcher ID column, a turn at bat ID column, and a result column.
  • The result column is a column in which the result of a given batter's turn at bat, identified by the at-bat ID, against pitches by the pitcher identified by the pitcher ID is registered; in the figure, from the top, “hit”, “out”, and “out” are registered.
  • The time-series data TD is composed of three data columns, from the left in the figure: a turn-at-bat ID column, a pitch ID column, and a ball speed column.
  • The at-bat ID column is a column in which IDs that identify a given batter's at-bat are registered, and in the figure, from the top, at-bat IDs 0, 0, 0, 1, 1, 2, 2, 2 are registered.
  • The pitch ID column is a column in which IDs identifying pitches pitched by a pitcher to a predetermined batter are registered in chronological order, and in the figure, from the top, pitch IDs 0, 1, 2, 0, 1, 0, 1, 2 are registered.
  • The ball speed column is a column in which the speed (km/h) of each ball pitched to a given batter, in the at-bat identified by the at-bat ID, by the pitcher identified by the pitcher ID is registered; in the figure, from the top, 140, 150, 120, 120, 110, 90, 130, and 155 are registered.
  • For example, in the at-bat identified by at-bat ID 0, the ball speed of the second pitch, identified by pitch ID 1, is registered as 150 km/h, and in the at-bat identified by at-bat ID 1, the speed of the second pitch, identified by pitch ID 1, is registered as 110 km/h.
  • The information in the pitch ID column of the time-series data TD is registered in chronological order, so it is treated as a time column.
  • A common turn-at-bat ID column exists as a session column in each of the time-series data TD and the attribute data AD.
  • The pitcher ID column can also be regarded as a clustered set (session cluster) one level above the turn-at-bat ID column that serves as the session column.
  • The time column may hold values whose order is known (float, int) or a date/time type (YY:MM:DD hh:mm:ss, etc.).
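The pitching-log example can be written out as two small tables joined by the shared at-bat ID. The sketch below uses the at-bat IDs, pitch IDs, ball speeds, and results listed in the text; the pitcher IDs A, B, A are taken from the values given later for the pivot-format output, and the key names and the helper `group_by_session` are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Attribute data AD: one row per at-bat.
attribute_data = [
    {"pitcher_id": "A", "at_bat_id": 0, "result": "hit"},
    {"pitcher_id": "B", "at_bat_id": 1, "result": "out"},
    {"pitcher_id": "A", "at_bat_id": 2, "result": "out"},
]

# Time-series data TD: one row per pitch.
at_bat_ids = [0, 0, 0, 1, 1, 2, 2, 2]
pitch_ids = [0, 1, 2, 0, 1, 0, 1, 2]
ball_speeds = [140, 150, 120, 120, 110, 90, 130, 155]
time_series_data = [
    {"at_bat_id": a, "pitch_id": p, "ball_speed": s}
    for a, p, s in zip(at_bat_ids, pitch_ids, ball_speeds)
]


def group_by_session(rows: List[dict], session_column: str, time_column: str) -> Dict:
    """Group rows by the session column and order each group by the time column."""
    sessions = defaultdict(list)
    for row in rows:
        sessions[row[session_column]].append(row)
    for session_rows in sessions.values():
        session_rows.sort(key=lambda r: r[time_column])
    return dict(sessions)


sessions = group_by_session(time_series_data, "at_bat_id", "pitch_id")
```

Sorting by the time column inside each session only requires that the values have a known order, which matches the statement that the time column may be numeric or a date/time type.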
  • The column estimation unit 121 of the data processing unit 62 estimates, for example, a time column and a session column as shown in FIG. 5, and supplies the result to the task setting unit 102 of the UI control unit 61 as a column estimation result.
  • The task setting unit 102 controls the display unit 71 and the audio output unit 72 of the output unit 53 based on the column estimation results to present the flow data to the user.
  • The task setting unit 102 also presents a UI that prompts the user to set a column indicating a time unit, a column indicating a session unit, and a prediction target column as task settings, and outputs the task settings made through the UI to the output format determination unit 122 of the data processing unit 62.
  • the task setting unit 102 presents a display image PV consisting of a UI as shown in FIG. 6, for example.
  • attribute data AD is displayed on the left side
  • time series data TD is displayed on the right side.
  • In this example, the turn-at-bat ID column indicated by a dotted line is set as the column indicating session units, the pitch ID column indicated by a dashed line is set as the column indicating time units, and the ball speed column indicated by a solid line is set as the prediction target.
  • The task setting unit 102 outputs, to the output format determination unit 122, information on the column indicating time units, the column indicating session units, and the prediction target column, which are set using frames such as the dotted, dashed, and solid lines shown in the figure.
  • the prediction frequency and predicted time of the prediction target column may be input as task settings.
  • The output format determination unit 122 determines the output format based on the settings supplied from the task setting unit 102 of the UI control unit 61: the column indicating time, the column indicating the session unit, the prediction target column, and the prediction frequency and prediction time of the prediction target column.
  • Examples of the output format include the melt format shown in FIG. 7 and the pivot format shown in FIG. 8.
  • the melt format in FIG. 7 is composed of an id column, a time column, a name column, and a value column from the left.
  • The name column constitutes a session unit, the id column is a session cluster that groups session units, the time column is a sampling time column, and the value column is a sampled time-series data column.
  • In FIG. 7, there are two upper session clusters, A and B, that group session units; within each session unit there are two series, x and y, and times t1 and t2 are set in each series.
  • As time-series data, x(A, t1), x(A, t2), y(A, t1), y(A, t2), x(B, t1), x(B, t2), y(B, t1), and y(B, t2) are registered from the top.
  • The pivot format in FIG. 8 is composed of an id column, a time column, a value x column, and a value y column.
  • The sampling time column is shared between the two series x and y for each session, and the value x and value y columns are registered in parallel.
  • x(A, t1), x(A, t2), x(B, t1), and x(B, t2) are registered from the top in the value x column, and y(A, t1), y(A, t2), y(B, t1), and y(B, t2) are registered from the top in the value y column.
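The relationship between the two formats can be made concrete with a small conversion from melt rows to pivot rows. This is a sketch only: the key names (`id`, `time`, `name`, `value`, `value_x`, `value_y`) mirror the column names in the text, the placeholder strings such as `x(A,t1)` stand in for actual sampled values, and the function `melt_to_pivot` is hypothetical.

```python
from typing import Dict, List, Tuple

# Melt format: one row per (id, name, time) with a single value column,
# in the row order given in the text.
melt_rows = [
    {"id": sid, "time": t, "name": name, "value": f"{name}({sid},{t})"}
    for sid in ("A", "B")
    for name in ("x", "y")
    for t in ("t1", "t2")
]


def melt_to_pivot(rows: List[dict]) -> List[dict]:
    """Turn melt rows into pivot rows with one value column per series name."""
    pivot: Dict[Tuple[str, str], dict] = {}
    for row in rows:
        key = (row["id"], row["time"])
        out = pivot.setdefault(key, {"id": row["id"], "time": row["time"]})
        out["value_" + row["name"]] = row["value"]
    return list(pivot.values())


pivot_rows = melt_to_pivot(melt_rows)
```

The pivot form relies on the series sharing their sampling times within a session; the melt form does not, which is why it can also hold series sampled at different times.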
  • the output format determining unit 122 determines, for example, the output format FIS1 as shown on the right side of FIG. 9.
  • The output format FIS1 on the right side of FIG. 9 is composed of the melt format described above, and its columns include a ball speed column and a previous at-bat result column.
  • 140, 150, 120, 120, 110, 90, 130, and 155 are registered from the top in the ball speed column; NaN, 140, 150, NaN, 120, NaN, 90, and 130 are registered from the top in a second column (each value equals the ball speed of the preceding pitch in the same at-bat); and NaN, NaN, NaN, hit, hit, out, out, and out are registered from the top in the previous at-bat result column.
  • The data in the ball speed column is formatted as one row per time (one row per pitch) so that it is arranged as time-series data.
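The two derived columns in this output format behave like lag features: one looks back one pitch within the same at-bat, the other looks back one at-bat. The sketch below reproduces the listed values from the raw columns, using `None` where the text shows NaN. The function names are hypothetical, and the code assumes the rows are already ordered by session and then by time.

```python
from typing import Dict, List, Optional

at_bat_ids = [0, 0, 0, 1, 1, 2, 2, 2]
ball_speeds = [140, 150, 120, 120, 110, 90, 130, 155]
results = {0: "hit", 1: "out", 2: "out"}  # result per at-bat ID


def previous_within_session(session_ids: List[int], values: List[int]) -> List[Optional[int]]:
    """Value of the preceding row in the same session, or None at a session start."""
    out: List[Optional[int]] = []
    for i, sid in enumerate(session_ids):
        same_session = i > 0 and session_ids[i - 1] == sid
        out.append(values[i - 1] if same_session else None)
    return out


def previous_session_result(
    session_ids: List[int], result_by_session: Dict[int, str]
) -> List[Optional[str]]:
    """Result of the session preceding each row's session, or None for the first one."""
    order = sorted(result_by_session)
    previous = {
        sid: (result_by_session[order[k - 1]] if k > 0 else None)
        for k, sid in enumerate(order)
    }
    return [previous[sid] for sid in session_ids]


prev_speed = previous_within_session(at_bat_ids, ball_speeds)
prev_result = previous_session_result(at_bat_ids, results)
```

The first feature is an intra-session quantity and the second an inter-session one, which mirrors the split between the intra-session and inter-session feature generation units.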
  • When the turn-at-bat ID column indicated by a dotted line is set as the column indicating session units, the pitch ID column indicated by a dash-dotted line is set as the column indicating time units, and the result column is set as the prediction target, the output format determination unit 122 determines, for example, the output format FIS2 illustrated on the right side of FIG. 10.
  • The output format FIS2 on the right side of FIG. 10 is composed of the pivot format described with reference to FIG. 8, and consists of, from the left, a pitcher ID column, a turn-at-bat ID column indicating the session unit, a result column, a column of the average ball speed for each turn at bat, and a previous at-bat result column.
  • A, B, and A are registered from the top in the pitcher ID column; 0, 1, and 2 in the turn-at-bat ID column; hit, out, and out in the result column; 145, 115, and 110 in the column of the average ball speed for each at-bat; and NaN, hit, and out in the previous at-bat result column.
  • the time-series data is the average ball speed for each turn at bat, and is in a format in which features are added that aggregate time information using statistics.
  • the feature quantity is configured as a vector whose elements are a plurality of statistical quantities for each series data obtained in the time direction from the time series data.
  • a predetermined series of time series data is expressed by a waveform Ldt that changes in the time direction as shown in FIG.
  • a window with a time width w is set, and the values of the waveform Ldt in each window are obtained as partial series X1, X2, X3, ....
  • the values of the waveform Ldt at future times t11, t12, t13, ... corresponding to each of the partial series X1, X2, X3, ... are acquired as prediction targets y1, y2, y3, ....
  • the partial series are converted into f(X1), f(X2), f(X3), ..., a vector whose elements are the converted statistics and the prediction targets y1, y2, y3, ... is constructed, and a feature quantity is formed for each series.
  • the column consisting of the feature amount of the predetermined series data and the prediction target column are expressed, for example, as in the following equations (1) and (2).
  • Fs is a feature amount of predetermined series data
  • f(X1), f(X2), f(X3), ... are elements, each consisting of the statistics of the corresponding partial series Xn of the predetermined series data expressed by the waveform Ldt.
  • y1, y2, y3, ... are the prediction targets corresponding to the partial series X1, X2, X3, ....
  • each element f(Xn) constituting the feature amount Fs of the series data corresponding to the partial series Xn is expressed, for example, as in the following equation (3).
  • f(Xn) is each element of the feature amount Fs of the series data of the subsequence Xn
  • Ave(Xn) is the average value of the subsequence Xn
  • Min(Xn) is the minimum value of the subsequence Xn.
  • Max(Xn) is the maximum value of the subsequence Xn
  • Var(Xn) is the variance of the subsequence Xn
  • Stde(Xn) is the standard deviation of the subsequence Xn.
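Equations (1) to (3) are not preserved in this text. A reconstruction that is consistent with the symbol definitions above might read as follows; the layout and the symbol Y for the prediction target column are assumptions, not the original notation.

```latex
\begin{align}
F_s &= \bigl(f(X_1),\ f(X_2),\ f(X_3),\ \dots\bigr) \tag{1}\\
Y &= \bigl(y_1,\ y_2,\ y_3,\ \dots\bigr) \tag{2}\\
f(X_n) &= \bigl(\mathrm{Ave}(X_n),\ \mathrm{Min}(X_n),\ \mathrm{Max}(X_n),\ \mathrm{Var}(X_n),\ \mathrm{Stde}(X_n)\bigr) \tag{3}
\end{align}
```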
  • each element f(Xn) of the feature amount Fs of the subsequence Xn in equation (3) is expressed as a vector with each statistic as an element, but it may instead be expressed as a weighted sum of products with a kernel function using each statistic (a convolution kernel).
  • for the convolution kernel, see, for example, https://arxiv.org/abs/1910.13051.
  • the windows forming the above-mentioned partial series Xn may be set using various methods.
  • a series Xn may be set.
  • the prediction start time is, for example, the reference time ts
  • the prediction start time is set in a predetermined time width ws while changing the offset offset-fs from the reference time ts.
  • the partial sequence Xn may be set using the window WS as a unit.
  • the time width may be set from the session start time tb to the time tos, which is offset by a predetermined time from the prediction execution time ts.
  • a partial sequence may be set in units of windows WA in which the entire range from the session start time tb to the session end is set as the time width.
  • the specific value Ldt(s) when shifted a certain period of time from time tos may be obtained as a partial sequence.
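Two of the window settings described above, a fixed-width sliding window and a window that grows from the session start, can be sketched as follows. This is only an illustration under stated assumptions: the function names and the `horizon` parameter are hypothetical, and population variance and standard deviation are used because the text does not say which variant is meant.

```python
from statistics import mean, pstdev, pvariance


def stats_vector(window):
    """f(Xn): (Ave, Min, Max, Var, Stde) of one partial series."""
    return (mean(window), min(window), max(window),
            pvariance(window), pstdev(window))


def sliding_features(series, w, horizon=1):
    """Pairs (f(Xn), yn) for windows of width w; yn lies `horizon` steps after the window."""
    pairs = []
    for start in range(len(series) - w - horizon + 1):
        window = series[start:start + w]
        target = series[start + w + horizon - 1]
        pairs.append((stats_vector(window), target))
    return pairs


def expanding_features(series, horizon=1):
    """Pairs (f(Xn), yn) where each window runs from the session start to the current time."""
    pairs = []
    for end in range(1, len(series) - horizon + 1):
        window = series[:end]
        target = series[end + horizon - 1]
        pairs.append((stats_vector(window), target))
    return pairs


example = sliding_features([1, 2, 3, 4, 5], w=3)
```

With the series `[1, 2, 3, 4, 5]` and `w=3`, the sliding variant yields two pairs: the statistics of `[1, 2, 3]` paired with target 4, and those of `[2, 3, 4]` paired with target 5.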
  • the information that is the generation source of the feature amount (hereinafter referred to as the generation source feature amount) is generated as vectorized information for each series of data.
  • the generation source selection unit 123 determines whether the time series data extracted from the flow data, and the series data including attribute data, are useful as generation sources of feature amounts, and excludes them as necessary.
  • series data L1 to L3 exist as time series data that can be used as generation sources for a machine learning model that predicts the prediction target T.
  • the generation source selection unit 123 determines whether each of the series data L1 to L3 is appropriate as a generation source of a feature amount used for predicting the prediction target. More specifically, for example, for the series data L1, the generation source selection unit 123 extracts, in time series, a generation source feature amount F(tn) consisting of statistics Fa to Fd as the time-series generation source feature amount used for predicting the prediction target T, and generates a feature table TB.
  • the statistical quantities Fa to Fd mentioned here correspond to Ave (Xn), Min (Xn), Max (Xn), Var (Xn), Stde (Xn), etc. in equation (3) mentioned above.
  • the generation source feature amount F(tn) corresponds to each element f(Xn) that constitutes the feature amount Fs of the series data.
  • the generation source selection unit 123 excludes the series data L1 from the feature amount, for example, when there is no time-series change in the series data L1 and no correlation with the prediction target is recognized.
  • the prediction model PM is a relatively simple and lightweight prediction model, and is a model for easily predicting the prediction target T based on the generation source feature of a predetermined series.
  • the generation source selection unit 123 calculates the prediction accuracy PA by comparing the prediction target T with the prediction result T', and when the prediction accuracy PA is lower than a predetermined threshold value, excludes the series data L1 from the generation sources used for predicting the prediction target.
  • the generation source selection unit 123 similarly determines the prediction accuracy PA not only for the sequence data L1 but also for each of the sequence data L2 and L3, and excludes sequences lower than a predetermined prediction accuracy from the generation sources.
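The selection described above can be sketched as follows. The text does not specify the lightweight prediction model PM or the accuracy measure, so this sketch assumes a one-feature linear model on the window mean and uses its coefficient of determination (the squared Pearson correlation) as the prediction accuracy PA. All names and the sample series are hypothetical.

```python
from statistics import mean


def window_means(series, w):
    """One statistic per window; window i covers series[i:i+w] and predicts index i+w."""
    return [mean(series[i:i + w]) for i in range(len(series) - w)]


def r_squared(xs, ys):
    """R^2 of a one-feature least-squares line, i.e. squared Pearson correlation."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation means nothing can be explained
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)


def select_sources(candidates, target, w, accuracy_threshold):
    """Keep series whose simple model reaches the accuracy threshold."""
    selected = {}
    for name, series in candidates.items():
        if max(series) == min(series):
            continue  # no time-series change: excluded outright
        feats = window_means(series, w)
        accuracy = r_squared(feats, target[w:w + len(feats)])
        if accuracy >= accuracy_threshold:
            selected[name] = accuracy
    return selected


target = [0, 1, 2, 3, 4, 5, 6, 7]
candidates = {
    "L1": [0, 1, 2, 3, 4, 5, 6, 7],   # follows the target
    "L2": [5, 5, 5, 5, 5, 5, 5, 5],   # constant
    "L3": [1, 0, 1, 0, 1, 0, 1, 0],   # changes, but uncorrelated here
}
selected = select_sources(candidates, target, w=2, accuracy_threshold=0.5)
```

Because each candidate is scored with a deliberately cheap model, unhelpful series can be discarded before the more expensive feature generation starts.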
  • the intra-session feature generation unit 124 controls the metadata extraction unit 124a to extract, based on the flow data, for example, the number of sequences, the sequence length, the variance of each sequence, and the number of attribute data in the flow data as metadata of the flow data.
  • the metadata extraction unit 124a extracts metadata from the series data selected as the generation source by the generation source selection unit 123 from among the flow data.
  • the metadata of the flow data may be a machine learning model or algorithm generated by the machine learning model generation unit 63 based on feature data generated based on the flow data.
  • the in-session feature generation unit 124 acquires and pools, as paired information, various metadata and the distribution of the feature generation methods that finally proved effective, and by learning from these pairs generates an estimation model 124b that estimates, based on metadata, a method of generating effective feature quantities.
  • the intra-session feature amount generation unit 124 controls this estimation model 124b to estimate the effective feature amount generation method based on the metadata of the extracted flow data.
  • the estimation model 124b estimates a method for generating effective feature amounts based on the metadata of the flow data, which is composed of the series data that the generation source selection unit 123 has selected, because their prediction accuracy is higher than a predetermined accuracy threshold, from among the series data extracted in the output format determined based on the column indicating the time, the column indicating the session unit, and the prediction target column set by the user.
  • in other words, a method for generating effective features using series data with high prediction accuracy is estimated.
  • the information specifying how to use the series data used to generate effective features is, for example, information specifying that categorical series data and numerical series data are used at a predetermined ratio such as 40:60.
  • the information specifying the window setting method is, for example, information specifying that the window WB and window WS in FIG. 12 and the window WSS and window WA in FIG. 13 are used at a ratio such as 50:20:20:10.
  • the information specifying the ratio of each value and the weight setting method for each element of the feature amount is, for example, information on the proportions and weights assigned to Ave(Xn), Min(Xn), Max(Xn), Var(Xn), and Stde(Xn).
  • the in-session feature amount generation unit 124 creates effective feature amounts from each series data using the estimated creation method, generates in-session feature amount data from the created effective feature amounts, and outputs it.
  • the inter-session feature amount generation unit 126 generates inter-session feature data using feature values obtained from the temporal context of the intra-session feature data. That intra-session feature data is composed of features generated, by the effective feature generation method estimated from the metadata of the flow data, from the series data in the flow data whose prediction accuracy PA based on the prediction model PM is higher than a predetermined value.
  • for example, the feature amount from x at-bats earlier for the same batter, the same pitcher, or the same batter and pitcher combination, or a feature covering the entire past, can be treated as an inter-session feature.
  • since turn-at-bat IDs are ordered integer-type data, they can be treated as time-series data that assumes a context between sessions and used as inter-session features.
  • the context may be calculated in units of classes within the grouped sessions.
  • the pitcher ID column can be regarded as a set that clusters the at-bat IDs, which are the session units, so the result column to be predicted (such as a hit or an out) or the average ball speed in the previous at-bat of "the same pitcher" can be used as an inter-session feature amount.
  • the intra-session feature data consists of, from the left, a pitcher ID column that clusters the at-bat ID column, a turn-at-bat ID column that is the session unit, a result column, and an average ball speed column for each at-bat.
  • the feature amount of the previous turn at bat is an inter-session feature amount. Therefore, to the per-at-bat average ball speed column of the intra-session feature amounts shown in the center part of FIG. 16, the inter-session feature amounts add, as shown in the right part of FIG. 16, a column holding the pitcher's previous average ball speed, determined according to the pitcher ID column that clusters the turn-at-bat ID column.
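The addition of a pitcher's previous average ball speed can be sketched as below. The rows reuse the pitcher IDs, at-bat IDs, and per-at-bat averages listed earlier for the pivot format; the function and field names are assumptions, and `None` stands in for a missing value.

```python
# One row per at-bat (session), in at-bat order; values follow the pivot example.
SESSIONS = [
    {"pitcher": "A", "at_bat": 0, "avg_speed": 145},
    {"pitcher": "B", "at_bat": 1, "avg_speed": 115},
    {"pitcher": "A", "at_bat": 2, "avg_speed": 110},
]


def add_pitcher_prev_avg(sessions):
    """Attach, to each session, the average speed of the same pitcher's previous session."""
    last_avg_by_pitcher = {}
    table = []
    for s in sessions:  # assumes rows are already sorted by session order
        row = dict(s)
        row["pitcher_prev_avg_speed"] = last_avg_by_pitcher.get(s["pitcher"])
        last_avg_by_pitcher[s["pitcher"]] = s["avg_speed"]
        table.append(row)
    return table


inter_session_table = add_pitcher_prev_avg(SESSIONS)
```

Here the pitcher ID plays the role of the session set: the lookup crosses session boundaries but stays inside one cluster of sessions.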
  • an example of the session (unit) ID, an example of the time unit, an example of attribute data, an example of time series data, an example of an intra-session feature amount, an example of an inter-session feature amount, and an example of the session set (unit) ID when session units are clustered will be explained for cases in which a hospital vital log, a factory robot log, and a baseball pitching log constitute the flow data of the present disclosure.
  • an example of session (unit) ID is patient ID
  • an example of time unit is date and time
  • an example of attribute data is patient gender and time.
  • An example of series data is a heartbeat signal
  • an example of an intra-session feature is the patient's average heartbeat
  • an example of an inter-session feature is the age of a patient by hospital
  • an example of the session set (unit) ID is the hospital ID.
  • an example of the session (unit) ID is the operation ID
  • an example of the time unit is the date and time
  • an example of the attribute data is the installation location of the robot
  • An example of time-series data is a torque sensor signal
  • an example of an intra-session feature is the average number of stops of a robot on that day
  • an example of an inter-session feature is the total number of stops for each robot
  • an example of the session set (unit) ID is the robot ID.
  • an example of the session (unit) ID is the at-bat ID
  • an example of the time unit is the number of pitches in the at-bat
  • an example of the attribute data is whether the pitcher is left-handed or right-handed
  • an example of time-series data is ball speed
  • an example of an intra-session feature is the average ball speed within an at-bat
  • an example of an inter-session feature is the results of the past three at-bats of the same pitcher
  • the session set (unit) ID is the pitcher ID.
  • the examples of session (unit) IDs, time units, attribute data, time series data, intra-session features, inter-session features, and session set (unit) IDs when session units are clustered are not limited to those shown in FIG. 17.
  • FIG. 18 shows an example of presentation when the flow data is a baseball pitching log.
  • the feature data table is displayed in the upper row, and a graph showing detailed data for a specified part of the in-session feature data in the upper feature data table is displayed in the lower row.
  • a field for displaying the effectiveness score of the entire feature data is provided in the upper right corner of the feature data table; in this example, the effectiveness score for predicting the prediction target is shown as 85 points out of 100.
  • the feature data table includes, from the left, a data ID column, a turn-at-bat ID column as the session unit, a pitcher ID column as attribute data, a result column, a pitch ID column as the time column, and a pitch speed column as the prediction target; as intra-session feature data, a previous-pitch speed column, a column for the average speed of the most recent three pitches, and a previous-pitch type column; and, as inter-session feature data, a column for the pitcher's previous average pitch speed.
  • the data is data for a pitch with a pitch ID of 6 and a pitch with a pitch ID of 1.
  • the user can recognize that, for generating a machine learning model that predicts the ball speed, the speed of the previous pitch, the average speed of the last three pitches, and the pitch type of the previous pitch have been proposed as intra-session feature data, and that the pitcher's previous average pitch speed has been proposed as inter-session feature data.
  • the feature data required to generate a machine learning model can be generated by simply inputting flow data and specifying a column indicating the time, a column indicating the session unit, and a prediction target column for the flow data. becomes possible.
  • when the effectiveness score is low and it is determined that feature data sufficient for generating a machine learning model cannot be obtained even with reference to the presented feature data, the column indicating the time and the column indicating the session unit specified for the flow data may, for example, be changed and the feature data generated again, or other flow data may be used.
  • in step S31, the flow data input unit 101 receives input of flow data and outputs it to the generated feature quantity visualization unit 103 and the data processing unit 62.
  • in step S32, the column estimation unit 121 of the data processing unit 62 analyzes the flow data, estimates the columns that make up the flow data, and outputs the estimation result to the UI control unit 61.
  • in step S33, when the task setting unit 102 obtains the estimation result of the flow data columns, it generates and presents a UI, such as the display image PV described with reference to FIG. 6, that shows the estimation result and prompts for input of a session unit column, a time unit column, and a prediction target as task settings.
  • the task setting unit 102 receives input from the user and outputs the session unit column, time unit column, and prediction target information input as the task settings to the data processing unit 62.
  • the task setting unit 102 further presents on the UI information prompting the user to input the prediction frequency and prediction time for the prediction target column as task settings, accepts that input, and outputs it to the data processing unit 62.
  • in step S34, the output format determining unit 122 determines the output format to be read from the flow data based on the session unit column, time unit column, and prediction target information supplied as the task settings, and outputs it to the generation source selection unit 123.
  • in step S35, the generation source selection unit 123 extracts series data from the flow data according to the output format, executes the generation source selection process to select, from the extracted series data, series data that is highly effective in predicting the prediction target, and outputs the selected series data to the intra-session feature generation unit 124.
  • in step S36, the intra-session feature generation unit 124 executes the intra-session feature generation process, uses the selected series data to generate intra-session feature data, and outputs it to the feature selection unit 125.
  • in step S37, the intra-session feature selection unit 141 of the feature selection unit 125 controls the effectiveness score calculation unit 143 to calculate, for each feature forming the supplied intra-session feature data, the effectiveness score for predicting the prediction target, and has the calculated scores output to the intra-session feature selection unit 141 itself and to the loop determination unit 129.
  • in step S38, the intra-session feature quantity selection unit 141 selects, as effective feature quantities, the feature quantities whose effectiveness scores are higher than a predetermined score threshold from among the feature quantities constituting the intra-session feature quantity data, excludes the other feature quantities, reconstructs the intra-session feature data from the effective feature quantities, and outputs it to the inter-session feature generation unit 126 and the combining unit 127.
  • in step S39, upon acquiring the intra-session feature data supplied from the feature selection unit 125, the inter-session feature generation unit 126 stores it, generates inter-session feature data using it together with other intra-session feature data, and outputs the result to the feature selection unit 125.
  • in step S40, the inter-session feature selection unit 142 of the feature selection unit 125 controls the effectiveness score calculation unit 143 to calculate, for each feature forming the supplied inter-session feature data, the effectiveness score for predicting the prediction target, and has the calculated scores output to the inter-session feature selection unit 142 itself and to the loop determination unit 129.
  • in step S41, the inter-session feature quantity selection unit 142 selects, as effective feature quantities, the feature quantities whose effectiveness scores are higher than a predetermined score threshold from among the feature quantities constituting the inter-session feature quantity data, excludes the other feature quantities, reconstructs the inter-session feature data from the effective feature quantities, and outputs it to the combining unit 127.
  • in step S42, the combining unit 127 combines the intra-session feature data and the inter-session feature data to generate feature data, and stores the generated feature data in the feature data storage 128.
  • in step S43, the loop determination unit 129 calculates the overall effectiveness score of the feature amount data based on the effectiveness score of each feature of the intra-session feature data and the inter-session feature data that correspond to the feature data stored in the feature data storage 128, and determines whether the overall effectiveness score is greater than or equal to a predetermined value or whether the elapsed time from the start of the process has exceeded a predetermined time.
  • if it is determined in step S43 that the overall effectiveness score of the feature data is smaller than the predetermined value and that the elapsed time from the start of the process has not exceeded the predetermined time, the process proceeds to step S44.
  • in step S44, the loop determination unit 129 controls the generation source selection unit 123 and the feature amount selection unit 125 so that the accuracy threshold used in the generation source selection process and the score threshold set for the effectiveness score are each reduced by a predetermined value, and the process returns to step S35 to execute the feature amount data generation process again.
  • that is, when it is determined in step S43 that the overall effectiveness score of the feature data is smaller than the predetermined value and the elapsed time from the start of processing has not exceeded the predetermined time, the accuracy threshold and the score threshold are each reduced by a predetermined value so that the series data and feature amounts excluded so far are also reconsidered, and the feature amount data is generated again.
  • the feature amount data generated up to this point remains stored in the feature amount data storage 128 and remains valid thereafter.
  • feature values that have already been generated as feature data are treated as already generated, and the series data and feature values excluded in the previous processing are restored so that feature data can be generated again. For example, in calculating the effectiveness score in the feature quantity selection unit 125, a machine learning model may be created using the union of the feature quantities stored in the feature data storage 128 and the newly generated feature quantities, and the resulting improvement in accuracy may be calculated as the new effectiveness score.
  • if it is determined in step S43 that the overall effectiveness score of the feature data is equal to or greater than the predetermined value, or that the elapsed time from the start of the process has exceeded the predetermined time, the process proceeds to step S45.
  • in step S45, the loop determination unit 129 reads out, from among the feature data stored in the feature data storage 128, the feature data having the highest overall effectiveness score, outputs it to the UI control unit 61 for presentation to the user, and also outputs it to the machine learning model generation unit 63.
  • the generated feature visualization unit 103 of the UI control unit 61 generates a UI based on the flow data and feature data and presents it to the user.
  • the generated feature visualization unit 103 may also present the current effectiveness score together with an indication that the prediction accuracy may be insufficient with the current feature amount data.
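The control flow of steps S35 to S45 can be sketched as a loop that relaxes both thresholds until the overall score is high enough or a time budget runs out. This is a minimal sketch under stated assumptions: `generate` is a hypothetical callable standing in for steps S35 to S42, and every parameter name is illustrative rather than taken from the disclosure.

```python
import time


def generate_with_relaxation(generate, target_score, accuracy_threshold,
                             score_threshold, step, time_limit_s):
    """Repeat feature generation, lowering both thresholds after each weak result.

    `generate(accuracy_threshold, score_threshold)` is a hypothetical callable
    that returns (feature_data, overall_effectiveness_score).
    """
    started = time.monotonic()
    storage = []  # every generated result is kept, as in the feature data storage
    while True:
        feature_data, overall = generate(accuracy_threshold, score_threshold)
        storage.append((overall, feature_data))
        timed_out = time.monotonic() - started > time_limit_s
        if overall >= target_score or timed_out:  # step S43
            break
        accuracy_threshold -= step  # step S44: admit previously excluded series
        score_threshold -= step     # and previously excluded features
    # step S45: hand over the stored result with the highest overall score
    return max(storage, key=lambda item: item[0])
```

Returning the best stored result, rather than the last one, matters when the loop stops on the time limit and a later, looser pass scored lower than an earlier one.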
  • in step S71, the generation source selection unit 123 regards, among the series data extracted from the flow data based on the output format, time series data that does not change over time as irrelevant to the prediction of the prediction target, and excludes such series data.
  • in step S72, the generation source selection unit 123 acquires partial series for each series data and creates a feature amount table consisting of predetermined statistics.
  • in step S73, the generation source selection unit 123 generates, for each series data, a prediction model that predicts the prediction target based on the feature table.
  • in step S74, the generation source selection unit 123 calculates, for each series data, the prediction accuracy of the prediction result based on the prediction model.
  • in step S75, the generation source selection unit 123 selects the series data whose prediction accuracy based on the prediction model is higher than a predetermined accuracy threshold as the generation source of the intra-session feature amount, and outputs it to the intra-session feature amount generation unit 124.
  • through the above processing, series data that is highly effective for predicting the prediction target can be selected as the series data from which the intra-session feature values are generated and output to the intra-session feature amount generation unit 124.
  • in step S91, the intra-session feature generation unit 124 controls the metadata extraction unit 124a to extract and generate metadata from the flow data.
  • in step S92, the intra-session feature generation unit 124 uses the estimation model 124b to estimate a method for creating effective features from the metadata.
  • the intra-session feature quantity generation unit 124 generates feature amounts, based on the creation method estimated by the estimation model 124b, from the series data supplied from the generation source selection unit 123 as the generation source of the intra-session feature quantities, generates in-session feature amount data from the generated feature amounts, and outputs it to the feature amount selection unit 125.
  • through the above processing, the generation source selection unit 123 selects, from among the series data extracted from the flow data, the series data effective for predicting the prediction target as the generation source of the feature quantities constituting the intra-session feature data, and effective feature amounts are then generated from it using the generation method estimated from the metadata of the flow data.
  • the feature selection unit 125 further calculates the effectiveness score of each feature constituting the in-session feature data, selects only the features whose effectiveness scores are higher than a predetermined score threshold, and reconstructs the intra-session feature data.
  • inter-session feature data is generated based on this intra-session feature data, and the inter-session feature data is reconstructed from those of its features whose effectiveness scores are higher than a predetermined score threshold.
  • in other words, intra-session feature data consisting of features with high effectiveness scores for predicting the prediction target is generated, inter-session feature data is generated based on that intra-session feature data, and features are then further selected based on their effectiveness scores to generate the final inter-session feature data.
  • because the intra-session feature data and the inter-session feature data generated in this way are combined to generate feature data, feature data that is highly effective in predicting the prediction target can be generated.
  • the accuracy threshold and score threshold are set smaller by predetermined values, and the feature quantity data is generated again.
  • <Modified example> By clustering sessions, for example, a superset (upper class) of the set sessions may be created.
  • sessions may be clustered in advance, a superset of the session may be set, and a session may be set for each superordinate set.
  • TF-IDF (Term Frequency-Inverse Document Frequency)
  • session FW1 is regarded as a set consisting of characteristic partial waveforms PW1-1, PW2-1, PW3-1
  • session FW2 is regarded as a set consisting of characteristic partial waveforms PW1-11, PW3-11, PW3-12
  • Session FW3 is regarded as a set consisting of characteristic partial waveforms PW2-21 and PW1-21, each of which is discretized, and TF-IDF is performed on the partial waveforms.
  • the TF-IDF values of (PW1, PW2, PW3) for session FW1 are (0, 0.1353, 0.1353)
  • the TF-IDF values of (PW1, PW2, PW3) for session FW2 are (0, 0, 0.2706)
  • the TF-IDF values of (PW1, PW2, PW3) for session FW3 are (0, 0.2050, 0).
  • sessions with a high degree of similarity may be placed in the same class, and a superset may be set.
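The TF-IDF comparison of sessions can be sketched as follows. Each session is treated as a document whose words are the discretized partial waveforms, following the sets listed above. The text does not state which TF-IDF variant is used, so this sketch assumes relative term frequency and a natural-log inverse document frequency; it produces values close to, but not identical with, the listed ones. Cosine similarity is likewise an assumed choice of similarity measure.

```python
import math
from collections import Counter

# Discretized partial waveforms per session, as listed above.
SESSIONS = {
    "FW1": ["PW1", "PW2", "PW3"],
    "FW2": ["PW1", "PW3", "PW3"],
    "FW3": ["PW2", "PW1"],
}


def tf_idf(sessions):
    """Return (vocab, {session: vector}) with tf = count/len and idf = ln(N/df)."""
    vocab = sorted({token for tokens in sessions.values() for token in tokens})
    n = len(sessions)
    df = {t: sum(1 for tokens in sessions.values() if t in tokens) for t in vocab}
    vectors = {}
    for name, tokens in sessions.items():
        counts = Counter(tokens)
        vectors[name] = [counts[t] / len(tokens) * math.log(n / df[t])
                         for t in vocab]
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


vocab, vectors = tf_idf(SESSIONS)
similarity_12 = cosine(vectors["FW1"], vectors["FW2"])
similarity_23 = cosine(vectors["FW2"], vectors["FW3"])
```

Under these assumptions, PW1 gets weight 0 everywhere because it occurs in every session, so FW1 and FW2 come out similar through their shared PW3 while FW2 and FW3 share no weighted waveform.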
  • when the turn at bat ID is set as the session, with one line per session as shown in the left part of FIG. 23, metadata is extracted from the flow data, and, based on the extracted metadata, the at-bat IDs that are the sessions may be clustered according to statistics of the attribute data, such as the frequency of pitcher IDs, to group the sessions and create a new session superset column (the cluster ID column in the figure).
  • in this example, the turns at bat classified by the turn at bat ID, which is the session, are clustered by opposing pitcher using the pitcher ID extracted as metadata of the flow data, and the cluster IDs A, B, and A are assigned from top to bottom. That is, here, the cluster ID corresponds to the pitcher ID.
  • <Example of execution using software> The series of processes described above can be executed by hardware, but can also be executed by software.
  • when the series of processes is executed by software, the programs that make up the software are installed from a recording medium onto a computer built into dedicated hardware, or onto, for example, a general-purpose computer that can execute various functions by installing various programs.
  • FIG. 24 shows an example of the configuration of a general-purpose computer.
  • This computer has a built-in CPU (Central Processing Unit) 1001.
  • An input/output interface 1005 is connected to the CPU 1001 via a bus 1004.
  • a ROM (Read Only Memory) 1002 and a RAM (Random Access Memory) 1003 are connected to the bus 1004 .
  • connected to the input/output interface 1005 are an input unit 1006 consisting of input devices such as a keyboard and mouse with which the user inputs operation commands, an output unit 1007 that outputs processing operation screens and images of processing results to a display device, a storage unit 1008 consisting of a hard disk drive or the like that stores programs and various data, and a communication unit 1009 consisting of a LAN (Local Area Network) adapter or the like that executes communication processing via a network typified by the Internet.
  • also connected is a drive 1010 that reads and writes data to and from a removable storage medium 1011 such as a magnetic disk (including a flexible disk), an optical disk (including a CD-ROM (Compact Disc-Read Only Memory) and a DVD (Digital Versatile Disc)), a magneto-optical disk (including an MD (Mini Disc)), or a semiconductor memory.
  • the CPU 1001 executes various processes according to a program stored in the ROM 1002, or a program that is read from a removable storage medium 1011 such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory, installed in the storage unit 1008, and loaded from the storage unit 1008 into the RAM 1003.
  • the RAM 1003 also appropriately stores data necessary for the CPU 1001 to execute various processes.
  • the above-described series of processes is performed by the CPU 1001, for example, loading a program stored in the storage unit 1008 into the RAM 1003 via the input/output interface 1005 and the bus 1004 and executing it.
  • a program executed by the computer (CPU 1001) can be provided by being recorded on a removable storage medium 1011 such as a package medium, for example. Additionally, programs may be provided via wired or wireless transmission media, such as local area networks, the Internet, and digital satellite broadcasts.
  • a program can be installed in the storage unit 1008 via the input/output interface 1005 by attaching the removable storage medium 1011 to the drive 1010. Further, the program can be received by the communication unit 1009 via a wired or wireless transmission medium and installed in the storage unit 1008. Other programs can be installed in the ROM 1002 or the storage unit 1008 in advance.
  • the program executed by the computer may be a program in which processing is performed chronologically in the order described in this specification, or a program in which processing is performed in parallel or at necessary timing, such as when a call is made.
  • the CPU 1001 in FIG. 24 realizes the functions of the control unit 51 of the information processing device 31 in FIG. 3.
  • a system refers to a collection of multiple components (devices, modules (components), etc.), regardless of whether all the components are located in the same casing. Therefore, multiple devices housed in separate casings and connected via a network, and a single device with multiple modules housed in one casing, are both systems.
  • the present disclosure can take a cloud computing configuration in which one function is shared and jointly processed by multiple devices via a network.
  • each step described in the above flowchart can be executed by one device or can be shared and executed by multiple devices.
  • furthermore, when one step includes multiple processes, the multiple processes included in that one step can be executed by one device or shared and executed by multiple devices.
  • <1> An information processing device comprising: a metadata generation unit that generates metadata of flow data including at least time-series data; an estimation unit that estimates, based on the metadata, a method of generating feature quantities from series data constituting the flow data; and a feature generation unit that generates feature quantities from the series data using the generation method estimated by the estimation unit.
  • <3> The information processing device according to <2>, further comprising a column estimation unit that estimates the columns constituting the flow data, wherein the setting unit generates and presents a UI (User Interface) image that presents the columns estimated by the column estimation unit and prompts setting, in units of columns, of the session unit, the time unit, and the prediction target column in the flow data, and receives the settings of the session unit, the time unit, and the prediction target column based on the UI image.
  • <4> The information processing device according to <2>, further comprising an output format determination unit that determines an output format of series data extracted from the flow data based on the session unit, the time unit, and the prediction target in the flow data set by the setting unit, wherein the metadata generation unit generates the metadata from the series data extracted from the flow data based on the output format determined according to the settings of the session unit, the time unit, and the prediction target.
  • <5> The information processing device, further comprising a selection unit that determines the prediction accuracy for predicting the prediction target for each piece of series data extracted from the flow data based on the output format, and selects the series data whose prediction accuracy is higher than a predetermined accuracy threshold, wherein the metadata generation unit generates the metadata from the series data selected by the selection unit from among the series data extracted from the flow data based on the output format.
  • <6> The information processing device according to <5>, wherein the selection unit calculates a feature quantity for each partial sequence of each piece of series data extracted from the flow data based on the output format, predicts the prediction target by inputting the feature quantity into a prediction model for predicting the prediction target, calculates the prediction accuracy for each piece of series data from a comparison of the prediction target and the prediction result of the prediction model, and selects the series data whose prediction accuracy is higher than the predetermined accuracy threshold.
  • <7> The information processing device, wherein the feature generation unit generates feature quantities from the series data using the generation method estimated by the estimation unit, and generates an intra-session feature quantity based on the generated feature quantities for each session.
  • <8> The information processing device according to <7>, further comprising: an effectiveness score calculation unit that calculates an effectiveness score for the prediction of the prediction target for each of the feature quantities constituting the intra-session feature quantity; and an intra-session feature selection unit that selects, from among the feature quantities constituting the intra-session feature quantity, the feature quantities whose effectiveness score is higher than a predetermined score threshold, and reconstructs the intra-session feature quantity.
  • <9> The information processing device, further comprising an inter-session feature generation unit that generates an inter-session feature quantity based on the intra-session feature quantity.
  • <10> The information processing device according to <9>, wherein the effectiveness score calculation unit calculates the effectiveness score for the prediction of the prediction target for each of the feature quantities constituting the inter-session feature quantity, the information processing device further comprising an inter-session feature selection unit that selects, from among the feature quantities constituting the inter-session feature quantity, the feature quantities whose effectiveness score is higher than a predetermined score threshold, and reconstructs the inter-session feature quantity.
  • <11> The information processing device according to <10>, wherein the effectiveness score calculation unit calculates, as the effectiveness score, the mutual information between the prediction target and each of the feature quantities constituting the intra-session feature quantity and the inter-session feature quantity.
  • <12> The information processing device, wherein the effectiveness score calculation unit calculates, as the effectiveness score, the prediction accuracy for predicting the prediction target using a machine learning model generated in a simplified manner based on the feature quantities constituting the intra-session feature quantity and the inter-session feature quantity, and the intra-session feature selection unit selects a subset of the feature quantities for which the effectiveness score does not fall below a predetermined score threshold, and reconstructs the intra-session feature quantity.
  • <13> The information processing device according to <10>, further comprising: a combining unit that combines the reconstructed intra-session feature quantity and the reconstructed inter-session feature quantity; and a determination unit that calculates an overall effectiveness score of the feature quantities combined by the combining unit, based on the effectiveness scores of the respective feature quantities of the reconstructed intra-session feature quantity and the reconstructed inter-session feature quantity, and determines whether the overall effectiveness score is smaller than a predetermined threshold, wherein, when the overall effectiveness score is smaller than the predetermined threshold, the determination unit reduces the score threshold by a predetermined value and causes the processing by the intra-session feature selection unit and the inter-session feature selection unit to be executed again.
  • <14> The information processing device, wherein the estimation unit is an estimation model generated by learning based on pairs of information, each pair consisting of the metadata of flow data and a distribution of the feature generation methods used for training a predetermined machine learning model with feature quantities generated from the series data extracted from that flow data, and estimates the method of generating the feature quantities based on the metadata.
  • <15> The information processing device according to any one of the above, wherein the flow data further includes attribute data consisting of data that does not change over time.
  • <16> An information processing method comprising the steps of: generating metadata of flow data including at least time-series data; estimating, based on the metadata, a method of generating feature quantities from series data constituting the flow data; and generating feature quantities from the series data using the estimated generation method.
  • <17> A program that causes a computer to function as: a metadata generation unit that generates metadata of flow data including at least time-series data; an estimation unit that estimates, based on the metadata, a method of generating feature quantities from series data constituting the flow data; and a feature generation unit that generates feature quantities from the series data using the generation method estimated by the estimation unit.
  • 31 Information processing device, 61 UI control unit, 62 Data processing unit, 63 Machine learning model generation unit, 101 Flow data input unit, 102 Task setting unit, 103 Generated feature visualization unit, 121 Column estimation unit, 122 Output format determination unit, 123 Generation source selection unit, 124 Intra-session feature generation unit, 124a Metadata extraction unit, 124b Estimation model, 125 Feature selection unit, 126 Inter-session feature generation unit, 127 Combining unit, 128 Feature data storage, 129 Loop determination unit
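
The effectiveness-score-based selection described in the items above (mutual information as the score, selection against a score threshold, and lowering the threshold when the overall score is too small) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the function names, the use of discrete feature values, the definition of the overall score as the sum of the selected scores, the step size, and the stop condition at a threshold of zero are all hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def select_features(features, target, score_threshold, overall_threshold, step=0.1):
    """Keep features scoring above the threshold; relax it if too little survives."""
    scores = {name: mutual_information(col, target) for name, col in features.items()}
    while True:
        selected = {n: s for n, s in scores.items() if s > score_threshold}
        overall = sum(selected.values())  # assumption: overall score = sum of scores
        # Assumption: stop once the overall score suffices or the threshold hits zero.
        if overall >= overall_threshold or score_threshold <= 0:
            return selected, score_threshold
        score_threshold -= step  # lower the score threshold and select again


target = [0, 0, 1, 1]
features = {
    "informative": [0, 0, 1, 1],  # identical to the target
    "partial": [0, 0, 0, 1],      # partly related to the target
    "noise": [0, 1, 0, 1],        # independent of the target
}
selected, used_threshold = select_features(
    features, target, score_threshold=0.9, overall_threshold=1.2
)
```

In this toy run the initial threshold keeps only one feature, so the threshold is lowered step by step until a second feature is admitted and the overall score reaches the required level.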

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to an information processing device, an information processing method, and a program which make it possible to efficiently find and extract, from time-series data, feature quantities effective in creating a machine learning model. The present disclosure involves: generating metadata from flow data including at least time-series data; estimating a method for generating feature quantities from series data constituting the flow data, on the basis of the generated metadata; and generating feature quantities from the series data using the estimated generation method. The present disclosure can be applied to technology for generating feature quantities necessary to train machine learning models.

Description

Information processing device, information processing method, and program
 The present disclosure relates to an information processing device, an information processing method, and a program, and in particular to an information processing device, an information processing method, and a program that make it possible to efficiently search for and extract, from time-series data, feature quantities effective in creating a machine learning model.
 In the Internet of Things (IoT), data sets consisting of multiple pieces of time-series data are increasingly being accumulated.
 On the other hand, building machine learning models and causal models from such data requires a high level of expertise, so there is demand for tools that allow even people with limited expertise to build models.
 Therefore, a technique has been proposed that, when generating a machine learning model that predicts the occurrence of a target event from time-series data, determines the reference date of negative-example time-series data while keeping the amount of input data for machine learning from becoming enormous (see Patent Document 1).
Japanese Unexamined Patent Application Publication No. 2021-189833
 However, in the technique of Patent Document 1, the problem settings are limited, and the preprocessing task of generating feature quantities is very laborious.
 The present disclosure has been made in view of such circumstances, and in particular makes it possible to efficiently search for and extract, from time-series data, feature quantities effective in creating a machine learning model.
 An information processing device and a program according to one aspect of the present disclosure are an information processing device and a program including: a metadata generation unit that generates metadata of flow data including at least time-series data; an estimation unit that estimates, based on the metadata, a method of generating feature quantities from the series data constituting the flow data; and a feature generation unit that generates feature quantities from the series data using the generation method estimated by the estimation unit.
 An information processing method according to one aspect of the present disclosure is an information processing method including the steps of: generating metadata of flow data including at least time-series data; estimating, based on the metadata, a method of generating feature quantities from the series data constituting the flow data; and generating feature quantities from the series data using the estimated generation method.
 In one aspect of the present disclosure, metadata of flow data including at least time-series data is generated, a method of generating feature quantities from the series data constituting the flow data is estimated based on the metadata, and feature quantities are generated from the series data using the estimated generation method.
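
 The three stages summarized above (metadata generation, estimation of a generation method, and feature generation) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the metadata fields, the aggregation names, and in particular the rule-based `estimate_generation_methods`, which stands in for the estimation unit, are hypothetical and are not taken from the disclosure.

```python
from statistics import mean, pstdev


def generate_metadata(series):
    """Summarise one series, given as a list of (time, value) pairs, as metadata."""
    times = [t for t, _ in series]
    values = [v for _, v in series]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        "length": len(series),
        "is_numeric": all(isinstance(v, (int, float)) for v in values),
        "evenly_spaced": len(set(gaps)) <= 1,
    }


def estimate_generation_methods(metadata):
    """Hypothetical rule-based stand-in for the estimation unit."""
    if not metadata["is_numeric"]:
        return ["count_distinct"]
    methods = ["mean", "last"]
    if metadata["length"] >= 3:
        methods.append("std")
    return methods


# Hypothetical catalogue of feature generation methods.
AGGREGATIONS = {
    "mean": lambda vs: mean(vs),
    "std": lambda vs: pstdev(vs),
    "last": lambda vs: vs[-1],
    "count_distinct": lambda vs: len(set(vs)),
}


def generate_features(series, methods):
    """Apply only the estimated generation methods to the series values."""
    values = [v for _, v in series]
    return {m: AGGREGATIONS[m](values) for m in methods}


# Unevenly sampled heart-rate readings as (time, value) pairs.
heart_rate = [(0, 62), (5, 64), (12, 70), (20, 68)]
meta = generate_metadata(heart_rate)
features = generate_features(heart_rate, estimate_generation_methods(meta))
```

 The point of the sketch is the ordering: only the generation methods chosen from the metadata are computed, rather than every method in the catalogue.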
 FIG. 1 is a diagram illustrating flow data of the present disclosure.
 FIG. 2 is a diagram illustrating examples of session units, time units, attribute data, and time-series data in flow data.
 FIG. 3 is a hardware block diagram illustrating a configuration example of the information processing device of the present disclosure.
 FIG. 4 is a functional block diagram illustrating functions realized by the UI control unit, the data processing unit, and the machine learning model generation unit in FIG. 3.
 FIG. 5 is a diagram illustrating a configuration example of attribute data and time-series data in flow data.
 FIG. 6 is a diagram illustrating an example of a display image of a UI that prompts setting of the session-unit column, the time-unit column, and the prediction target column in flow data.
 FIG. 7 is a diagram illustrating an example of a melt format as an output format.
 FIG. 8 is a diagram illustrating an example of a pivot format as an output format.
 FIG. 9 is a diagram illustrating an example of the output format when the ball speed column is set as the prediction target, based on flow data of a pitching log for a given baseball batter.
 FIG. 10 is a diagram illustrating an example of the output format when the result column is set as the prediction target, based on flow data of a pitching log for a given baseball batter.
 FIG. 11 is a diagram illustrating a method of generating feature quantities of time-series data.
 FIG. 12 is a diagram illustrating an example of setting a window used for generating feature quantities.
 FIG. 13 is a diagram illustrating another example of setting a window used for generating feature quantities.
 FIG. 14 is a diagram illustrating selection of the series data from which feature quantities are generated.
 FIG. 15 is a diagram illustrating an example of generating intra-session feature data.
 FIG. 16 is a diagram illustrating an example of generating inter-session feature data.
 FIG. 17 is a diagram illustrating examples of the session ID, time unit, attribute data, time-series data, intra-session feature quantity, inter-session feature quantity, and session set ID in the flow data of the present disclosure.
 FIG. 18 is a diagram illustrating an example of presenting feature data.
 FIG. 19 is a flowchart illustrating feature data generation processing.
 FIG. 20 is a flowchart illustrating generation source selection processing.
 FIG. 21 is a flowchart illustrating intra-session feature data generation processing.
 FIG. 22 is a diagram illustrating a modification in which sessions are clustered.
 FIG. 23 is a diagram illustrating a modification in which sessions are clustered.
 FIG. 24 shows a configuration example of a general-purpose computer.
 Preferred embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. Note that, in this specification and the drawings, components having substantially the same functional configuration are denoted by the same reference numerals, and redundant description is omitted.
 Hereinafter, modes for carrying out the present technology will be described. The description will be given in the following order.
 1. Overview of the present disclosure
 2. Preferred embodiment
 3. Example of execution by software
 <<1. Overview of the present disclosure>>
 <Flow data>
 In particular, the present disclosure makes it possible to efficiently search for and extract, from time-series data, feature quantities effective for creating a machine learning model.
 In this specification, a data set consisting of multiple pieces of time-series data is referred to as flow data, and a technique for efficiently searching for and extracting, from flow data, feature quantities effective for creating a machine learning model is described.
 First, the terms used in this specification are defined.
 Flow data is a data set in which one or more pieces of time-series data are mandatory and one or more pieces of attribute data are optional. That is, flow data always includes at least one piece of time-series data, whereas it may include no attribute data or multiple pieces of attribute data.
 Here, time-series data is data that changes over time, and attribute data is data that does not change over time.
 For example, in the case of vital signals acquired from sensors attached to patients in a hospital, the heart rate, the respiratory rate per unit time, the operation log of the measuring device, and the like are time-series data, and the gender, weight, and the like of each patient being measured are attribute data.
 When these time-series data and attribute data are accumulated for each patient, flow data with one patient as the set unit is formed.
 Similarly, when the operating state of a robot arm used in a factory is measured with sensors, the sensor data acquired from the robot arm is time-series data, and the number of failures of each individual arm and the like are attribute data.
 When these time-series data and attribute data are accumulated for each robot arm, flow data with one robot arm as the set unit is formed.
 Furthermore, when the pitching history of baseball games is accumulated, the speed and the like of each ball pitched within an at-bat are time-series data, and information on the pitcher and the batter is attribute data.
 When these time-series data and attribute data are accumulated for each at-bat, flow data with one at-bat as the set unit is formed.
 That is, as shown in FIG. 1, flow data consists of time-series data, such as data Dt1 and Dt2, measured in time series at the timings indicated by circles on the time axis indicated by the arrows, and attribute data, such as data Da1 (for example, the gender and weight of the person being measured) and data Da2 (for example, the device name and setting values of the measuring device).
 Note that FIG. 1 shows an example for a hospital in which the data Dt1 and Dt2 consisting of a patient's vital signals are the time-series data, and the data Da1 on the patient's gender and weight and the data Da2 on the device name and setting values of the measuring device are the attribute data.
 Further, as shown by the data Dt1, the time intervals between the individual samples of time-series data indicated by the circles may be uneven, as shown by the intervals T1 and T2, or may be even (not shown).
 Furthermore, when flow data consisting of time-series data and attribute data forms one set, for example per patient, per measuring device, or per setting value, this set unit is referred to as a session. FIG. 1 shows that a set of flow data formed under a predetermined condition is a session SS.
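
 The structure defined above (at least one time series is mandatory, attribute data is optional, sampling intervals may be uneven, and one set forms a session) can be expressed as a small data structure. The following is a minimal sketch; the class name `Session`, its field names, and the sample values are assumptions introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """One session of flow data, e.g. one patient, one robot arm, or one at-bat."""

    session_id: str
    time_series: dict  # series name -> list of (time, value) pairs
    attributes: dict = field(default_factory=dict)  # time-invariant data, optional

    def __post_init__(self):
        # Flow data must contain at least one time series.
        if not self.time_series:
            raise ValueError("flow data requires at least one time series")

    def intervals(self, name):
        """Time gaps between consecutive samples; they need not be equal."""
        times = [t for t, _ in self.time_series[name]]
        return [b - a for a, b in zip(times, times[1:])]


# A session with attribute data and an unevenly sampled series.
patient = Session(
    session_id="patient-001",
    time_series={"heart_rate": [(0, 62), (5, 64), (12, 70)]},
    attributes={"sex": "F", "weight_kg": 54.0},
)

# Attribute data may be absent altogether.
no_attributes = Session("patient-002", {"heart_rate": [(0, 75), (10, 77)]})
```

 Here `patient.intervals("heart_rate")` yields two different gaps, which corresponds to the uneven intervals T1 and T2 in FIG. 1.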
 Various prediction targets are then predicted based on flow data consisting of multiple sessions SS.
 FIG. 2 summarizes examples of the session unit, the time unit, attribute data, and time-series data for cases where a hospital vital log, a factory robot log, and a baseball pitching log each constitute flow data.
 When the flow data is a hospital vital log, an example of the session unit is a patient, an example of the time unit is the date and time, an example of attribute data is the patient's gender, and an example of time-series data is a heartbeat signal.
 When the flow data is a factory robot log, an example of the session unit is a robot, an example of the time unit is the date and time, an example of attribute data is the number of failures of the robot, and an example of time-series data is a torque sensor signal.
 When the flow data is a baseball pitching log, an example of the session unit is an at-bat, an example of the time unit is the pitch count within the at-bat, an example of attribute data is whether the pitcher throws left-handed or right-handed, and an example of time-series data is the ball speed.
 In this way, flow data exists in various forms and can be expected to be generated in large quantities as IoT becomes more widespread.
 When a machine learning model is used to make predictions on flow data, feature quantities for the machine learning model must be created from the flow data. However, creating feature quantities that contribute to prediction accuracy (feature engineering) takes considerable time and effort.
 More specifically, general users understand the target series they want to predict and the time information, but often do not know how to process the data in order to build a machine learning model for the task they want to perform.
 Several tools that generate feature quantities for machine learning models have been proposed, but they place constraints on the prediction target, for example requiring time-series data to be evenly spaced or supporting only prediction of future values of time-series data, and so often cannot cover the predictions the user wants to make.
 Furthermore, because flow data is often enormous and spans multiple time series, there is a limit to how well a user can understand the relationships within the data set, which makes creating feature quantities that take the relationships between series into account difficult or laborious.
 Conversely, creating feature quantities by brute force without any prior knowledge produces useless feature quantities and incurs unnecessary computational cost.
 Therefore, in the present disclosure, the user provides only a minimum of settings for the flow data, so that feature quantities effective for a wide range of tasks can be generated easily.
 More specifically, in the present disclosure, when the user inputs the column indicating time, the column indicating the session unit, and the prediction target column in the flow data, feature quantities that are effective for, for example, predicting future values of time-series data, predicting whether a specific event will occur in time-series data, and predicting data that is not time-series (does not change with time) can be generated within a realistic amount of time.
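
 As a rough illustration of how these three user inputs might be applied to a flat log, the sketch below groups rows by the session column, orders each group by the time column, and separates the prediction target from the remaining columns. The function `build_sessions`, the dictionary layout, and the column names of the sample pitching log are hypothetical and serve only to make the idea concrete.

```python
from collections import defaultdict


def build_sessions(rows, session_col, time_col, target_col):
    """Group flat log rows into per-session series ordered by the time column."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[session_col]].append(row)

    sessions = {}
    for session_id, members in grouped.items():
        members.sort(key=lambda r: r[time_col])
        sessions[session_id] = {
            "times": [r[time_col] for r in members],
            "target": [r[target_col] for r in members],
            # Every remaining column is a candidate source of feature quantities.
            "inputs": [
                {k: v for k, v in r.items()
                 if k not in (session_col, time_col, target_col)}
                for r in members
            ],
        }
    return sessions


# Hypothetical pitching log: session = at-bat, time = pitch number, target = speed.
pitch_log = [
    {"at_bat": 1, "pitch_no": 2, "speed_kmh": 148, "pitch_type": "slider"},
    {"at_bat": 1, "pitch_no": 1, "speed_kmh": 151, "pitch_type": "fastball"},
    {"at_bat": 2, "pitch_no": 1, "speed_kmh": 139, "pitch_type": "curve"},
]
sessions = build_sessions(pitch_log, "at_bat", "pitch_no", "speed_kmh")
```

 Changing only the `target_col` argument, for example to a result column, changes the prediction task without any other change to the input data.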
 <<2. Preferred embodiment>>
 <Configuration example of the information processing device of the present disclosure>
 Next, a configuration example of the information processing device of the present disclosure will be described with reference to FIG. 3.
 The information processing device 31 includes a control unit 51, an input unit 52, an output unit 53, a storage unit 54, a communication unit 55, a drive 56, and a removable storage medium 57, which are connected to one another via a bus 58 and can exchange data and programs.
 The control unit 51 is composed of a processor and a memory, and controls the overall operation of the information processing device 31. The control unit 51 also includes a UI control unit 61, a data processing unit 62, and a machine learning model generation unit 63.
 Upon receiving input of flow data, the UI control unit 61 generates a UI (User Interface) that prompts input of the task settings, namely the column indicating time, the column indicating the session unit, and the column to be predicted, and presents it by controlling the display unit 71 and the audio output unit 72 of the output unit 53.
 The UI control unit 61 then accepts the task settings input by the user operating the input unit 52 in response, and outputs them to the data processing unit 62 together with the input flow data.
 The UI control unit 61 also controls the display unit 71 and the audio output unit 72 of the output unit 53 to present the information on the feature quantities generated by the data processing unit 62 to the user.
 The data processing unit 62 acquires the flow data and the task settings supplied from the UI control unit 61, generates, as feature data, feature quantities that are effective in generating a machine learning model (hereinafter also referred to as effective feature quantities), and outputs them to the UI control unit 61 and the machine learning model generation unit 63.
 The machine learning model generation unit 63 generates a machine learning model based on the feature data consisting of the effective feature quantities supplied from the data processing unit 62.
 Details of the functions realized by the UI control unit 61 and the data processing unit 62 will be described later with reference to the functional block diagram of FIG. 4.
The input unit 52 consists of input devices such as a keyboard, a mouse, and a touch panel with which the user enters operation commands, and supplies the various input signals to the control unit 51.
The output unit 53 is controlled by the control unit 51 and includes the display unit 71 and the audio output unit 72. The output unit 53 outputs operation screens and images of processing results to the display unit 71, which is a display device such as an LCD (Liquid Crystal Display) or an organic EL (Electro Luminescence) display. The output unit 53 also controls the audio output unit 72, which is an audio output device, to play back various voices, music, sound effects, and the like.
The storage unit 54 consists of an HDD (Hard Disk Drive), an SSD (Solid State Drive), a semiconductor memory, or the like, and is controlled by the control unit 51 to write and read various data and programs.
The communication unit 55 is controlled by the control unit 51, performs wired or wireless communication typified by LAN (Local Area Network) and Bluetooth (registered trademark), and sends and receives various data and programs to and from various devices via a network as necessary.
The drive 56 reads and writes data from and to a removable storage medium 57 such as a magnetic disk (including a flexible disk), an optical disc (including a CD-ROM (Compact Disc-Read Only Memory) and a DVD (Digital Versatile Disc)), a magneto-optical disc (including an MD (Mini Disc)), or a semiconductor memory.
<Functions realized by the UI control unit and data processing unit>
Next, the functions realized by the UI control unit 61 and the data processing unit 62 will be described with reference to the functional block diagram of FIG. 4.
The UI control unit 61 includes a flow data input unit 101, a task setting unit 102, and a generated feature visualization unit 103.
The flow data input unit 101 accepts flow data from at least one of operation input via the input unit 52, the storage unit 54, the communication unit 55, and the removable storage medium 57 via the drive 56, and outputs the flow data to the data processing unit 62 and the generated feature visualization unit 103.
When the task setting unit 102 acquires the column estimation result for the flow data supplied from the data processing unit 62, it generates a UI showing the column estimation result, adds to that UI information prompting the user to enter, as task settings, the column indicating time, the column indicating the session unit, and the prediction target column, and presents the UI by controlling the display unit 71 and the audio output unit 72 of the output unit 53. The task setting unit 102 may additionally prompt the user to enter the prediction frequency and the prediction time for the prediction target column as task settings.
When the user, prompted by this UI, operates the input unit 52, the task setting unit 102 outputs to the data processing unit 62 the task setting information, which consists of the column indicating time, the column indicating the session unit, and the prediction target, plus, where needed, the prediction frequency and the prediction time for the prediction target column.
The task settings will be described in detail later with reference to FIGS. 5 and 6.
When the generated feature visualization unit 103 acquires the flow data supplied from the flow data input unit 101 and the feature data, consisting of effective features, supplied from the data processing unit 62, it visualizes them as a UI and presents the UI by controlling the display unit 71 and the audio output unit 72 of the output unit 53.
An example of how the generated feature visualization unit 103 presents feature data will be described in detail later with reference to FIG. 13.
The data processing unit 62 includes a column estimation unit 121, an output format determination unit 122, a generation source selection unit 123, an intra-session feature generation unit 124, a feature selection unit 125, an inter-session feature generation unit 126, a combining unit 127, a feature data storage 128, and a loop determination unit 129.
The column estimation unit 121 analyzes the data format of the flow data supplied from the UI control unit 61, estimates which columns may serve as the column indicating time, the column indicating the session unit, and so on, and outputs the result to the UI control unit 61 as the column estimation result.
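The column estimation step can be illustrated with a minimal sketch. The text does not say how the column estimation unit 121 makes its estimate, so the heuristic below is purely an assumption for illustration: a column shared by both tables is treated as a session candidate, and a numeric column that increases strictly within every session is treated as a time candidate. The function name `estimate_columns`, the English column names, and the table layout are hypothetical; the sample values are those of the baseball example described for FIG. 5.

```python
from typing import Dict, List

Table = Dict[str, list]  # column name -> list of cell values


def estimate_columns(attribute: Table, time_series: Table) -> Dict[str, List[str]]:
    """Guess session and time column candidates (hypothetical heuristic)."""
    # Assumption: a session column is one that appears in both tables.
    session_candidates = [name for name in time_series if name in attribute]

    time_candidates = []
    for name, values in time_series.items():
        if name in session_candidates:
            continue
        if not all(isinstance(v, (int, float)) for v in values):
            continue
        for session in session_candidates:
            # Group the values by session key, preserving row order.
            groups: Dict[object, list] = {}
            for key, value in zip(time_series[session], values):
                groups.setdefault(key, []).append(value)
            # Assumption: a time column increases strictly inside every session.
            if all(a < b for g in groups.values() for a, b in zip(g, g[1:])):
                time_candidates.append(name)
                break
    return {"session": session_candidates, "time": time_candidates}


attribute = {"pitcher_id": ["A", "B", "A"], "at_bat_id": [0, 1, 2],
             "result": ["hit", "out", "out"]}
time_series = {"at_bat_id": [0, 0, 0, 1, 1, 2, 2, 2],
               "pitch_id": [0, 1, 2, 0, 1, 0, 1, 2],
               "speed_kmh": [140, 150, 120, 120, 110, 90, 130, 155]}
estimate = estimate_columns(attribute, time_series)
```

With this sample, the at-bat ID column is proposed as the session column and the pitch ID column as the time column; the user would still confirm or override the proposal through the task setting UI.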
The output format determination unit 122 determines the output format of the flow data based on the task settings supplied from the task setting unit 102 of the UI control unit 61, namely the column indicating time, the column indicating the session unit, and the prediction target column, and outputs the output format to the generation source selection unit 123.
If the prediction frequency and the prediction time for the prediction target column are also supplied as task settings, the output format determination unit 122 determines an output format that takes them into account as well.
The determination of the output format will be described in detail later with reference to FIGS. 7 to 10.
The generation source selection unit 123 selectively extracts from the flow data, in accordance with the output format supplied from the output format determination unit 122, the series data from which features are to be generated, and outputs the result to the intra-session feature generation unit 124.
The process of selectively extracting the series data from which features are generated will be described in detail later with reference to FIG. 14.
The intra-session feature generation unit 124 generates intra-session feature data based on the series data that is needed to generate features, out of the flow data supplied from the generation source selection unit 123, and outputs the intra-session feature data to the feature selection unit 125.
More specifically, the intra-session feature generation unit 124 includes a metadata extraction unit 124a and an estimation model 124b.
The metadata extraction unit 124a extracts metadata consisting of the number of time series in the flow data, the series length (the number of samples per series), the statistics of each variable (mean, variance, and so on), and the like, and outputs the metadata to the estimation model 124b.
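As a rough illustration of the metadata described here, the following sketch computes the number of series, the length of each series, and the mean and variance of one variable. The function name `extract_metadata`, the dictionary layout, and the use of the population variance are assumptions made for this example; the text does not specify which variance or which data structure is used.

```python
from statistics import mean, pvariance


def extract_metadata(series_by_session: dict) -> dict:
    """Summarize one variable's time series (hypothetical layout)."""
    lengths = [len(values) for values in series_by_session.values()]
    all_values = [x for values in series_by_session.values() for x in values]
    return {
        "num_series": len(series_by_session),   # number of time series
        "series_lengths": lengths,              # samples per series
        "mean": mean(all_values),               # statistic of the variable
        "variance": pvariance(all_values),      # population variance (assumed)
    }


# Pitch speeds (km/h) per at-bat, taken from the FIG. 5 example.
speeds = {0: [140, 150, 120], 1: [120, 110], 2: [90, 130, 155]}
metadata = extract_metadata(speeds)
```

A dictionary of this kind could then serve as the input from which an estimation model proposes a feature generation method.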
The estimation model 124b is a model trained in advance on pairs of metadata and the feature generation method used for predicting a prediction target; based on the metadata, it estimates the feature generation method needed to predict the prediction target.
The intra-session feature generation unit 124 generates the intra-session feature data by applying the feature generation method that the estimation model 124b estimated from the metadata to the source series data selected by the generation source selection unit 123.
The method for generating the intra-session feature data will be described in detail later with reference to FIG. 15.
The feature selection unit 125 computes, for each of the features making up the intra-session feature data and the inter-session feature data, an effectiveness score for predicting the prediction target, selects the features whose score is higher than a predetermined effectiveness score, excludes the rest, and thereby reconstructs the intra-session feature data and the inter-session feature data.
More specifically, the feature selection unit 125 includes an intra-session feature selection unit 141, an inter-session feature selection unit 142, and an effectiveness score calculation unit 143.
The intra-session feature selection unit 141 controls the effectiveness score calculation unit 143 to calculate an effectiveness score for each of the features making up the intra-session feature data, selects the features whose score is higher than a predetermined effectiveness score, excludes the features whose score is lower, and thereby reconstructs the intra-session feature data, which it outputs to the inter-session feature generation unit 126 and the combining unit 127.
The inter-session feature selection unit 142 controls the effectiveness score calculation unit 143 to calculate an effectiveness score for each of the features making up the inter-session feature data supplied from the inter-session feature generation unit 126, selects the features whose score is higher than a predetermined effectiveness score, excludes the features whose score is lower, and thereby reconstructs the inter-session feature data, which it outputs to the combining unit 127.
The effectiveness score calculation unit 143 calculates, for each of the features making up the intra-session feature data and the inter-session feature data, for example the mutual information with the prediction target as the effectiveness score, and outputs the score to the intra-session feature selection unit 141, the inter-session feature selection unit 142, and the loop determination unit 129.
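The following sketch shows one way mutual information could serve as an effectiveness score, followed by threshold-based selection. It is an illustration under stated assumptions: the features and the target are treated as discrete values (continuous features such as pitch speed would first need binning), the score is measured in bits, and the names `mutual_information` and `select_features`, the sample features, and the threshold are all hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys) -> float:
    """Mutual information (bits) between two discrete sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in count_xy.items()
    )


def select_features(features: dict, target: list, threshold: float):
    """Keep only the features whose score exceeds the threshold."""
    scores = {name: mutual_information(values, target)
              for name, values in features.items()}
    kept = {name: values for name, values in features.items()
            if scores[name] > threshold}
    return kept, scores


target = ["hit", "out", "out", "hit"]
features = {
    "informative": [1, 0, 0, 1],  # determines the target exactly
    "constant": [0, 0, 0, 0],     # carries no information
}
kept, scores = select_features(features, target, threshold=0.5)
```

Here the informative feature scores 1 bit and is kept, while the constant feature scores 0 and is excluded.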
Alternatively, the effectiveness score calculation unit 143 may use, as the effectiveness score, the accuracy of a machine learning model generated using the intra-session feature data and the inter-session feature data. In this case, however, the machine learning model used is one obtained with a simpler machine learning algorithm or simpler hyperparameters than the machine learning model generated by the machine learning model generation unit 63.
In this case, the intra-session feature selection unit 141 and the inter-session feature selection unit 142 each select a subset of the features making up the intra-session feature data or the inter-session feature data such that the effectiveness score calculated from the accuracy and the like of the generated machine learning model does not fall below a predetermined value, and thereby reconstruct the intra-session feature data and the inter-session feature data.
The inter-session feature generation unit 126 generates inter-session feature data based on the reconstructed intra-session feature data output from the feature selection unit 125, which consists of features whose score is higher than the predetermined effectiveness score, and outputs the inter-session feature data to the feature selection unit 125.
The method for generating the inter-session feature data will be described in detail later with reference to FIG. 16.
The combining unit 127 combines the reconstructed intra-session feature data and the reconstructed inter-session feature data supplied from the feature selection unit 125, each consisting of features whose score is higher than the predetermined effectiveness score, to form the feature data, and stores the feature data in the feature data storage 128.
The feature data storage 128 stores the feature data supplied from the combining unit 127 and supplies the stored feature data to the loop determination unit 129 as necessary.
The loop determination unit 129 calculates the overall effectiveness score of the feature data for predicting the prediction target, for example as the average of the effectiveness scores of the features making up the feature data stored in the feature data storage 128, in which the intra-session feature data and the inter-session feature data are combined.
If the overall effectiveness score of the feature data is lower than a predetermined value, the loop determination unit 129 instructs the generation source selection unit 123 to run the process again so that more features than the current number are extracted from the same flow data.
When a predetermined time has elapsed, or when the effectiveness score is higher than the predetermined value, the loop determination unit 129 outputs the feature data stored in the feature data storage 128 at that point, together with the overall effectiveness score of the feature data, to the UI control unit 61 and the machine learning model generation unit 63.
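The loop control described above can be sketched as follows. This is only an illustration: the callback `generate`, which stands in for the whole chain from generation source selection to feature storage, and the names `run_feature_search`, `target_score`, and `time_limit_s` are hypothetical, and the behavior when the score exactly equals the predetermined value is not stated in the text (this sketch keeps looping in that case).

```python
import time


def run_feature_search(generate, target_score: float, time_limit_s: float):
    """Repeat feature generation until the score or time condition is met."""
    start = time.monotonic()
    num_features = 1
    while True:
        # generate() returns {feature name: effectiveness score}.
        scores = generate(num_features)
        overall = sum(scores.values()) / len(scores)  # overall score as the mean
        timed_out = time.monotonic() - start >= time_limit_s
        if overall > target_score or timed_out:
            return scores, overall
        num_features += 1  # ask for more features on the next pass


def fake_generate(k: int) -> dict:
    """Stand-in generator whose scores improve as more features are requested."""
    return {f"f{i}": 0.1 * k for i in range(k)}


scores, overall = run_feature_search(fake_generate, target_score=0.25,
                                     time_limit_s=60.0)
```

With the stand-in generator, the loop stops on the third pass, when the average score first exceeds 0.25.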
The generated feature visualization unit 103 visualizes the generated feature data and the overall effectiveness score of the feature data as a UI and presents them.
At this point, for example, when the effectiveness score of the generated feature data is judged to be sufficient and the user instructs generation of a machine learning model based on the selected feature data, the machine learning model generation unit 63 may generate a machine learning model based on the supplied feature data.
<About task settings>
Task settings are the settings for carrying out, from flow data, tasks such as predicting future values of time-series data, predicting whether a specific event will occur in time-series data, and predicting data that is not a time series (that does not change with time).
More specifically, the task settings specify the column indicating time or the column indicating the session unit in the flow data and the column to be predicted, and, where needed, also include the prediction frequency and the prediction time for the prediction target column.
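The task settings can be pictured as a small record. The sketch below is an assumption for illustration only: the class name `TaskSetting`, the field names, and the representation of the optional prediction frequency and prediction time as free-form strings are hypothetical, since the text does not define their types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskSetting:
    """Hypothetical container for the task settings described in the text."""
    time_column: str                            # column indicating time
    session_column: str                         # column indicating the session unit
    target_column: str                          # column to be predicted
    prediction_frequency: Optional[str] = None  # optional; format is assumed
    prediction_time: Optional[str] = None       # optional; format is assumed


# Settings for the baseball example: predict pitch speed per pitch.
setting = TaskSetting(time_column="pitch_id",
                      session_column="at_bat_id",
                      target_column="speed_kmh")
```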
As an example, the task settings for flow data such as that shown in FIG. 5 will be described.
FIG. 5 shows an example of flow data for a pitching log concerning a given baseball batter. The flow data FD in FIG. 5 consists of attribute data AD and time-series data TD.
The attribute data AD consists of three columns, which are, from the left in the figure, a pitcher ID column, an at-bat ID column, and a result column.
The pitcher ID column holds the IDs that identify the pitchers who pitched to the given batter; in the figure, pitcher IDs A, B, and A are registered from the top.
The at-bat ID column holds the IDs that identify the given batter's at-bats; in the figure, at-bat IDs 0, 1, and 2 are registered from the top.
The result column holds the result, in the at-bat identified by the at-bat ID, of the given batter facing pitches from the pitcher identified by the pitcher ID; in the figure, "hit", "out", and "out" are registered from the top.
Accordingly, it is registered that the given batter got a hit in the at-bat identified by at-bat ID = 0 against the pitcher with pitcher ID = A.
It is also registered that the given batter was out in the at-bat identified by at-bat ID = 1 against the pitcher with pitcher ID = B.
It is further registered that the given batter was out in the at-bat identified by at-bat ID = 2 against the pitcher with pitcher ID = A.
The time-series data TD consists of three columns, which are, from the left in the figure, an at-bat ID column, a pitch ID column, and a pitch speed column.
The at-bat ID column holds the IDs that identify the given batter's at-bats; in the figure, at-bat IDs 0, 0, 0, 1, 1, 2, 2, and 2 are registered from the top.
The pitch ID column holds, in chronological order, the IDs that identify the pitches thrown by the pitcher to the given batter; in the figure, pitch IDs 0, 1, 2, 0, 1, 0, 1, and 2 are registered from the top.
The pitch speed column holds the speed (km/h) of each pitch thrown to the given batter by the pitcher identified by the pitcher ID in the at-bat identified by the at-bat ID; in the figure, 140, 150, 120, 120, 110, 90, 130, and 155 are registered from the top.
Accordingly, it is registered that, in the given batter's at-bat identified by at-bat ID = 0, the first pitch (pitch ID = 0) was 140 km/h, the second pitch (pitch ID = 1) was 150 km/h, and the third pitch (pitch ID = 2) was 120 km/h.
It is also registered that, in the at-bat identified by at-bat ID = 1, the first pitch (pitch ID = 0) was 120 km/h and the second pitch (pitch ID = 1) was 110 km/h.
It is further registered that, in the at-bat identified by at-bat ID = 2, the first pitch (pitch ID = 0) was 90 km/h, the second pitch (pitch ID = 1) was 130 km/h, and the third pitch (pitch ID = 2) was 155 km/h.
In this case, the pitch column (the pitch ID column) in the time-series data TD holds information registered in chronological order, and is therefore treated as the time column.
The time-series data TD and the attribute data AD both contain the at-bat ID column, which serves as their common session column.
Furthermore, the pitcher ID column can be regarded as a set (session cluster) that groups, at a higher level, the at-bat ID column serving as the session column.
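The relationship between the two tables can be made concrete with a short sketch that holds the FIG. 5 values as lists of records and joins them on the shared session column. The English field names and the helper `join_on_session` are hypothetical; the values are those listed in the text.

```python
# Attribute data AD: one record per at-bat (session).
attribute_data = [
    {"pitcher_id": "A", "at_bat_id": 0, "result": "hit"},
    {"pitcher_id": "B", "at_bat_id": 1, "result": "out"},
    {"pitcher_id": "A", "at_bat_id": 2, "result": "out"},
]

# Time-series data TD: one record per pitch, ordered by pitch ID within an at-bat.
time_series_data = [
    {"at_bat_id": 0, "pitch_id": 0, "speed_kmh": 140},
    {"at_bat_id": 0, "pitch_id": 1, "speed_kmh": 150},
    {"at_bat_id": 0, "pitch_id": 2, "speed_kmh": 120},
    {"at_bat_id": 1, "pitch_id": 0, "speed_kmh": 120},
    {"at_bat_id": 1, "pitch_id": 1, "speed_kmh": 110},
    {"at_bat_id": 2, "pitch_id": 0, "speed_kmh": 90},
    {"at_bat_id": 2, "pitch_id": 1, "speed_kmh": 130},
    {"at_bat_id": 2, "pitch_id": 2, "speed_kmh": 155},
]


def join_on_session(attributes, series, session_column):
    """Attach each session's attributes to every row of its time series."""
    by_session = {row[session_column]: row for row in attributes}
    return [{**by_session[row[session_column]], **row} for row in series]


joined = join_on_session(attribute_data, time_series_data, "at_bat_id")
```

Each joined row carries the pitcher ID and the at-bat result alongside the pitch, which shows how the pitcher ID groups several at-bats into a higher-level session cluster.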
The time column need only hold values whose order can be determined (float, int) or a date-time type (such as YY:MM:DD hh:mm:ss).
The column estimation unit 121 of the data processing unit 62 estimates, for example, the time column and the session column as shown in FIG. 5, and supplies them to the task setting unit 102 of the UI control unit 61 as the column estimation result.
When the input unit 52 is operated on the basis of the UI that presents this result, and information setting the column indicating time, the column indicating the session unit, the prediction target column, and the prediction frequency and prediction time for the prediction target column is entered, the task setting unit 102 outputs that information to the output format determination unit 122 of the data processing unit 62.
More specifically, the task setting unit 102 controls the display unit 71 and the audio output unit 72 of the output unit 53 based on the column estimation result to present the flow data to the user.
At this time, the task setting unit 102 presents a UI prompting the user to set, as task settings, the column indicating the time unit, the column indicating the session unit, and the prediction target column, and outputs the task setting information entered through the UI to the output format determination unit 122 of the data processing unit 62.
More specifically, the task setting unit 102 presents, for example, a display image PV consisting of a UI such as that shown in FIG. 6.
In the UI presented in the display image PV of FIG. 6, the message "Please set the column indicating the time unit, the column indicating the session unit, and the prediction target column." appears at the top, prompting the user to set these columns.
Below the message, the attribute data AD is displayed on the left and the time-series data TD is displayed on the right.
FIG. 6 shows an example in which, in response to this prompt, the at-bat ID column enclosed by a dotted line is set as the column indicating the session unit, the pitch column enclosed by a dash-dotted line is set as the column indicating the time unit, and the pitch speed column enclosed by a solid line is set as the prediction target.
The task setting unit 102 outputs to the output format determination unit 122 the information on the column indicating the time unit, the column indicating the session unit, and the prediction target column that were set using frames such as the dotted, dash-dotted, and solid lines shown in FIG. 6.
At this time, in addition to the column indicating the time unit, the column indicating the session unit, and the prediction target column, the user may also be asked to enter the prediction frequency and the prediction time for the prediction target column as task settings.
<Determining the output format>
The output format determination unit 122 determines the output format based on the information supplied from the task setting unit 102 of the UI control unit 61 that sets the column indicating time, the column indicating the session unit, the prediction target column, and the prediction frequency and prediction time for the prediction target column.
Examples of the output format include the melt format shown in FIG. 7 and the pivot format shown in FIG. 8.
The melt format in FIG. 7 consists of, from the left, an id column, a time column, a name column, and a value column. In the melt format of FIG. 7, the name column constitutes the session unit, the id column is the session cluster that groups session units, the time column is the sampling time column, and the value column is the column of sampled time-series data.
That is, in FIG. 7 there are two higher-level session clusters, A and B, that group session units; within the session unit there are two series, x and y; and times t1 and t2 are set for each series.
When the time settings differ from series to series within a session unit in this way, the melt format shown in FIG. 7 is effective.
In FIG. 7, the time-series data registered from the top are x(A,t1), x(A,t2), y(A,t1), y(A,t2), x(B,t1), x(B,t2), y(B,t1), and y(B,t2).
In contrast, when the sampling time column is common to all series within a session unit, the pivot format shown in FIG. 8 may be used.
The pivot format in FIG. 8 consists of an id column, a time column, a value x column, and a value y column. In the pivot format of FIG. 8, the two series x and y share a common sampling time column within each session unit, and the value x column and the value y column are registered side by side.
In FIG. 8, x(A,t1), x(A,t2), x(B,t1), and x(B,t2) are registered from the top in the value x column, and y(A,t1), y(A,t2), y(B,t1), and y(B,t2) are registered in the value y column.
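The relationship between the two layouts can be sketched as a conversion from melt-format rows to pivot-format rows. The function name `melt_to_pivot` and the record layout are hypothetical, and the conversion assumes, as the text states for the pivot format, that every series shares the same sampling times; the cell values are the symbolic entries of FIGS. 7 and 8.

```python
def melt_to_pivot(rows):
    """Turn (id, time, name, value) rows into one row per (id, time)."""
    pivot = {}
    for r in rows:
        key = (r["id"], r["time"])
        row = pivot.setdefault(key, {"id": r["id"], "time": r["time"]})
        # Each series name becomes its own value column.
        row[f"value_{r['name']}"] = r["value"]
    return list(pivot.values())


# Melt-format rows in the order listed for FIG. 7.
melt_rows = [
    {"id": i, "time": t, "name": n, "value": f"{n}({i},{t})"}
    for i in ("A", "B")
    for n in ("x", "y")
    for t in ("t1", "t2")
]
pivot_rows = melt_to_pivot(melt_rows)
```

The eight melt rows collapse into four pivot rows, one per combination of id and time, each holding the x and y values side by side.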
More specifically, suppose that, as shown in the left part of FIG. 9, the at-bat ID column enclosed by a dotted line is set as the column indicating the session unit, the pitch column enclosed by a dash-dotted line is set as the column indicating the time unit, and the pitch speed column enclosed by a solid line is set as the prediction target column. In that case, the output format determination unit 122 determines, for example, the output format FIS1 shown in the right part of FIG. 9.
The output format FIS1 in the right part of FIG. 9 is based on the melt format described with reference to FIG. 7 and has, from the left, the at-bat ID column indicating the session unit, a pitch column, a pitch speed column, a previous-pitch speed column, and a previous at-bat result column.
In the output format FIS1 in the right part of FIG. 9, 0, 0, 0, 1, 1, 2, 2, and 2 are registered from the top in the at-bat ID column, and 0, 1, 2, 0, 1, 0, 1, and 2 are registered from the top in the pitch column.
In addition, 140, 150, 120, 120, 110, 90, 130, and 155 are registered from the top in the pitch speed column; NaN, 140, 150, NaN, 120, NaN, 90, and 130 are registered from the top in the previous-pitch speed column; and NaN, NaN, NaN, hit, hit, out, out, and out are registered from the top in the previous at-bat result column.
That is, because the column to be predicted here is "pitch speed", the format has one row per time point (one row per pitch) so that the pitch speed data is arranged as time-series data.
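The per-pitch layout of FIS1 can be reproduced with a short sketch that derives the previous-pitch speed and the previous at-bat result from the FIG. 5 data. The function name `build_per_pitch_rows` and the field names are hypothetical, `None` stands in for NaN, and the sketch assumes that both tables are already in chronological order.

```python
def build_per_pitch_rows(attribute_data, time_series_data):
    """Build one row per pitch with two lag features."""
    # Result of the at-bat preceding each at-bat (None for the first one).
    prev_result, previous = {}, None
    for row in attribute_data:
        prev_result[row["at_bat_id"]] = previous
        previous = row["result"]

    rows, last_speed = [], {}
    for row in time_series_data:
        at_bat = row["at_bat_id"]
        rows.append({
            "at_bat_id": at_bat,
            "pitch_id": row["pitch_id"],
            "speed_kmh": row["speed_kmh"],
            # Speed of the previous pitch within the same at-bat.
            "prev_pitch_speed": last_speed.get(at_bat),
            "prev_at_bat_result": prev_result[at_bat],
        })
        last_speed[at_bat] = row["speed_kmh"]
    return rows


attribute_data = [
    {"pitcher_id": "A", "at_bat_id": 0, "result": "hit"},
    {"pitcher_id": "B", "at_bat_id": 1, "result": "out"},
    {"pitcher_id": "A", "at_bat_id": 2, "result": "out"},
]
at_bats = [0, 0, 0, 1, 1, 2, 2, 2]
pitches = [0, 1, 2, 0, 1, 0, 1, 2]
speeds = [140, 150, 120, 120, 110, 90, 130, 155]
time_series_data = [{"at_bat_id": a, "pitch_id": p, "speed_kmh": s}
                    for a, p, s in zip(at_bats, pitches, speeds)]
fis1 = build_per_pitch_rows(attribute_data, time_series_data)
```

The two derived columns match the values listed for FIG. 9: the previous-pitch speed is missing at the first pitch of each at-bat, and the previous at-bat result is missing throughout the first at-bat.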
Similarly, as shown in the left part of FIG. 10, suppose that the at-bat ID column (dotted line) is set as the column indicating the session unit, the pitch column (dash-dotted line) is set as the column indicating the time unit, and the result column (solid line) is set as the prediction target column. In that case, the output format determining unit 122 determines, for example, the output format FIS2 shown in the right part of FIG. 10.
The output format FIS2 in the right part of FIG. 10 uses the pivot format described with reference to FIG. 8. From the left, it has the pitcher ID column, the at-bat ID column indicating the session unit, the result column, the per-at-bat average ball speed column, and the previous at-bat result column.
In the output format FIS2 in the right part of FIG. 10, A, B, A are registered from the top in the pitcher ID column; 0, 1, 2 in the at-bat ID column; hit, out, out in the result column; 145, 115, 110 in the per-at-bat average ball speed column; and NaN, hit, out in the previous at-bat result column.
That is, because the prediction target here is the "result", each row corresponds to one session (one row per at-bat ID). The time-series data is represented by the per-at-bat average ball speed; in other words, the format adds features in which the time information is aggregated using statistics.
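The one-row-per-session layout can be sketched in the same spirit. The snippet below is only an illustration under assumptions: `pivot_sessions` and the record keys are hypothetical names, the mean is used as the aggregating statistic, and the input values are made up for the example rather than taken from FIG. 10.

```python
from statistics import mean


def pivot_sessions(rows, results):
    """Collapse per-pitch rows into one record per at-bat (session)."""
    speeds = {}
    for at_bat, _pitch, speed in rows:
        speeds.setdefault(at_bat, []).append(speed)

    table = []
    prev_result = None  # None plays the role of NaN for the first session
    for at_bat in sorted(speeds):
        table.append({
            "at_bat_id": at_bat,
            "result": results[at_bat],
            # Time information aggregated into a single statistic.
            "mean_ball_speed": mean(speeds[at_bat]),
            "prev_at_bat_result": prev_result,
        })
        prev_result = results[at_bat]
    return table


# Illustrative input: (at_bat_id, pitch, ball_speed).
rows = [(0, 0, 140), (0, 1, 150), (1, 0, 120), (1, 1, 110)]
results = {0: "hit", 1: "out"}
sessions = pivot_sessions(rows, results)
```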
<How to generate features>
Next, a method of generating the features for each piece of series data will be described.
A feature is configured as a vector whose elements are a plurality of statistics obtained in the time direction from the time-series data for each piece of series data.
For example, when the time-series data of a given series is expressed by a waveform Ldt that changes in the direction of time t, as shown in FIG. 11, a window of a predetermined time width w in the time direction is set at each of times t1, t2, t3, ..., and the values of the waveform Ldt in each window are acquired as partial series X1, X2, X3, ....
In addition, the values of the waveform Ldt at times t11, t12, t13, each a predetermined time in the future relative to the corresponding partial series X1, X2, X3, ..., are acquired as prediction targets y1, y2, y3, ....
The acquired partial series X1, X2, X3, ... are then converted into predetermined statistical values f(X1), f(X2), f(X3), ... over the windows of the predetermined time width w starting at times t1, t2, t3, ..., respectively. A vector whose elements are the converted statistical values and the prediction targets y1, y2, y3, ... is then constructed, which forms the features for each piece of series data.
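The pairing of partial series with future prediction targets can be sketched as below. This is a simplified illustration for an evenly sampled series: `make_windows`, `width`, and `horizon` are hypothetical names, and measuring the horizon from the end of the window is an assumption, since the text only says the target lies a predetermined time in the future.

```python
def make_windows(series, width, horizon):
    """Pair each window X_n with a target y_n taken `horizon` samples
    after the last sample of the window (horizon >= 1)."""
    pairs = []
    for start in range(len(series) - width - horizon + 1):
        window = series[start:start + width]          # partial series X_n
        target = series[start + width + horizon - 1]  # prediction target y_n
        pairs.append((window, target))
    return pairs


pairs = make_windows([1, 2, 3, 4, 5, 6], width=3, horizon=1)
```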
More specifically, the column of features of a given piece of series data and the prediction target column are expressed, for example, by the following equations (1) and (2).
Fs = (f(X1), f(X2), f(X3), ...)   ... (1)
Fp = (y1, y2, y3, ...)   ... (2)
Here, Fs is the feature of the given series data, and f(X1), f(X2), f(X3), ... are elements each consisting of the statistics of a partial series Xn of the given series data expressed by the waveform Ldt. Fp is the prediction target column, and y1, y2, y3, ... are the prediction targets corresponding to the partial series X1, X2, X3, ..., respectively.
Each element f(Xn) of the feature Fs corresponding to the partial series Xn is expressed, for example, by the following equation (3).
f(Xn) = (Ave(Xn), Min(Xn), Max(Xn), Var(Xn), Stde(Xn), ...)   ... (3)
Here, f(Xn) is each element of the feature Fs of the series data, Ave(Xn) is the mean of the partial series Xn, Min(Xn) is its minimum, Max(Xn) is its maximum, Var(Xn) is its variance, and Stde(Xn) is its standard deviation.
Statistics other than the mean, minimum, maximum, variance, and standard deviation described above may also be used for the partial series Xn.
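Equation (3) can be illustrated with a short sketch that maps one partial series to its statistics vector. The name `window_stats` is hypothetical, and the choice of population variance and standard deviation (rather than the sample versions) is an assumption, since the text does not say which is intended.

```python
from statistics import mean, pstdev, pvariance


def window_stats(window):
    """Return (Ave, Min, Max, Var, Stde) for one partial series X_n.

    Population variance and standard deviation are used here as an assumption.
    """
    return (
        mean(window),
        min(window),
        max(window),
        pvariance(window),
        pstdev(window),
    )


# Fs as in equation (1): one statistics vector per partial series.
partial_series = [[1, 2, 3], [2, 3, 4], [2, 2, 2]]
Fs = [window_stats(x) for x in partial_series]
```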
The example above expresses each element f(Xn) of the feature Fs in equation (3) as a vector whose elements are the individual statistics. Alternatively, each element may be expressed as a weighted sum of products (a convolution kernel) computed by a kernel function using the statistics. For convolution kernels, see, for example, https://arxiv.org/abs/1910.13051.
<How to set the window>
The windows that form the partial series Xn described above may be set by various methods.
For example, as shown in the left part of FIG. 12, the partial series Xn may be set in units of a window WB of a predetermined time width wb, taking the session start time tb as a reference and varying the offset offset-fb from that start time.
Alternatively, as shown in the right part of FIG. 12, when the prediction start time is taken as a reference time ts, the partial series Xn may be set in units of a window WS of a predetermined time width ws while varying the offset offset-fs from the reference time ts.
Furthermore, as shown in FIG. 13, the partial series may be set in units of a window WSS whose time width extends from the session start time tb to a time tos that is offset by a predetermined time from the prediction execution time ts.
Also as shown in FIG. 13, the partial series may be set in units of a window WA whose time width covers the entire range from the session start time tb to the end of the session.
In addition, a specific value Ldt(s), obtained by shifting back a fixed period from the time tos, may be obtained as a partial series.
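The window variants above can be sketched as index slices over one session. All function names are hypothetical, the series is assumed to be evenly sampled so that times map to list indices, and the offsets from the reference time are assumed to point into the past; the figures may define the directions differently.

```python
def window_from_session_start(series, offset, width):
    """Window WB: `width` samples starting `offset` samples after the session start."""
    return series[offset:offset + width]


def window_before_reference(series, ref, offset, width):
    """Window WS: `width` samples ending `offset` samples before the reference index."""
    end = max(0, ref - offset)
    return series[max(0, end - width):end]


def window_session_to_offset(series, ref, offset):
    """Window WSS: everything from the session start up to tos = ref - offset."""
    return series[:max(0, ref - offset)]


def window_whole_session(series):
    """Window WA: the entire session."""
    return series[:]


def shifted_value(series, ref, offset, shift):
    """Single value Ldt(s), `shift` samples before tos (None if out of range)."""
    idx = ref - offset - shift
    return series[idx] if 0 <= idx < len(series) else None


session = list(range(10))  # illustrative samples 0..9
```

For example, with the reference index 8 and an offset of 2, `window_before_reference(session, 8, 2, 3)` selects the three samples that end just before index 6.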
<Selection of the series used as generation sources of features>
Next, the selection of the series data used as generation sources of features by the generation source selection unit 123 will be described with reference to FIG. 14.
As described above, the information from which features are generated (hereinafter referred to as generation-source features) is generated as vectorized information for each piece of series data.
However, not all series data is useful for predicting the prediction target; some of it is unnecessary for the prediction.
Therefore, in the present disclosure, the generation source selection unit 123 judges whether each piece of series data, including both the time-series data and the attribute data extracted from the flow data, is useful as a generation source of features, and excludes it as necessary.
More specifically, as shown in FIG. 14, consider a case where series data L1 to L3 exist as time-series data that can be used as generation sources for a machine learning model that predicts a prediction target T.
The generation source selection unit 123 determines whether each of the series data L1 to L3 is appropriate as a generation source of the features used to predict the prediction target. In more detail, for the series data L1, for example, the generation source selection unit 123 extracts, in time series, generation-source features F(tn) consisting of statistics Fa to Fd as the time-series generation-source features used to predict the prediction target T, and generates a feature table TB.
The statistics Fa to Fd here correspond to Ave(Xn), Min(Xn), Max(Xn), Var(Xn), Stde(Xn), ... in equation (3) above, and the generation-source feature F(tn) corresponds to each element f(Xn) of the feature Fs of the series data.
FIG. 14 shows an example in which the generation-source features F(t1) = (Fa(t1), Fb(t1), Fc(t1), Fd(t1)) and F(t2) = (Fa(t2), Fb(t2), Fc(t2), Fd(t2)) are extracted from the series data L1 as the generation-source features F(tn) to create the feature table TB. A detailed description of the prediction target T is omitted from the feature table TB in FIG. 14.
Next, the generation source selection unit 123 determines, based on the generation-source features F(t1) = (Fa(t1), Fb(t1), Fc(t1), Fd(t1)) and F(t2) = (Fa(t2), Fb(t2), Fc(t2), Fd(t2)), whether the series data L1 is a series that contributes to the prediction of the prediction target T.
First, the generation source selection unit 123 excludes the series data L1 from the features when, for example, the series data L1 shows no change over time and no correlation with the prediction target is observed.
When the series data L1 is found to change over time and to correlate with the prediction target, the generation source selection unit 123 inputs the generation-source features F(t1) = (Fa(t1), Fb(t1), Fc(t1), Fd(t1)), F(t2) = (Fa(t2), Fb(t2), Fc(t2), Fd(t2)), ... of the series data L1 into a prediction model PM to predict the prediction target T, and obtains the prediction result T'.
The prediction model PM is a relatively simple and lightweight model for roughly predicting the prediction target T from the generation-source features of a given series.
The generation source selection unit 123 calculates a prediction accuracy PA by comparing the prediction target T with the prediction result T', and, when the prediction accuracy is lower than a predetermined threshold, excludes the series data L1 from the generation sources used to obtain the prediction target.
The generation source selection unit 123 obtains the prediction accuracy PA in the same way for the series data L2 and L3 as well as L1, and excludes from the generation sources any series whose prediction accuracy is lower than the predetermined accuracy.
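The screening steps above can be sketched as follows. This is a deliberately reduced illustration, not the method of the generation source selection unit 123: each series is represented by a single feature column instead of the vector F(tn), a one-variable least-squares line stands in for the lightweight prediction model PM, its in-sample R^2 stands in for the prediction accuracy PA, and the threshold 0.5 is arbitrary. All names are hypothetical.

```python
from statistics import mean, pvariance


def r2_of_linear_fit(x, y):
    """R^2 of the least-squares line y ~ a*x + b (equals the squared correlation)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def select_sources(candidates, target, threshold=0.5):
    """Keep only the series that vary over time and predict the target well enough."""
    kept = []
    for name, feature in candidates.items():
        if pvariance(feature) == 0:  # no change over time
            continue
        if r2_of_linear_fit(feature, target) < threshold:  # accuracy too low
            continue
        kept.append(name)
    return kept


target = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "L1": [2.0, 4.1, 5.9, 8.0, 10.1],   # tracks the target closely
    "L2": [7.0, 7.0, 7.0, 7.0, 7.0],    # constant series
    "L3": [3.0, -1.0, 2.0, -1.0, 3.0],  # uncorrelated with the target
}
kept = select_sources(candidates, target)
```

In practice the accuracy would more naturally be measured on held-out data; in-sample R^2 is used here only to keep the sketch short.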
<How to generate intra-session feature data>
Next, a method by which the intra-session feature generation unit 124 generates intra-session feature data will be described.
As shown in FIG. 15, the intra-session feature generation unit 124 controls a metadata extraction unit 124a to extract, from the flow data, information such as the number of series in the flow data, the series length, the variance of each series, and the number of attribute data items as metadata of the flow data.
In more detail, the metadata extraction unit 124a extracts the metadata from the series data, among the flow data, that the generation source selection unit 123 has selected as generation sources.
The metadata of the flow data may also be, for example, the machine learning model or algorithm that the machine learning model generation unit 63 generates from the feature data generated based on the flow data.
The intra-session feature generation unit 124 includes an estimation model 124b. Pairs of various metadata and the distribution of the methods used to create the final effective features are acquired and pooled, and the estimation model 124b is trained on these pairs so that it estimates the generation method of effective features from metadata.
The intra-session feature generation unit 124 therefore controls the estimation model 124b to estimate the generation method of effective features based on the metadata extracted from the flow data.
That is, the estimation model 124b estimates the generation method of effective features based on the metadata of flow data composed of the series data that, among the series data extracted in the output format determined from the user-set column indicating the time, column indicating the session unit, and prediction target column, the generation source selection unit 123 has found to have a prediction accuracy higher than the predetermined accuracy threshold.
As a result, the estimated generation method of effective features uses, among the series data constituting the flow data, the series data that has high prediction accuracy for the prediction target, given the user-set column indicating the time, column indicating the session unit, and prediction target column.
Consequently, it becomes possible to generate effective features that reflect the user-set column indicating the time, column indicating the session unit, and prediction target column, and that are optimal for generating a machine learning model with high prediction accuracy.
The information on the generation method of effective features is information that specifies how the effective features are generated (that is, calculated), such as how the series data is used for the effective features, how the windows are set, and how the proportion or weight of each value in the elements of a feature is set.
More specifically, the information specifying how the series data is used to generate the effective features is, for example, information that categorical series data and numerical series data are to be used in a predetermined ratio such as 40:60.
The information specifying how the windows are set is, for example, information that the information obtained with the window WB and the window WS in FIG. 12 and the window WSS and the window WA in FIG. 13 is to be used in a ratio of, for example, 50:20:20:10.
Furthermore, the information specifying how the proportion or weight of each value in each element of a feature is set is, for example, information that assigns a proportion or weight to each of Ave(Xn), Min(Xn), Max(Xn), Var(Xn), and Stde(Xn).
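One way to picture such generation-method information is as a set of ratios that is turned into a concrete plan. The sketch below is only one possible reading: it assumes the ratios are applied to a fixed budget of feature slots, which the text does not state, and `allocate`, `method`, and the budget of 20 are hypothetical.

```python
def allocate(budget, ratios):
    """Split `budget` feature slots across options in proportion to `ratios`."""
    total = sum(ratios.values())
    return {name: budget * share // total for name, share in ratios.items()}


# Ratios quoted in the text; the dictionary layout itself is an assumption.
method = {
    "series_type": {"categorical": 40, "numeric": 60},
    "window": {"WB": 50, "WS": 20, "WSS": 20, "WA": 10},
}

plan = {key: allocate(20, ratios) for key, ratios in method.items()}
```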
The intra-session feature generation unit 124 then creates effective features from each piece of series data using the estimated creation method, generates intra-session feature data from the created effective features, and outputs it.
<Inter-session feature data>
Next, the inter-session feature data generated by the inter-session feature generation unit 126 will be described.
The inter-session feature generation unit 126 generates inter-session feature data using features obtained from the temporal order of the intra-session feature data. As described above, the intra-session feature data consists of features generated, by the generation method of effective features estimated from the metadata of the flow data, from the series data whose prediction accuracy PA under the prediction model PM is higher than the predetermined value.
For example, when the flow data is a baseball pitching log, the features from x at-bats earlier, or the features over the entire past, for the same batter, the same pitcher, or the same batter-pitcher pair can be treated as inter-session features.
When the at-bat ID column is set as the session unit, the at-bat ID is so-called integer-type data with a recognizable order, so it can be treated as time-series data that assumes an order between sessions and used for inter-session features.
When a result column such as hit or out is set as the prediction target, information such as the hit or out of the previous at-bat is so-called string-type data with no inherent order; however, the order can be identified from the values in the time column, so it can also be used as an inter-session feature.
Furthermore, by using the pitcher ID column, which clusters the at-bat ID column serving as the session unit, as set information, the ordering may be computed per class within the grouped sessions. For example, when the pitcher ID column can be used as a set that clusters the at-bat IDs serving as the session unit, the prediction target result (hit or out) of the previous at-bat "of the same pitcher", or the average ball speed, can be used as an inter-session feature.
More specifically, as shown in the left part of FIG. 16, consider a case where the at-bat ID is set as the session unit, the pitch column is set as the time column, and the result column is set as the prediction target, and where, as shown in the center part of FIG. 16, intra-session feature data is generated that consists of, from the left, the pitcher ID column that clusters the at-bat ID column, the at-bat ID column serving as the session unit, the result column, and the per-at-bat average ball speed column.
In this case, because the at-bat ID is the session unit, the features of the previous at-bat are inter-session features. Accordingly, relative to the per-at-bat average ball speed column of the intra-session features shown in the center part of FIG. 16, the inter-session features shown in the right part of FIG. 16 add a column of the pitcher's previous average ball speed, keyed by the pitcher ID column that clusters the at-bat ID column.
In the pitcher's previous average ball speed column of the inter-session features in the right part of FIG. 16, the values 137 km/h and 115 km/h, which are the per-at-bat average ball speeds for at-bat IDs 0 and 1 of pitcher IDs A and B in the intra-session features in the center part of FIG. 16, are entered as the same pitcher's previous average ball speed for at-bat IDs 2 and 3.
The pitcher's previous average ball speed does not exist for at-bat IDs 0 and 1 of pitcher IDs A and B, so it is set to "NaN".
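The pitcher's previous average ball speed column can be sketched as a lag taken within each session set. The function name and record keys are hypothetical. The values 137 and 115 for at-bat IDs 0 and 1 come from the description of FIG. 16; the pitcher assignment of at-bat IDs 2 and 3 (A, then B) and their own average speeds are assumptions made for the example.

```python
def add_pitcher_prev_mean(sessions):
    """Add the same pitcher's average ball speed from that pitcher's previous at-bat.

    `sessions` must be ordered by at-bat ID; None plays the role of NaN.
    """
    last_mean = {}  # pitcher_id -> mean ball speed of that pitcher's latest at-bat
    out = []
    for s in sessions:
        out.append({**s, "pitcher_prev_mean_speed": last_mean.get(s["pitcher_id"])})
        last_mean[s["pitcher_id"]] = s["mean_ball_speed"]
    return out


sessions = [
    {"at_bat_id": 0, "pitcher_id": "A", "mean_ball_speed": 137},
    {"at_bat_id": 1, "pitcher_id": "B", "mean_ball_speed": 115},
    {"at_bat_id": 2, "pitcher_id": "A", "mean_ball_speed": 120},  # assumed value
    {"at_bat_id": 3, "pitcher_id": "B", "mean_ball_speed": 125},  # assumed value
]
inter_session = add_pitcher_prev_mean(sessions)
```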
Furthermore, by using a session set that clusters the session units, features such as those of the previous at-bat within the same session set can also be created.
<Examples of flow data in the present disclosure>
Next, with reference to FIG. 17, examples of the session (unit) ID, the time unit, the attribute data, the time-series data, the intra-session features, the inter-session features, and the session set (unit) ID obtained by clustering the session units will be described for the cases where the flow data of the present disclosure is a hospital vital log, a factory robot log, and a baseball pitching log.
When the flow data is a hospital vital log, an example of the session (unit) ID is the patient ID, an example of the time unit is the date and time, an example of the attribute data is the patient's sex, an example of the time-series data is the heartbeat signal, an example of an intra-session feature is the patient's average heart rate, an example of an inter-session feature is the patient age by hospital, and the session set (unit) ID is the hospital ID.
When the flow data is a factory robot log, an example of the session (unit) ID is the operation ID, an example of the time unit is the date and time, an example of the attribute data is the installation location of the robot, an example of the time-series data is the torque sensor signal, an example of an intra-session feature is the robot's average number of stops on that day, an example of an inter-session feature is the total number of stops per robot, and the session set (unit) ID is the robot ID.
When the flow data is a baseball pitching log, an example of the session (unit) ID is the at-bat ID, an example of the time unit is the pitch count within the at-bat, an example of the attribute data is whether the pitcher throws left-handed or right-handed, an example of the time-series data is the ball speed, an example of an intra-session feature is the average ball speed within the at-bat, an example of an inter-session feature is the results of the same pitcher's past three at-bats, and the session set (unit) ID is the pitcher ID.
The examples of the session (unit) ID, the time unit, the attribute data, the time-series data, the intra-session features, the inter-session features, and the session set (unit) ID for the flow data are not limited to those shown in FIG. 17.
 <特徴量データの提示例>
 次に、図18を参照して、生成特徴量可視化部103によりフローデータと特徴量データとが可視化されて提示される際の提示例について説明する。
<Example of presentation of feature data>
Next, with reference to FIG. 18, a presentation example in which flow data and feature data are visualized and presented by the generated feature visualization unit 103 will be described.
 図18は、フローデータが野球投球ログである場合の提示例を示している。 FIG. 18 shows an example of presentation when the flow data is a baseball pitching log.
 図18の特徴量データの提示例においては、上段に特徴量データテーブルが表示され、下段には、上段の特徴量データテーブル内において、指定されたセッション内特徴量データの一部の詳細データを表示するグラフが表示されている。また、特徴量データテーブルの右上部には、特徴量データ全体の有効度スコア表示欄が設けられており、図18においては、「特徴量データ全体の有効度スコア:85/100」と表記されており、例えば、予測対象の予測に対する有効度スコアが100点満点中85点であることが示されている。 In the feature data presentation example in FIG. 18, the feature data table is displayed in the upper row, and the detailed data of a part of the specified in-session feature data in the upper feature data table is displayed in the lower row. The graph to be displayed is displayed. In addition, in the upper right corner of the feature data table, there is a field for displaying the effectiveness score of the entire feature data, and in FIG. For example, it is shown that the effectiveness score for the prediction of the prediction target is 85 points out of 100 points.
 特徴量データテーブルには、左からデータID列、セッション単位としての打席ID列、属性データとしての投手ID列、および結果列、時刻列としての投球ID列、予測対象としての球速列、セッション内特徴量データとしての1球前の球速列、直近3球球速平均列、および1球前の球種列、並びに、セッション間特徴量データとしての投手の前回の球速平均列が設けられている。 The feature data table includes, from the left, a data ID column, a turn-at-bat ID column as a session unit, a pitcher ID column as attribute data, a result column, a pitch ID column as a time column, a pitch speed column as a prediction target, and within a session. The pitch speed row of the previous pitch, the average speed of the most recent three pitches, and the pitch type row of the previous pitch are provided as feature data, and the pitcher's previous average pitch speed row of the pitcher as inter-session feature data.
 図18においては、データID列においては、上から順に1,2,3,4,5,6と表示されている。 In FIG. 18, in the data ID column, 1, 2, 3, 4, 5, and 6 are displayed in order from the top.
 また、セッション単位の列としての打席ID列においては、上から順に1,1,1,1,1,2と表示されており、データID=1乃至5までのデータが、打席ID=1のものであり、データID=6のデータが、打席ID=2のものであることが示されている。 In addition, in the turn-at-bat ID column, which is a column for each session, 1, 1, 1, 1, 1, 2 are displayed in order from the top, and the data from data ID = 1 to 5 is the turn-at-bat ID column of turn ID = 1. This shows that the data with data ID=6 is for turn at bat ID=2.
Furthermore, the pitcher ID column, which is attribute data, shows A, A, A, A, A, B in order from the top, indicating that the rows with data ID=1 to 5 belong to pitcher ID=A and that the row with data ID=6 belongs to pitcher ID=B.
The result column shows "hit", "hit", "hit", "hit", "hit", "out" from the top, indicating that the result for data ID=1 to 5 is a hit and that the result for data ID=6 is an out.
The pitch ID column, which serves as the time column, shows 1, 2, 3, 4, 5, 1 in order from the top. The rows with data ID=1 to 5 are the first through fifth pitches thrown by the same pitcher (pitcher ID=A) in the at-bat with at-bat ID=1, and the batter got a hit on the pitch with pitch ID=5.
The row with data ID=6 is the data for the pitch with pitch ID=1 in the next at-bat.
The pitch speed column, which is the prediction target, shows 143.9, 140.2, 130.9, 90.4, 124.3, 150.2 in order from the top. In the at-bat with at-bat ID=1, the speed of the first pitch (pitch ID=1) is 143.9 km/h, the second (pitch ID=2) is 140.2 km/h, the third (pitch ID=3) is 130.9 km/h, the fourth (pitch ID=4) is 90.4 km/h, and the fifth (pitch ID=5) is 124.3 km/h. In the at-bat with at-bat ID=2, the speed of the first pitch (pitch ID=1) is 150.2 km/h.
The previous-pitch speed column, which is intra-session feature data, shows NaN, 143.9, 140.2, 130.9, 90.4, NaN from the top, giving the speed of the pitch immediately before each of data ID=1 to 6. NaN appears where the at-bat has no earlier pitch.
The last-three-pitch average speed column shows NaN, NaN, NaN, 138.3, 120.5, NaN from the top, giving the average speed of the three pitches immediately before each of data ID=1 to 6.
The previous-pitch type column shows NaN, straight, slider, changeup, slow ball, NaN from the top, indicating that, in the at-bat with at-bat ID=1, the pitches thrown by the pitcher with pitcher ID=A immediately before the pitches with pitch ID=2 to 5 were a straight ball, a slider, a changeup, and a slow ball, respectively.
The pitcher's previous average pitch speed column, which is inter-session feature data, shows 120.4, 120.4, 120.4, 120.4, 144.2 in order from the top, indicating that the average pitch speed in the at-bat preceding the at-bat with at-bat ID=1 is 120.4 km/h and that the average pitch speed in the at-bat preceding the at-bat with at-bat ID=2 is 144.2 km/h.
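An inter-session feature such as the pitcher's previous average pitch speed can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `previous_session_mean`, the row layout, and the assumption that rows arrive in chronological order are all hypothetical, and `None` stands in for NaN.

```python
from collections import defaultdict


def previous_session_mean(rows):
    """Map each at-bat (session) to the mean speed of the same pitcher's previous at-bat.

    rows: list of dicts with keys 'at_bat_id', 'pitcher_id', 'speed',
          assumed to be in chronological order (an assumption of this sketch).
    Returns {at_bat_id: mean speed or None when the pitcher has no earlier at-bat}.
    """
    speeds = defaultdict(list)   # at_bat_id -> speeds in that session
    pitcher_of = {}              # at_bat_id -> pitcher_id
    order = []                   # sessions in order of first appearance
    for r in rows:
        sid = r["at_bat_id"]
        if sid not in pitcher_of:
            pitcher_of[sid] = r["pitcher_id"]
            order.append(sid)
        speeds[sid].append(r["speed"])

    last_mean = {}               # pitcher_id -> mean speed of the latest finished session
    feature = {}
    for sid in order:
        pitcher = pitcher_of[sid]
        feature[sid] = last_mean.get(pitcher)          # value from the earlier session
        last_mean[pitcher] = sum(speeds[sid]) / len(speeds[sid])
    return feature
```

Every row of an at-bat would then receive the value computed for that at-bat, which is why the column repeats the same number within a session.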
Furthermore, the lower part shows an example graph that displays detailed data when the last-three-pitch average speed is specified for pitch ID=4, which is data ID=4 in the feature data table.
In the lower graph, the pitch speeds for pitch ID=1 to 5 (143.9 km/h, 140.2 km/h, 130.9 km/h, 90.4 km/h, and 124.3 km/h) are plotted, and the plotted points are connected by straight lines.
The graph also indicates that the speeds of the three pitches immediately before pitch ID=4 are 143.9 km/h, 140.2 km/h, and 130.9 km/h, and that their average is 138.3 km/h.
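The two numeric intra-session features in the table (previous-pitch speed and last-three-pitch average) can be reproduced with a short sketch. The function name `lag_and_rolling` is hypothetical, `None` stands in for NaN, and the function is assumed to be applied to one session (at-bat) at a time.

```python
def lag_and_rolling(speeds, window=3):
    """Compute per-pitch lag and rolling-mean features within a single session.

    speeds: pitch speeds of one at-bat in pitch order.
    Returns (previous_speed, rolling_mean), each the same length as speeds.
    rolling_mean[i] is the mean of the `window` pitches before pitch i,
    rounded to one decimal as in the table, or None when fewer are available.
    """
    previous_speed = [None] + speeds[:-1]
    rolling_mean = []
    for i in range(len(speeds)):
        if i < window:
            rolling_mean.append(None)
        else:
            past = speeds[i - window:i]          # excludes the current pitch
            rolling_mean.append(round(sum(past) / window, 1))
    return previous_speed, rolling_mean


# Speeds of at-bat ID=1 from the table.
prev, roll = lag_and_rolling([143.9, 140.2, 130.9, 90.4, 124.3])
# prev -> [None, 143.9, 140.2, 130.9, 90.4]
# roll -> [None, None, None, 138.3, 120.5]
```

Because only earlier pitches enter each value, the features are available at prediction time without using the pitch being predicted.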
In the example shown in FIG. 18, for the pitch speed to be predicted, the previous-pitch speed, the last-three-pitch average speed, and the previous-pitch type are presented as intra-session feature data, and the pitcher's previous average pitch speed is presented as inter-session feature data.
From a presentation such as that in FIG. 18, the user can see that, for generating a machine learning model that predicts the pitch speed, the previous-pitch speed, the last-three-pitch average speed, and the previous-pitch type have been proposed as intra-session feature data, and the pitcher's previous average pitch speed has been proposed as inter-session feature data.
In addition, because the effectiveness score is presented, the user can gauge to some extent the accuracy to be expected from predictions made with a machine learning model generated from the feature data.
As a result, simply by inputting flow data and specifying the column indicating time, the column indicating the session unit, and the prediction target column, the feature data needed to generate a machine learning model can be generated.
If, on reviewing the feature data presented in FIG. 18, the effectiveness score is low and the presented feature data are judged insufficient for generating a machine learning model, the user may, for example, re-specify the column indicating time and the column indicating the session unit for the flow data and have the feature data generated again, or may use other flow data.
<Feature data generation process>
Next, the feature data generation process implemented by the functions of the UI control unit 61 and the data processing unit 62 in FIG. 4 will be described with reference to the flowchart in FIG. 19.
In step S31, the flow data input unit 101 accepts input of flow data and outputs it to the generated feature visualization unit 103 and the data processing unit 62.
In step S32, the column estimation unit 121 of the data processing unit 62 analyzes the flow data, estimates the columns that make up the flow data, and outputs the estimation result to the UI control unit 61.
In step S33, on obtaining the column estimation result for the flow data, the task setting unit 102 generates and presents a UI, such as the display image PV described with reference to FIG. 6, that shows the estimation result and prompts the user to enter, as task settings, the session unit column, the time unit column, and the prediction target.
The task setting unit 102 then accepts the user's input and outputs the session unit column, the time unit column, and the prediction target entered as task settings to the data processing unit 62.
At this time, the task setting unit 102 also presents in the UI a prompt for the prediction frequency and prediction time of the prediction target column as further task settings, accepts that input as well, and outputs it to the data processing unit 62.
In step S34, the output format determination unit 122 determines the output format to be read from the flow data on the basis of the session unit column, the time unit column, and the prediction target supplied as task settings, and outputs it to the generation source selection unit 123.
In step S35, the generation source selection unit 123 extracts series data from the flow data according to the output format and executes the generation source selection process. Of the series data extracted from the flow data on the basis of the output format, it selects the series data that are highly effective for predicting the prediction target and outputs them to the intra-session feature generation unit 124.
The generation source selection process will be described in detail later with reference to the flowchart in FIG. 20.
In step S36, the intra-session feature generation unit 124 executes the intra-session feature generation process, generates intra-session feature data from the selected series data, and outputs the data to the feature selection unit 125.
The intra-session feature generation process will be described in detail later with reference to the flowchart in FIG. 21.
In step S37, the intra-session feature selection unit 141 of the feature selection unit 125 controls the effectiveness score calculation unit 143 so that it calculates, for each feature in the supplied intra-session feature data, an effectiveness score for predicting the prediction target, and outputs the calculated scores to the intra-session feature selection unit 141 and to the loop determination unit 129.
In step S38, the intra-session feature selection unit 141 selects, from the features in the intra-session feature data, those whose effectiveness score is higher than a predetermined score threshold as effective features, excludes the others, reconstructs the intra-session feature data from the effective features, and outputs the result to the inter-session feature generation unit 126 and the combining unit 127.
In step S39, on acquiring the intra-session feature data supplied from the feature selection unit 125, the inter-session feature generation unit 126 stores the data, generates inter-session feature data using the intra-session feature data of other sessions, and outputs the result to the feature selection unit 125.
In step S40, the inter-session feature selection unit 142 of the feature selection unit 125 controls the effectiveness score calculation unit 143 so that it calculates, for each feature in the supplied inter-session feature data, an effectiveness score for predicting the prediction target, and outputs the calculated scores to the inter-session feature selection unit 142 and to the loop determination unit 129.
In step S41, the inter-session feature selection unit 142 selects, from the features in the inter-session feature data, those whose effectiveness score is higher than the predetermined score threshold as effective features, excludes the others, reconstructs the inter-session feature data from the effective features, and outputs the result to the combining unit 127.
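The selection in steps S38 and S41 amounts to scoring each feature and keeping those above the score threshold. The sketch below illustrates only that filtering step. This passage does not state how the effectiveness score is computed, so the absolute Pearson correlation used here is a stand-in chosen for illustration, and the names `abs_pearson` and `select_effective` are hypothetical.

```python
import math


def abs_pearson(xs, ys):
    """Absolute Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return abs(sxy) / math.sqrt(sxx * syy)


def select_effective(features, target, score_threshold):
    """Keep features whose (stand-in) effectiveness score exceeds the threshold.

    features: {name: list of values aligned with target}
    Returns (kept_features, scores) so the scores can also be passed on,
    as the scores are passed to the loop determination unit in the text.
    """
    scores = {name: abs_pearson(col, target) for name, col in features.items()}
    kept = {name: col for name, col in features.items()
            if scores[name] > score_threshold}
    return kept, scores
```

Any other per-feature score, for example the accuracy gain of a model trained with the feature, could be substituted for `abs_pearson` without changing the thresholding logic.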
In step S42, the combining unit 127 combines the intra-session feature data and the inter-session feature data to generate feature data, and stores the generated feature data in the feature data storage 128.
In step S43, the loop determination unit 129 calculates an overall effectiveness score for the feature data stored in the feature data storage 128 from the per-feature effectiveness scores of the corresponding intra-session feature data and inter-session feature data, and determines whether the overall score is equal to or greater than a predetermined value or whether a predetermined time has elapsed since the start of processing.
If it is determined in step S43 that the overall effectiveness score of the feature data is smaller than the predetermined value and that the predetermined time has not elapsed since the start of processing, the process proceeds to step S44.
In step S44, the loop determination unit 129 controls the generation source selection unit 123 and the feature selection unit 125 so that the accuracy threshold used in the generation source selection process and the score threshold applied to the effectiveness scores are lowered from their current values. The process then returns to step S35, and the feature data generation process is executed again.
That is, if in step S43 the overall effectiveness score of the feature data is smaller than the predetermined value and the predetermined time has not elapsed since the start of processing, some of the excluded series data and features may still be useful, so the accuracy threshold and the score threshold are each lowered by a predetermined amount and the feature data are generated again.
In this case, however, the feature data generated so far remain stored in the feature data storage 128 and continue to be valid. From then on, features already generated as feature data are treated as generated, and the series data and features that were excluded as generation sources in earlier passes are restored so that feature data are generated again. For example, in the effectiveness score calculation in the feature selection unit 125, machine learning models may be created using the features stored in the feature data storage 128 and using their union with the newly generated features, and the resulting improvement in accuracy may be calculated as the new effectiveness score.
If it is determined in step S43 that the overall effectiveness score of the feature data is equal to or greater than the predetermined value, or that the predetermined time has elapsed since the start of processing, the process proceeds to step S45.
In step S45, the loop determination unit 129 reads, from the feature data stored in the feature data storage 128, the feature data with the highest overall effectiveness score, outputs them to the UI control unit 61 for presentation to the user, and also outputs them to the machine learning model generation unit 63.
In response, the generated feature visualization unit 103 of the UI control unit 61 generates a UI based on the flow data and the feature data and presents it to the user.
If, in the first pass, step S43 determines that the predetermined time has elapsed since the start of processing while the overall effectiveness score of the feature data is still smaller than the predetermined value, the effectiveness score of the feature data is insufficient and the prediction accuracy of a machine learning model generated from the feature data may be inadequate. In that case, the generated feature visualization unit 103 may present the current effectiveness score together with a notice that the prediction accuracy may be insufficient with the current feature data.
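The control flow of steps S35 to S45 (generate, score, relax the thresholds, stop on a sufficient score or on timeout) can be sketched as follows. This is an illustration under assumptions: `build_features` and `overall_score` are hypothetical callables standing in for steps S35 to S42 and the scoring in S43, and the fixed `step` decrement is a simplification of lowering the thresholds by a predetermined amount.

```python
import time


def generate_with_relaxation(build_features, overall_score,
                             accuracy_threshold, score_threshold,
                             target_score, time_budget_s, step=0.1):
    """Repeat feature generation with relaxed thresholds (sketch of S35-S45).

    build_features(accuracy_threshold, score_threshold) -> feature data (S35-S42)
    overall_score(feature_data) -> float (S43)
    Returns (best_score, best_feature_data) among all stored results (S45).
    """
    storage = []                       # plays the role of feature data storage 128
    start = time.monotonic()
    while True:
        feats = build_features(accuracy_threshold, score_threshold)
        score = overall_score(feats)
        storage.append((score, feats))  # earlier results stay stored and valid
        timed_out = time.monotonic() - start >= time_budget_s
        if score >= target_score or timed_out:
            return max(storage, key=lambda item: item[0])
        # S44: lower both thresholds so excluded candidates can be reconsidered.
        accuracy_threshold -= step
        score_threshold -= step
```

The time budget bounds the loop even when no threshold setting reaches the target score, which mirrors the timeout branch described for step S43.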
<Generation source selection process>
Next, the generation source selection process performed by the generation source selection unit 123 will be described with reference to the flowchart in FIG. 20.
In step S71, the generation source selection unit 123 excludes, from the series data extracted from the flow data on the basis of the output format, series data that appear irrelevant to predicting the prediction target, such as time-series data that do not change over time.
In step S72, the generation source selection unit 123 obtains partial series for each series and creates a feature table made up of predetermined statistics.
In step S73, the generation source selection unit 123 generates, for each series, a prediction model that predicts the prediction target on the basis of the feature table.
In step S74, the generation source selection unit 123 calculates, for each series, the prediction accuracy of the predictions made by the prediction model.
In step S75, the generation source selection unit 123 selects the series whose prediction accuracy is higher than a predetermined accuracy threshold as generation sources for intra-session features, and outputs them to the intra-session feature generation unit 124.
That is, through the above process, of the series data extracted from the flow data on the basis of the output format, the series data that are highly effective for predicting the prediction target can be selected as the generation sources for intra-session features and output to the intra-session feature generation unit 124.
As a result, the prediction accuracy of the machine learning model generated by machine learning from feature data consisting of intra-session feature data and inter-session feature data can be improved.
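Steps S71 to S75 can be sketched in a few lines. The choices below are assumptions made for illustration only: the statistic is the mean of the preceding `window` values, the per-series prediction model is a one-variable least-squares line, and the accuracy is the in-sample R² of that line. The source does not specify the statistics, the model, or the accuracy measure, and the function names are hypothetical.

```python
def r_squared_of_linear_fit(xs, ys):
    """In-sample R^2 of a one-variable least-squares line (equals squared correlation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


def select_sources(series_by_name, target, accuracy_threshold, window=2):
    """Return names of series worth using as feature generation sources."""
    selected = []
    for name, series in series_by_name.items():
        # S71: drop series that never change.
        if len(set(series)) <= 1:
            continue
        # S72: statistic of the partial series preceding each time step.
        stats = [sum(series[t - window:t]) / window
                 for t in range(window, len(series))]
        # S73-S74: fit a simple model and measure its accuracy.
        accuracy = r_squared_of_linear_fit(stats, target[window:])
        # S75: keep series whose accuracy exceeds the threshold.
        if accuracy > accuracy_threshold:
            selected.append(name)
    return selected
```

In practice a held-out split would give a fairer accuracy estimate than the in-sample value used in this sketch.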
<Intra-session feature data generation process>
Next, the intra-session feature data generation process performed by the intra-session feature generation unit 124 will be described with reference to the flowchart in FIG. 21.
In step S91, the intra-session feature generation unit 124 controls the metadata extraction unit 124a so that it extracts and generates metadata from the flow data.
In step S92, the intra-session feature generation unit 124 uses the estimation model 124b to estimate, from the metadata, a method for creating effective features.
In step S93, the intra-session feature generation unit 124 generates features according to the creation method estimated by the estimation model 124b, using the series data supplied from the generation source selection unit 123 as generation sources for intra-session features. It then generates intra-session feature data from the generated features and outputs the data to the feature selection unit 125.
Through the above process, the features that make up the intra-session feature data are generated from series data that the generation source selection unit 123 has judged effective for predicting the prediction target, among the series data extracted from the flow data, and are generated by the creation method for effective features estimated from the metadata generated from the flow data.
As described above, the feature selection unit 125 further calculates an effectiveness score for each feature in the intra-session feature data, and only those features whose effectiveness score is higher than the predetermined score threshold are selected to form the intra-session feature data.
Inter-session feature data are then generated from these intra-session feature data, and among the features that make up the inter-session feature data, those whose effectiveness score is higher than the predetermined score threshold are selected to reconstruct the inter-session feature data.
That is, intra-session feature data are formed from features with high effectiveness scores for predicting the prediction target, inter-session feature data are generated from those intra-session feature data, and features are then selected by effectiveness score to form the final inter-session feature data.
Because the intra-session feature data and the inter-session feature data generated in this way are combined into the feature data, feature data that are highly effective for predicting the prediction target can be generated.
In addition, while the overall effectiveness score of the generated feature data has not reached the predetermined value, and so is not yet deemed sufficient for predicting the prediction target, and the set processing time has not elapsed, some of the excluded series data and features may still be useful. The accuracy threshold and the score threshold are therefore each lowered by a predetermined amount and the feature data are generated again.
As a result, more feature data that are highly effective for generating a machine learning model that predicts the prediction target can be generated.
<Modified example>
By clustering sessions, a set at a level above the configured sessions may be created, for example.
For example, when sessions are set, the sessions may be clustered in advance, a superset above the sessions may be set, and sessions may be set for each such superset.
For example, time-series data may be decomposed into shapelets, the resulting set of characteristic partial waveforms may be discretized, each discretized partial waveform may be treated as a word and each time series or session as a document, and TF-IDF (Term Frequency-Inverse Document Frequency) values may be calculated to obtain the superset of sessions.
As an example, consider the case where sessions FW1 to FW3 exist as shown in the upper part of FIG. 22.
Here, session FW1 is regarded as a set of the characteristic partial waveforms PW1-1, PW2-1, and PW3-1, session FW2 as a set of the characteristic partial waveforms PW1-11, PW3-11, and PW3-12, and session FW3 as a set of the characteristic partial waveforms PW2-21 and PW1-21. Each is discretized, and TF-IDF is computed over the partial waveforms.
In the lower part of FIG. 22, the TF-IDF values of (PW1, PW2, PW3) are (0, 0.1353, 0.1353) for session FW1, (0, 0, 0.2706) for session FW2, and (0, 0.2050, 0) for session FW3.
Sessions may then be clustered using the vector of TF-IDF values for each session, so that highly similar sessions are placed in the same class and a superset is set.
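The TF-IDF step on FIG. 22 can be sketched with the discretized waveform labels as tokens. The weighting used here, term frequency as count divided by session length multiplied by ln(N/df), is one common definition and is an assumption. With it the vectors come out close to, but not identical with, the values quoted for FIG. 22, so the figure may use a slightly different formula. The function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf(sessions, vocab):
    """TF-IDF vector per session, with tf = count / length and idf = ln(N / df)."""
    n = len(sessions)
    df = {w: sum(1 for toks in sessions.values() if w in toks) for w in vocab}
    vectors = {}
    for sid, toks in sessions.items():
        counts = Counter(toks)
        vectors[sid] = [
            (counts[w] / len(toks)) * math.log(n / df[w]) if df[w] else 0.0
            for w in vocab
        ]
    return vectors


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


# Discretized partial waveforms of sessions FW1 to FW3 (from FIG. 22).
sessions = {
    "FW1": ["PW1", "PW2", "PW3"],
    "FW2": ["PW1", "PW3", "PW3"],
    "FW3": ["PW2", "PW1"],
}
vectors = tf_idf(sessions, ["PW1", "PW2", "PW3"])
```

PW1 occurs in every session, so its weight is zero everywhere, and the similarity between sessions is driven by PW2 and PW3 alone. A clustering method of choice can then group sessions by `cosine` similarity of these vectors.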
Alternatively, when the at-bat ID is set as the session and each session occupies one row, as shown in the left part of FIG. 23, metadata may be extracted from the flow data and, on the basis of the extracted metadata, the at-bat IDs (sessions) may be clustered using statistics of the attribute data for each at-bat ID, such as the frequency of pitcher IDs. The sessions are thereby grouped, and a new session superset column (the cluster ID column in the figure) may be created.
The right part of FIG. 23 shows an example in which the at-bats, classified by the at-bat ID that defines the session, are clustered by opposing pitcher using the pitcher ID extracted as metadata of the flow data, so that the cluster IDs are A, B, A from the top of the figure. In this example, the cluster ID therefore corresponds to the pitcher ID.
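The grouping in FIG. 23 can be sketched as assigning each session the most frequent value of an attribute. This is a deliberately simple stand-in for clustering on attribute statistics. The function name `cluster_by_modal_attribute` and the row layout are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_by_modal_attribute(rows, session_key, attr_key):
    """Assign each session a cluster ID equal to its most frequent attribute value.

    rows: list of dicts, one per record of the flow data.
    Returns {session_id: cluster_id}.
    """
    values = defaultdict(list)
    for row in rows:
        values[row[session_key]].append(row[attr_key])
    return {sid: Counter(vals).most_common(1)[0][0]
            for sid, vals in values.items()}


# Three at-bats faced against pitchers A, B, A, as in the right part of FIG. 23.
rows = [
    {"at_bat_id": 1, "pitcher_id": "A"},
    {"at_bat_id": 2, "pitcher_id": "B"},
    {"at_bat_id": 3, "pitcher_id": "A"},
]
cluster_ids = cluster_by_modal_attribute(rows, "at_bat_id", "pitcher_id")
# cluster_ids -> {1: "A", 2: "B", 3: "A"}
```

The resulting mapping can be written back as a new column and used as the session superset when generating features across related sessions.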
<<3. Example of execution by software>>
The series of processes described above can be executed by hardware, and can also be executed by software. When the series of processes is executed by software, the program constituting the software is installed from a recording medium onto a computer built into dedicated hardware, or onto, for example, a general-purpose computer that can execute various functions when various programs are installed.
FIG. 24 shows a configuration example of a general-purpose computer. This computer has a built-in CPU (Central Processing Unit) 1001. An input/output interface 1005 is connected to the CPU 1001 via a bus 1004. A ROM (Read Only Memory) 1002 and a RAM (Random Access Memory) 1003 are connected to the bus 1004.
Connected to the input/output interface 1005 are an input unit 1006 made up of input devices such as a keyboard and a mouse with which the user enters operation commands, an output unit 1007 that outputs processing operation screens and images of processing results to a display device, a storage unit 1008 made up of a hard disk drive or the like that stores programs and various data, and a communication unit 1009 made up of a LAN (Local Area Network) adapter or the like that executes communication processing via a network such as the Internet. Also connected is a drive 1010 that reads and writes data to and from a removable storage medium 1011 such as a magnetic disk (including a flexible disk), an optical disc (including a CD-ROM (Compact Disc-Read Only Memory) and a DVD (Digital Versatile Disc)), a magneto-optical disk (including an MD (Mini Disc)), or a semiconductor memory.
The CPU 1001 executes various processes according to a program stored in the ROM 1002, or according to a program that has been read from a removable storage medium 1011 such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, installed in the storage unit 1008, and loaded from the storage unit 1008 into the RAM 1003. The RAM 1003 also stores, as appropriate, data that the CPU 1001 needs in order to execute the various processes.
In the computer configured as described above, the series of processes described above is performed when the CPU 1001, for example, loads a program stored in the storage unit 1008 into the RAM 1003 via the input/output interface 1005 and the bus 1004 and executes it.
 コンピュータ(CPU1001)が実行するプログラムは、例えば、パッケージメディア等としてのリムーバブル記憶媒体1011に記録して提供することができる。また、プログラムは、ローカルエリアネットワーク、インターネット、デジタル衛星放送といった、有線または無線の伝送媒体を介して提供することができる。 A program executed by the computer (CPU 1001) can be provided by being recorded on a removable storage medium 1011 such as a package medium, for example. Additionally, programs may be provided via wired or wireless transmission media, such as local area networks, the Internet, and digital satellite broadcasts.
 In the computer, the program can be installed in the storage unit 1008 via the input/output interface 1005 by loading the removable storage medium 1011 into the drive 1010. The program can also be received by the communication unit 1009 via a wired or wireless transmission medium and installed in the storage unit 1008. Alternatively, the program can be installed in advance in the ROM 1002 or the storage unit 1008.
 Note that the program executed by the computer may be a program whose processes are performed in time series in the order described in this specification, or a program whose processes are performed in parallel or at necessary timing, such as when a call is made.
 Note that the CPU 1001 in FIG. 24 implements the functions of the control unit 51 of the information processing device 31 in FIG. 2.
 In this specification, a system means a set of multiple components (devices, modules (parts), and the like), regardless of whether all the components are in the same housing. Accordingly, multiple devices housed in separate housings and connected via a network, and a single device in which multiple modules are housed in one housing, are both systems.
 Note that embodiments of the present disclosure are not limited to the embodiments described above, and various modifications can be made without departing from the gist of the present disclosure.
 For example, the present disclosure can take a cloud computing configuration in which one function is shared among, and jointly processed by, multiple devices via a network.
 Each step described in the flowcharts above can be executed by one device or shared among and executed by multiple devices.
 Furthermore, when one step includes multiple processes, the multiple processes included in that step can be executed by one device or shared among and executed by multiple devices.
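 As a rough illustration of sharing one step among several executors, the following sketch maps a per-session process over a pool of worker threads. The worker threads merely stand in for the multiple devices mentioned above; the function names `process_session` and `run_shared`, and the use of a simple mean as the per-session process, are assumptions made for this example and do not appear in the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor


def process_session(item):
    """Hypothetical per-session process: return the mean of one series."""
    session_id, series = item
    return session_id, sum(series) / len(series)


def run_shared(sessions, n_workers=2):
    """Share the same step among n_workers workers (stand-ins for devices)."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves input order, so results line up with sessions.
        return dict(pool.map(process_session, sessions.items()))


result = run_shared({"s1": [1, 2, 3], "s2": [4, 6]})
```

 In an actual distributed configuration the workers would be separate machines reached over a network, so the thread pool here only illustrates how the work is divided, not how it is transported.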
 Note that the present disclosure can also take the following configurations.
<1> An information processing device comprising:
 a metadata generation unit that generates metadata of flow data including at least time-series data;
 an estimation unit that estimates, based on the metadata, a method of generating features from series data constituting the flow data; and
 a feature generation unit that generates features from the series data by the generation method estimated by the estimation unit.
<2> The information processing device according to <1>, further including a setting unit that accepts settings of a session unit, a time unit, and a prediction target in the flow data,
 wherein the metadata generation unit generates the metadata from the series data extracted from the flow data in accordance with the settings of the session unit, the time unit, and the prediction target.
<3> The information processing device according to <2>, further including a column estimation unit that estimates the columns constituting the flow data,
 wherein the setting unit generates and presents a UI (User Interface) image that presents the columns estimated by the column estimation unit and prompts, on a per-column basis, setting of the columns for the session unit, the time unit, and the prediction target in the flow data, and accepts the settings of the columns for the session unit, the time unit, and the prediction target based on the UI image.
<4> The information processing device according to <2>, further including an output format determination unit that determines an output format of the series data to be extracted from the flow data, based on the session unit, the time unit, and the prediction target in the flow data set by the setting unit,
 wherein the metadata generation unit generates the metadata from the series data extracted from the flow data based on the output format determined in accordance with the settings of the session unit, the time unit, and the prediction target.
<5> The information processing device according to <4>, further comprising a selection unit that obtains, for each piece of series data extracted from the flow data based on the output format, a prediction accuracy for predicting the prediction target, and selects the series data whose prediction accuracy is higher than a predetermined accuracy threshold,
 wherein the metadata generation unit generates the metadata from the series data selected by the selection unit from among the series data extracted from the flow data based on the output format.
<6> The information processing device according to <5>, wherein the selection unit obtains, for each piece of series data extracted from the flow data based on the output format, a feature for each partial series, predicts the prediction target by inputting the features of the partial series into a prediction model for predicting the prediction target, obtains the prediction accuracy for each piece of series data from a comparison between the prediction target and the prediction result of the prediction model, and selects the series data whose prediction accuracy is higher than the predetermined accuracy threshold.
<7> The information processing device according to <2>, wherein the feature generation unit generates features from the series data by the generation method estimated by the estimation unit, and generates intra-session features based on the generated features of the session unit.
<8> The information processing device according to <7>, further including:
 an effectiveness score calculation unit that calculates, for each of the features constituting the intra-session features, an effectiveness score with respect to prediction of the prediction target; and
 an intra-session feature selection unit that selects, from among the features constituting the intra-session features, the features whose effectiveness score is higher than a predetermined score threshold, and reconstructs the intra-session features.
<9> The information processing device according to <8>, further including an inter-session feature generation unit that generates, based on the intra-session features, inter-session features including features between sessions.
<10> The information processing device according to <9>, wherein an effectiveness score with respect to prediction of the prediction target is also calculated for each of the features constituting the inter-session features,
 the device further including an inter-session feature selection unit that selects, from among the features constituting the inter-session features, the features whose effectiveness score is higher than a predetermined score threshold, and reconstructs the inter-session features.
<11> The information processing device according to <10>, wherein the effectiveness score calculation unit calculates, as the effectiveness score, the mutual information between the prediction target and each of the features constituting the intra-session features and the inter-session features.
<12> The information processing device according to <10>, wherein the effectiveness score calculation unit calculates, as the effectiveness score, the prediction accuracy with which a machine learning model generated in a simplified manner from the features constituting the intra-session features and the inter-session features predicts the prediction target,
 the intra-session feature selection unit selects a subset of the features for which the effectiveness score does not fall below a predetermined score threshold, and reconstructs the intra-session features, and
 the inter-session feature selection unit selects a subset of the features for which the effectiveness score does not fall below a predetermined score threshold, and reconstructs the inter-session features.
<13> The information processing device according to <10>, further including:
 a combining unit that combines the reconstructed intra-session features and the reconstructed inter-session features; and
 a determination unit that calculates an overall effectiveness score of the features combined by the combining unit, based on the effectiveness scores of the individual features of the reconstructed intra-session features and the reconstructed inter-session features, and determines whether the overall effectiveness score is smaller than a predetermined threshold,
 wherein, when the overall effectiveness score is smaller than the predetermined threshold, the determination unit lowers the score threshold by a predetermined value and causes the processing by the intra-session feature selection unit and the inter-session feature selection unit to be executed again.
<14> The information processing device according to any one of <1> to <13>, wherein the estimation unit is an estimation model generated by learning based on paired information consisting of the metadata of the flow data and the distribution of the generation methods of the features that are generated from series data extracted from the flow data and were used to train a predetermined machine learning model, and estimates the method of generating the features based on the metadata.
<15> The information processing device according to any one of <1> to <14>, wherein the flow data further includes, in addition to the time-series data, which changes over time, attribute data consisting of data that does not change over time.
<16> An information processing method including the steps of:
 generating metadata of flow data including at least time-series data;
 estimating, based on the metadata, a method of generating features from series data constituting the flow data; and
 generating features from the series data by the estimated generation method.
<17> A program that causes a computer to function as:
 a metadata generation unit that generates metadata of flow data including at least time-series data;
 an estimation unit that estimates, based on the metadata, a method of generating features from series data constituting the flow data; and
 a feature generation unit that generates features from the series data by the generation method estimated by the estimation unit.
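 The three-stage flow of configuration <1> (generate metadata, estimate a generation method from the metadata, then generate features by that method) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the disclosure: the metadata fields, the `AGGREGATORS` table, and in particular the fixed rules inside `estimate_generation_methods` are hypothetical. Configuration <14> describes the estimation unit as a learned estimation model, which the rules below merely stand in for.

```python
from statistics import mean, pstdev

# Hypothetical aggregation functions that a generation method can name.
AGGREGATORS = {
    "mean": mean,
    "std": pstdev,
    "max": max,
    "min": min,
    "last": lambda xs: xs[-1],
    "count": len,
}


def generate_metadata(series):
    """Metadata generation: summarise one series (a list of numbers)."""
    return {
        "length": len(series),
        "n_unique": len(set(series)),
        "std": pstdev(series),
    }


def estimate_generation_methods(metadata):
    """Estimation: map metadata to aggregator names.

    Assumption: fixed rules replace the learned estimation model here.
    """
    if metadata["n_unique"] <= 2:
        return ["count", "last"]  # binary or near-constant series
    if metadata["std"] > 1.0:
        return ["mean", "std", "max", "min"]
    return ["mean", "last"]


def generate_features(series, methods):
    """Feature generation: apply the estimated methods to the series."""
    return {name: AGGREGATORS[name](series) for name in methods}


# One series per session; the session IDs and values are made up.
sessions = {"s1": [1.0, 5.0, 9.0], "s2": [0.0, 0.0, 1.0]}
features = {}
for session_id, series in sessions.items():
    meta = generate_metadata(series)
    methods = estimate_generation_methods(meta)
    features[session_id] = generate_features(series, methods)
```

 Because the estimator sees only the metadata and never the raw series, swapping the rules for a trained model would leave the other two stages unchanged.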
 31 Information processing device, 61 UI control unit, 62 Data processing unit, 63 Machine learning model generation unit, 101 Flow data input unit, 102 Task setting unit, 103 Generated feature visualization unit, 121 Column estimation unit, 122 Output format determination unit, 123 Generation source selection unit, 124 Intra-session feature generation unit, 124a Metadata extraction unit, 124b Estimation model, 125 Feature selection unit, 126 Inter-session feature generation unit, 127 Combining unit, 128 Feature data storage, 129 Loop determination unit

Claims (17)

  1.  An information processing device comprising:
     a metadata generation unit that generates metadata of flow data including at least time-series data;
     an estimation unit that estimates, based on the metadata, a method of generating features from series data constituting the flow data; and
     a feature generation unit that generates features from the series data by the generation method estimated by the estimation unit.
  2.  The information processing device according to claim 1, further including a setting unit that accepts settings of a session unit, a time unit, and a prediction target in the flow data,
     wherein the metadata generation unit generates the metadata from the series data extracted from the flow data in accordance with the settings of the session unit, the time unit, and the prediction target.
  3.  The information processing device according to claim 2, further including a column estimation unit that estimates the columns constituting the flow data,
     wherein the setting unit generates and presents a UI (User Interface) image that presents the columns estimated by the column estimation unit and prompts, on a per-column basis, setting of the columns for the session unit, the time unit, and the prediction target in the flow data, and accepts the settings of the columns for the session unit, the time unit, and the prediction target based on the UI image.
  4.  The information processing device according to claim 2, further including an output format determination unit that determines an output format of the series data to be extracted from the flow data, based on the session unit, the time unit, and the prediction target in the flow data set by the setting unit,
     wherein the metadata generation unit generates the metadata from the series data extracted from the flow data based on the output format determined in accordance with the settings of the session unit, the time unit, and the prediction target.
  5.  The information processing device according to claim 4, further comprising a selection unit that obtains, for each piece of series data extracted from the flow data based on the output format, a prediction accuracy for predicting the prediction target, and selects the series data whose prediction accuracy is higher than a predetermined accuracy threshold,
     wherein the metadata generation unit generates the metadata from the series data selected by the selection unit from among the series data extracted from the flow data based on the output format.
  6.  The information processing device according to claim 5, wherein the selection unit obtains, for each piece of series data extracted from the flow data based on the output format, a feature for each partial series, predicts the prediction target by inputting the features of the partial series into a prediction model for predicting the prediction target, obtains the prediction accuracy for each piece of series data from a comparison between the prediction target and the prediction result of the prediction model, and selects the series data whose prediction accuracy is higher than the predetermined accuracy threshold.
  7.  The information processing device according to claim 2, wherein the feature generation unit generates features from the series data by the generation method estimated by the estimation unit, and generates intra-session features based on the generated features of the session unit.
  8.  The information processing device according to claim 7, further including:
     an effectiveness score calculation unit that calculates, for each of the features constituting the intra-session features, an effectiveness score with respect to prediction of the prediction target; and
     an intra-session feature selection unit that selects, from among the features constituting the intra-session features, the features whose effectiveness score is higher than a predetermined score threshold, and reconstructs the intra-session features.
  9.  The information processing device according to claim 8, further including an inter-session feature generation unit that generates, based on the intra-session features, inter-session features including features between sessions.
  10.  The information processing device according to claim 9, wherein an effectiveness score with respect to prediction of the prediction target is also calculated for each of the features constituting the inter-session features,
     the device further including an inter-session feature selection unit that selects, from among the features constituting the inter-session features, the features whose effectiveness score is higher than a predetermined score threshold, and reconstructs the inter-session features.
  11.  The information processing device according to claim 10, wherein the effectiveness score calculation unit calculates, as the effectiveness score, the mutual information between the prediction target and each of the features constituting the intra-session features and the inter-session features.
  12.  The information processing device according to claim 10, wherein the effectiveness score calculation unit calculates, as the effectiveness score, the prediction accuracy with which a machine learning model generated in a simplified manner from the features constituting the intra-session features and the inter-session features predicts the prediction target,
     the intra-session feature selection unit selects a subset of the features for which the effectiveness score does not fall below a predetermined score threshold, and reconstructs the intra-session features, and
     the inter-session feature selection unit selects a subset of the features for which the effectiveness score does not fall below a predetermined score threshold, and reconstructs the inter-session features.
  13.  The information processing device according to claim 10, further including:
     a combining unit that combines the reconstructed intra-session features and the reconstructed inter-session features; and
     a determination unit that calculates an overall effectiveness score of the features combined by the combining unit, based on the effectiveness scores of the individual features of the reconstructed intra-session features and the reconstructed inter-session features, and determines whether the overall effectiveness score is smaller than a predetermined threshold,
     wherein, when the overall effectiveness score is smaller than the predetermined threshold, the determination unit lowers the score threshold by a predetermined value and causes the processing by the intra-session feature selection unit and the inter-session feature selection unit to be executed again.
  14.  The information processing device according to claim 1, wherein the estimation unit is an estimation model generated by learning based on paired information consisting of the metadata of the flow data and the distribution of the generation methods of the features that are generated from series data extracted from the flow data and were used to train a predetermined machine learning model, and estimates the method of generating the features based on the metadata.
  15.  The information processing device according to claim 1, wherein the flow data further includes, in addition to the time-series data, which changes over time, attribute data consisting of data that does not change over time.
  16.  An information processing method including the steps of:
     generating metadata of flow data including at least time-series data;
     estimating, based on the metadata, a method of generating features from series data constituting the flow data; and
     generating features from the series data by the estimated generation method.
  17.  A program that causes a computer to function as:
     a metadata generation unit that generates metadata of flow data including at least time-series data;
     an estimation unit that estimates, based on the metadata, a method of generating features from series data constituting the flow data; and
     a feature generation unit that generates features from the series data by the generation method estimated by the estimation unit.
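The score-based selection recited in claims 8, 11, and 13 can be sketched as follows: each feature is scored by its mutual information with the prediction target, features above a score threshold are kept, and the threshold is lowered and selection repeated while the overall score is too small. This is a sketch under stated assumptions rather than the claimed implementation. The claims do not say how the overall score is derived from the individual scores, so the sum used here is an assumption, as are the discrete-valued features, the `min_threshold` stopping guard, and all function and parameter names.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def select_features(columns, target, score_threshold,
                    overall_threshold, step, min_threshold=0.0):
    """Keep features scoring above score_threshold; relax it if needed.

    Assumption: the overall score is the sum of the selected scores.
    """
    scores = {name: mutual_information(col, target)
              for name, col in columns.items()}
    while True:
        selected = [n for n, s in scores.items() if s > score_threshold]
        overall = sum(scores[n] for n in selected)
        # Stop once the overall score suffices or the threshold bottoms out.
        if overall >= overall_threshold or score_threshold <= min_threshold:
            return selected, score_threshold
        score_threshold = max(min_threshold, score_threshold - step)


# Made-up discrete features and prediction target.
target = [0, 0, 1, 1]
columns = {
    "a": [0, 0, 1, 1],  # identical to the target: 1 bit
    "b": [0, 1, 0, 1],  # independent of the target: 0 bits
    "c": [0, 0, 0, 1],  # partially informative
}
selected, final_threshold = select_features(
    columns, target, score_threshold=0.5, overall_threshold=1.2, step=0.25
)
```

With these inputs only feature `a` clears the initial threshold of 0.5, the overall score falls short of 1.2, and the threshold is lowered once to 0.25, at which point `c` is also selected.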
PCT/JP2023/029935 2022-09-06 2023-08-21 Information processing device, information processing method, and program WO2024053370A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022-141171 2022-09-06
JP2022141171 2022-09-06

Publications (1)

Publication Number Publication Date
WO2024053370A1 (en) 2024-03-14

Family

ID=90190981

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/029935 WO2024053370A1 (en) 2022-09-06 2023-08-21 Information processing device, information processing method, and program

Country Status (1)

Country Link
WO (1) WO2024053370A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019185751A (en) * 2018-03-30 2019-10-24 株式会社日立製作所 Method of feature quantity preparation, system, and program
JP2021060692A (en) * 2019-10-03 2021-04-15 株式会社東芝 Inference result evaluation system, inference result evaluation device, and method thereof
US20220101190A1 (en) * 2020-09-30 2022-03-31 Alteryx, Inc. System and method of operationalizing automated feature engineering
JP2023061486A (en) * 2021-10-20 2023-05-02 三菱重工業株式会社 Feature quantity generation device, feature quantity generation method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23862910

Country of ref document: EP

Kind code of ref document: A1