JP2018190044A

JP2018190044A - Information processing apparatus and program

Info

Publication number: JP2018190044A
Application number: JP2017089817A
Authority: JP
Inventors: 琢士田原; Takuji Tahara; 軼謳王; Yiou Wang
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2017-04-28
Filing date: 2017-04-28
Publication date: 2018-11-29
Anticipated expiration: 2037-04-28
Also published as: JP6972641B2

Abstract

PROBLEM TO BE SOLVED: To automatically add, to the features used when predicting a prediction target value by machine learning, a new feature item based on a combination of a plurality of related items included in past data. SOLUTION: A feature generation unit 26 generates a feature group for a past data group based on the past data group and feature definition information 22. A prediction unit 28 constructs a prediction model based on the features of a learning data group, which is a part of the past data group, and the actual sales values, and predicts a sales value for each piece of verification data based on the prediction model and a verification data group, which is a part of the past data group. An error calculation unit 30 calculates, for each piece of verification data, the error between the predicted sales value and the actual sales value. A feature combination error calculation unit 34 calculates, for each feature combination defined by a feature combination definition unit 32, the average error of the pieces of verification data that match the feature combination. A new feature item addition unit 36 generates a new feature item based on the specific feature combination having the maximum average error and adds it to the feature definition information 22. SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing apparatus and an information processing program.

Conventionally, prediction target values have been predicted using machine learning. In one known example of such prediction processing, learning (so-called supervised learning) is performed on past data, that is, past actual values for a target item (objective variable) and for a plurality of related items (explanatory variables) related to the target item, to construct a prediction model, and the prediction target value is then predicted based on the prediction model and the values of the related items associated with the prediction target value.

Past data may contain a large number of related items, some of which have little influence on the value of the target item. Therefore, to improve the prediction accuracy of the prediction model, the model is sometimes constructed using the values for items (feature items) selected from the plurality of related items in the past data. The information consisting of these feature items and their values is called a feature.

Techniques for automatically selecting feature items from the plurality of related items in past data have been proposed. For example, Patent Document 1 describes, in order to construct a more effective prediction model, extracting appropriate attributes (feature items) from among a plurality of attributes (related items) based on the weights of a support vector machine and generating a feature function for constructing the prediction model. Patent Document 2 likewise describes processing that extracts, from among a plurality of data items (related items), the combination of data items that yields the highest prediction accuracy.

Patent Document 1: Japanese Patent Laying-Open No. 2005-092681
Patent Document 2: Japanese Patent Laying-Open No. 2015-072644

As described above, a more accurate prediction model can be constructed by selecting appropriate feature items from the plurality of related items in the past data. A still more accurate prediction model may be obtained by using a new feature item defined based on a combination of a plurality of related items. For example, when forecasting the sales of a store from past data whose related items include “month” (which month) and “week” (which week of that month), using the new feature item “whether or not it is the fourth week of December” may yield a more accurate prediction model. Of course, it is essential that new feature items be generated not from indiscriminate combinations of the related items in the past data but from appropriate combinations that improve the prediction accuracy of the prediction model.
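The “fourth week of December” example can be illustrated with a minimal sketch. The `Record` type and the function name are hypothetical and are introduced only for illustration; the sketch assumes that month and week are already available as integer values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical past-data record holding two related items."""
    month: int  # 1-12
    week: int   # week of the month, 1-5


def is_fourth_week_of_december(rec: Record) -> int:
    """New feature item: 1 if the record is in the 4th week of December, else 0."""
    return int(rec.month == 12 and rec.week == 4)


records = [Record(12, 4), Record(12, 3), Record(4, 4)]
new_feature_values = [is_fourth_week_of_december(r) for r in records]
```

The new item is a conjunction of two (item, value) conditions, so it can separate a period that neither “month” nor “week” distinguishes on its own.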

An object of the present invention is to automatically add, to the features used when predicting a prediction target value by machine learning, a new feature item based on a combination of a plurality of related items in the past data.

The invention according to claim 1 is an information processing apparatus comprising: a prediction unit that predicts a predicted value for the target item of each piece of verification data, which is a part of a past data group and is subject to verification, based on a learning data group, which is a part of the past data group containing past actual values for a target item to be predicted and for a related item group related to the target item, and on feature definition information defining a feature item group consisting of candidates, among the related item group, for items to be used in prediction; an error calculation unit that calculates, for each piece of verification data, the error between the predicted value predicted by the prediction unit and the actual value for the target item; a feature combination error calculation unit that calculates, for each of a plurality of feature combinations each consisting of a plurality of feature items selected from the feature item group included in the feature definition information and values for those feature items, a representative value of the errors for the plurality of pieces of verification data that match the feature combination; and a new feature item addition unit that adds, to the feature definition information, a new feature item defined by a specific feature combination identified from among the plurality of feature combinations based on the representative value of the errors corresponding to each feature combination.
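The role of the feature combination error calculation unit can be illustrated with a minimal sketch under stated assumptions: the representative value is taken to be the mean, the error is taken to be the absolute error, combinations are limited to pairs of feature items, and the combination with the largest mean error is selected (as in the abstract). All names and the sample values are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean


def combination_errors(features, errors, items):
    """Mean error of the verification rows matching each pair of (item, value)."""
    grouped = defaultdict(list)
    for row, err in zip(features, errors):
        for a, b in combinations(items, 2):
            grouped[((a, row[a]), (b, row[b]))].append(err)
    return {key: mean(errs) for key, errs in grouped.items()}


def worst_combination(features, errors, items):
    """Feature combination whose matching rows have the largest mean error."""
    scores = combination_errors(features, errors, items)
    return max(scores, key=scores.get)


# Illustrative verification data: features, actual sales, predicted sales.
verification_features = [
    {"month": 12, "week": 4, "holiday": 0},
    {"month": 12, "week": 4, "holiday": 1},
    {"month": 12, "week": 1, "holiday": 0},
    {"month": 4, "week": 4, "holiday": 0},
]
actual = [300.0, 320.0, 100.0, 110.0]
predicted = [120.0, 150.0, 105.0, 100.0]
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

items = ["month", "week", "holiday"]
scores = combination_errors(verification_features, abs_errors, items)
worst = worst_combination(verification_features, abs_errors, items)
```

In this toy data the model misses badly only on rows where month is 12 and week is 4, so that pair has the largest mean error and would become the candidate new feature item.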

The invention according to claim 2 is the information processing apparatus according to claim 1, wherein, after identifying the new feature item, the new feature item addition unit adds the new feature item to the feature definition information when the mean error for the verification data group calculated based on provisional feature definition information, to which the new feature item has been provisionally added, is smaller than the mean error for the verification data group calculated based on the feature definition information before the provisional addition.
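The acceptance condition of claim 2 amounts to a comparison of two mean errors. The following sketch assumes the per-row verification errors before and after the provisional addition have already been computed; the function names, the string encoding of the new item, and the numbers are hypothetical.

```python
from statistics import mean


def accept_new_feature(errors_before, errors_after):
    """Adopt the provisional item only if the mean verification error decreases."""
    return mean(errors_after) < mean(errors_before)


def update_definition(definition, new_item, errors_before, errors_after):
    """Return the feature definition with new_item appended if it is accepted."""
    if accept_new_feature(errors_before, errors_after):
        return definition + [new_item]
    return definition


definition = ["year", "month", "week"]
updated = update_definition(
    definition, "month=12&week=4",
    errors_before=[180.0, 170.0, 5.0, 10.0],
    errors_after=[40.0, 35.0, 6.0, 11.0],
)
rejected = update_definition(
    definition, "month=4&week=1",
    errors_before=[10.0, 10.0],
    errors_after=[10.0, 12.0],
)
```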

The invention according to claim 3 is the information processing apparatus according to claim 1, wherein the plurality of feature items included in the feature item group have a hierarchical relationship, and the feature combination error calculation unit defines the feature combination by combining one feature item with another feature item belonging to a layer adjacent to the layer to which the one feature item belongs in the hierarchical relationship.
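A minimal sketch of the adjacent-layer restriction follows. The hierarchy used here (year, month, week, day of the week, numbered from 0) is an assumption made purely for illustration, as are the function and variable names.

```python
def adjacent_layer_pairs(layer_of):
    """Return pairs of feature items whose hierarchy layers differ by exactly one."""
    items = sorted(layer_of, key=lambda name: (layer_of[name], name))
    return [
        (a, b)
        for i, a in enumerate(items)
        for b in items[i + 1:]
        if layer_of[b] - layer_of[a] == 1
    ]


# Hypothetical hierarchy: a smaller number means a higher layer.
layer_of = {"year": 0, "month": 1, "week": 2, "day_of_week": 3}
pairs = adjacent_layer_pairs(layer_of)
```

With four items, pairing only adjacent layers yields three pairs instead of the six obtained from all pairwise combinations.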

The invention according to claim 4 is the information processing apparatus according to claim 1, wherein the past data group consists of a plurality of pieces of past data arranged in time series, and the feature combination includes a value, for a feature item selected from the feature item group, relating to past data of interest, and a value, for a feature item selected from the feature item group, relating to past data other than the past data of interest.

The invention according to claim 5 is the information processing apparatus according to claim 1, wherein the prediction unit calculates, for each feature item included in the feature item group, a contribution indicating the magnitude of the influence that the value for that feature item has on the value of the target item, and the feature combination error calculation unit, in defining the feature combinations, does not select feature items whose contribution is equal to or less than a threshold.
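The contribution filter of claim 5 can be sketched as follows. How the contribution is computed is not assumed here; the scores, the threshold, and the names are hypothetical values chosen for illustration.

```python
from itertools import combinations


def candidate_pairs(contribution, threshold):
    """Pair up only feature items whose contribution exceeds the threshold."""
    kept = [name for name, score in contribution.items() if score > threshold]
    return list(combinations(kept, 2))


# Hypothetical contribution scores per feature item.
contribution = {"month": 0.30, "week": 0.25, "day_of_week": 0.40, "year": 0.02}
pairs = candidate_pairs(contribution, threshold=0.05)
```

Items at or below the threshold ("year" in this example) never enter a combination, which reduces the number of combinations to evaluate.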

The invention according to claim 6 is the information processing apparatus according to claim 5, wherein the new feature item addition unit identifies the specific feature combination based on the representative value of the errors corresponding to each feature combination and the contributions of the plurality of feature items included in each feature combination.

The invention according to claim 7 is an information processing program causing a computer to function as: a prediction unit that predicts a predicted value for the target item of each piece of verification data, which is a part of a past data group and is subject to verification, based on a learning data group, which is a part of the past data group containing past actual values for a target item to be predicted and for a related item group related to the target item, and on feature definition information defining a feature item group consisting of candidates, among the related item group, for items to be used in prediction; an error calculation unit that calculates, for each piece of verification data, the error between the predicted value predicted by the prediction unit and the actual value for the target item; a feature combination error calculation unit that calculates, for each of a plurality of feature combinations each consisting of a plurality of feature items selected from the feature item group included in the feature definition information and values for those feature items, a representative value of the errors for the plurality of pieces of verification data that match the feature combination; and a new feature item addition unit that adds, to the feature definition information, a new feature item defined by a specific feature combination identified from among the plurality of feature combinations based on the representative value of the errors corresponding to each feature combination.

According to the invention of claim 1 or 7, a new feature item based on a combination of a plurality of related items in the past data can be automatically added to the features used when predicting a prediction target value by machine learning.

According to the invention of claim 2, the new feature item can be added to the features after confirming that prediction accuracy improves when features including the new feature item are used.

According to the invention of claim 3 or 5, the number of feature combinations defined can be reduced compared with the case where combinations among all feature items are defined.

According to the invention of claim 4, a new feature item that can further improve prediction accuracy can be added, compared with the case where the feature combination does not include values relating to past data other than the past data of interest.

According to the invention of claim 6, a new feature item that can further improve prediction accuracy can be added, compared with the case where contributions are not taken into account.

FIG. 1 is a schematic configuration diagram of an information processing apparatus according to the present embodiment.
FIG. 2 is a diagram showing an example of the contents of the past data DB.
FIG. 3 is a diagram showing an example of the contents of the initial feature definition information.
FIG. 4 is a diagram showing an example of the contents of the features.
FIG. 5 is a diagram showing an example of the actual values and predicted values for each piece of verification data.
FIG. 6 is a diagram showing an example of feature combinations.
FIG. 7 is a diagram showing an example of the contents of the updated feature definition information.
FIG. 8 is a flowchart showing the flow of processing of the information processing apparatus according to the first embodiment.
FIG. 9 is a graph showing the mean error over repeated iterations of the new feature item addition process.
FIG. 10 is a diagram showing an example of the hierarchical relationship among the feature items.
FIG. 11 is a diagram showing an example of the contribution of each feature item.

Embodiments of the present invention are described below.

<First Embodiment>
FIG. 1 is a schematic configuration diagram of an information processing apparatus 10 according to the present embodiment. The information processing apparatus 10 may be a general-purpose computer such as a server or a personal computer. As shown in FIG. 1, the information processing apparatus 10 includes a storage unit 12 and a control unit 14. Although not shown in FIG. 1, the information processing apparatus 10 may also include a communication unit, composed of, for example, a network adapter, for communicating with other apparatuses via a communication line such as the Internet; a display unit, composed of, for example, a liquid crystal panel, for displaying the processing results of the information processing apparatus 10 (for example, the predicted values described later); and an input unit, composed of, for example, a mouse, a keyboard, or a touch panel, for receiving instructions from a user.

The information processing apparatus 10 predicts the future based on past performance data. This specification describes, as an example, processing in which the information processing apparatus 10 predicts the sales of a certain store (the prediction target store) on a prediction target date, but what the information processing apparatus 10 predicts is not limited to this.

The storage unit 12 is composed of, for example, a hard disk, a ROM (Read Only Memory), or a RAM (Random Access Memory). The storage unit 12 stores an information processing program for operating each unit of the information processing apparatus 10, as well as various control data and processing data. Further, as shown in FIG. 1, a past data DB 20 is defined in the storage unit 12, and feature definition information 22 is stored there.

The past data DB 20 accumulates past actual values for the target item to be predicted and past actual values for the related item group related to the target item. In the present embodiment, the target item is the sales of the prediction target store, and the past data DB 20 stores the store's past actual sales values. The related item group consists of various items related to sales (described in detail later), and the past data DB 20 stores the actual value for each of them. These data may be stored in the past data DB 20 by the user, or may be collected and stored automatically.

FIG. 2 shows an example of the contents of the past data DB 20. Although FIG. 2 shows the past data DB 20 in table form, its data format is not limited to this. In the present embodiment, data (a record) is added to the past data DB 20 for each day. As shown in FIG. 2, the past data DB 20 stores the actual sales values of the prediction target store together with actual values for related items such as year, month, day, weather, maximum temperature, and minimum temperature. Of course, the related items may include a wide variety of items regardless of how strongly they influence sales, for example day of the week, holiday or weekday, humidity, wind speed, traffic volume in front of the store, average stock price, and exchange rate.

In the present embodiment, one record in the past data DB 20 associates the actual sales value for a given day with that day's actual values for the related item group. In this specification, one record in the past data DB 20 is referred to as “past data”. That is, as past data is accumulated successively, the past data DB 20 comes to hold a past data group. In the present embodiment, the past data DB 20 is assumed to hold three years of past data, from 2014 to 2016, for the prediction target store.

In the past data DB 20 of the present embodiment, each piece of past data is assigned a past data ID that uniquely identifies it.

As described in detail later, the information processing apparatus 10 predicts the sales value of the prediction target store on the prediction target date based on the past data group stored in the past data DB 20. Specifically, a prediction model for predicting the sales value from the values of the related items is constructed by learning the relationship between the actual values for the related items in each piece of past data and the actual sales value, and the predicted sales value is obtained from this prediction model and the values of the related items on the prediction target date.

The feature definition information 22 defines a feature item group consisting of a plurality of feature items that are candidates for the items used to construct the prediction model. A feature item is an item based on the related item group included in the past data. As described above, the related item group may include items that are highly relevant to sales, the target item, as well as items of low relevance. In general, when a prediction model is constructed taking into account even the values of related items of low relevance to the target item, its prediction accuracy is not very good. Conversely, constructing the model from the values of appropriate related items can improve its prediction accuracy. In other words, defining an appropriate feature item group in the feature definition information 22 allows the prediction model to be constructed using appropriate items from the related item group, thereby improving its prediction accuracy. The feature item group defined in the feature definition information 22 is thus a factor that greatly affects the prediction accuracy of the prediction model and, in turn, of the prediction processing in the information processing apparatus 10. In some cases, all of the related items included in the past data may be defined as feature items.

As described in detail later, in the present embodiment a new feature item is automatically added to the feature definition information 22 by the processing of the information processing apparatus 10. In this specification, feature definition information 22 to which no new feature item has yet been added by the information processing apparatus 10 is referred to as “initial feature definition information 22a”, and feature definition information 22 after new feature items have been added by the processing of the information processing apparatus 10 is referred to as “updated feature definition information 22b”.

FIG. 3 shows an example of the contents of the initial feature definition information 22a in the present embodiment. The initial feature definition information 22a is created by the user. As shown in FIG. 3, it includes, as feature items, year, month, day, week, day of the week, and holiday or weekday. Each of these feature items is selected from the related item group of the past data group. The initial feature definition information 22a also defines the values that each feature item can take. For the feature items above, the number of values that can appear in the past data is small to begin with, so those values are defined as-is as the values the feature item can take.

Furthermore, the initial feature definition information 22a also includes feature items such as whether the maximum temperature is 15 degrees or higher, whether the minimum temperature is below 10 degrees, and whether it rained. In the past data group, the actual values for maximum and minimum temperature can take continuous (that is, widely varying) values, and the actual values for weather can likewise take various values such as clear, rain, cloudy, and clear followed by rain. Using such raw data directly when constructing the prediction model is not appropriate: the amount of processing can become enormous, while little improvement in prediction accuracy can be expected. Therefore, in the initial feature definition information 22a, for example, the feature item “is the maximum temperature 15 degrees or higher” is defined based on the related item “maximum temperature” in the past data, and two values, 1 (15 degrees or higher) and 0 (below 15 degrees), are defined as the values that this feature item can take. The threshold for the maximum temperature (15 degrees in this example) may be determined by the user. In this way, for related items that can take many values, the number of values each feature item can take can be reduced by devising the feature items appropriately. This simplifies the construction of the prediction model.

Each feature item defined in the feature definition information 22 is assigned a feature item ID that uniquely identifies it.

The control unit 14 is composed of, for example, a CPU (Central Processing Unit) or a microcontroller, and controls each unit of the information processing apparatus 10 based on the information processing program stored in the storage unit 12. As shown in FIG. 1, the control unit 14 also functions as a past data classification unit 24, a feature generation unit 26, a prediction unit 28, an error calculation unit 30, a feature combination definition unit 32, a feature combination error calculation unit 34, and a new feature item addition unit 36. Through these functions, new feature items that can improve the prediction accuracy of the prediction model are added to the feature definition information 22. Each function is described in detail below.

The past data classification unit 24 classifies the past data group stored in the past data DB 20 into a learning data group and a verification data group. The learning data group is used to construct the prediction model and consists of a plurality of pieces of past data forming a part of the past data group. The verification data group is used to verify the prediction accuracy of the constructed prediction model and likewise consists of a plurality of pieces of past data forming a part of the past data group. As described above, in the present embodiment the past data DB 20 holds three years of past data, from 2014 to 2016; the past data for the two years 2014 and 2015 is used as the learning data group, and the past data for 2016 as the verification data group. Of course, the method of dividing the data into learning and verification data groups is not limited to this.
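The year-based split described above can be sketched as follows. The record layout, the function name, and the sales values are hypothetical; only the 2014/2015 versus 2016 division comes from the description.

```python
def split_by_year(past_data, verification_years):
    """Split past data into learning and verification groups by the 'year' item."""
    learning = [r for r in past_data if r["year"] not in verification_years]
    verification = [r for r in past_data if r["year"] in verification_years]
    return learning, verification


# Illustrative records; the sales figures are made up.
past_data = [
    {"id": "20140401", "year": 2014, "sales": 120.0},
    {"id": "20150401", "year": 2015, "sales": 130.0},
    {"id": "20160401", "year": 2016, "sales": 125.0},
]
learning, verification = split_by_year(past_data, {2016})
```

Holding out the most recent year keeps the verification data strictly later than the learning data, which matches how the model would be used for forecasting.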

The feature generation unit 26 generates, based on the past data stored in the past data DB 20 and the feature definition information 22, a feature consisting of a plurality of feature items and the value for each of them. The feature generation unit 26 generates one feature for each piece of past data in the past data group, thereby producing a feature group. The feature items included in a feature need not be all of the feature items in the feature definition information 22; they may be a subset of the feature items defined there.

FIG. 4 shows an example of the feature group generated by the feature generation unit 26, based on the past data shown in FIG. 2 and the feature definition information 22 shown in FIG. 3.

For example, the feature I1 shown in FIG. 4 corresponds to the past data indicated by the past data ID "20140401" (see FIG. 2). The value for each feature item is determined from the values of the related items in that past data. In the feature I1, the value "2014" is determined for the feature item "year", and the values for the other feature items are determined in the same way. The value "3" is determined for the feature item "day of the week" because the past data has the value "Tuesday" for the related item "day of the week". Likewise, because the value of the related item "maximum temperature" in the past data is "17.4 degrees", the value for the feature item "is the maximum temperature 15 degrees or higher?" is "1", indicating 15 degrees or higher.

In this way, the feature generation unit 26 generates features for the entire past data group, including both the learning data group and the verification data group. Each feature is assigned a past data ID so that its relationship to the corresponding past data is clear. The generated features are passed to the prediction unit 28 and the feature combination error calculation unit 34.
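As a rough sketch of how one past data record might be turned into a feature like I1: the field names, the function `make_feature`, and the weekday encoding (Sunday = 1 through Saturday = 7, inferred only from "Tuesday" mapping to "3") are assumptions for illustration, not details confirmed by FIG. 3 or FIG. 4.

```python
from datetime import date

def make_feature(record):
    """Turn one past-data record into a feature (feature item -> value)."""
    d = record["date"]
    return {
        "past_data_id": record["id"],          # links the feature to its past data
        "year": d.year,
        "month": d.month,
        # Assumed encoding: Sunday=1 ... Saturday=7, so Tuesday becomes 3.
        "day_of_week": (d.weekday() + 1) % 7 + 1,
        # Binary feature item: 1 if the maximum temperature is 15 degrees or higher.
        "max_temp_ge_15": 1 if record["max_temp"] >= 15.0 else 0,
    }

record = {"id": "20140401", "date": date(2014, 4, 1), "max_temp": 17.4}
feature_i1 = make_feature(record)
```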

The prediction unit 28 first constructs a prediction model based on the features that the feature generation unit 26 generated from the learning data group and the actual value of the target item (the actual sales value in the present embodiment) included in each learning data.

Various methods can be used to construct the prediction model. One example is gradient boosting, an ensemble learning method that constructs a prediction model by combining a plurality of weak classifiers generated from the features and the actual values: each weak classifier is trained with reference to the learning result of the preceding one, taking into account the gradient of a loss function that defines the error between the predicted values and the actual values. Another example is the random forest method, which creates a plurality of weakly correlated decision trees from learning data sampled from the learning data group, randomly selecting the feature items used for discrimination (classification) at each non-terminal node, and constructs the prediction model from those decision trees.
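To make the gradient boosting idea concrete, here is a deliberately small sketch. It assumes a squared-error loss (whose negative gradient is simply the residual) and depth-1 decision stumps as the weak learners. The document does not specify either choice, and a production system would more likely rely on an established library, so all names and parameters here are illustrative.

```python
def fit_stump(rows, residuals):
    """Find the single (feature index, threshold) split with the lowest squared error."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [e for r, e in zip(rows, residuals) if r[j] <= t]
            right = [e for r, e in zip(rows, residuals) if r[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((e - lm) ** 2 for e in left) + sum((e - rm) ** 2 for e in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:] if best else None

def fit_boosting(rows, targets, n_rounds=20, learning_rate=0.5):
    """Fit stumps one after another, each on the residuals left by the previous ones."""
    base = sum(targets) / len(targets)
    preds = [base] * len(rows)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient equals actual minus predicted.
        residuals = [y - p for y, p in zip(targets, preds)]
        stump = fit_stump(rows, residuals)
        if stump is None:
            break
        j, t, lm, rm = stump
        stumps.append(stump)
        preds = [p + learning_rate * (lm if r[j] <= t else rm)
                 for r, p in zip(rows, preds)]
    return base, stumps, learning_rate

def predict(model, row):
    """Sum the base value and the scaled contribution of every stump."""
    base, stumps, lr = model
    return base + sum(lr * (lm if row[j] <= t else rm) for j, t, lm, rm in stumps)

# Toy data: one feature item, sales jump once the feature exceeds 2.
model = fit_boosting([[1], [2], [3], [4]], [10.0, 10.0, 20.0, 20.0])
```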

Next, the prediction unit 28 predicts the value of the target item (the predicted sales value in the present embodiment) for each verification data, based on the constructed prediction model and the features that the feature generation unit 26 generated from the verification data group. Each verification data already has an actual sales value. Nevertheless, the prediction unit 28 predicts a sales value for each verification data, whose actual value is known, so that a new feature item can be added to the feature definition information 22.

When the prediction unit 28 performs the prediction process, a plurality of predicted sales values, one for each verification data, are obtained. FIG. 5 shows an example of predicted sales values alongside the actual sales values of the corresponding verification data. The predicted sales values are passed to the error calculation unit 30, each associated with the past data ID of the corresponding past data and with the actual sales value of that past data.

The error calculation unit 30 calculates the error between the actual value of the target item in a verification data (the actual sales value in the present embodiment) and the value of the target item that the prediction unit 28 predicted for that verification data (the predicted sales value in the present embodiment). The error calculation unit 30 calculates this error for each verification data, so that a plurality of errors, one per verification data, are obtained. In the example of FIG. 5, the error calculation unit 30 calculates an error of "2" for the verification data indicated by the past data ID "20160401" and an error of "4" for the verification data indicated by the past data ID "20160402", and calculates the error for each remaining verification data in the same way.

After calculating the errors for the individual verification data, the error calculation unit 30 calculates the average of the errors over all verification data (hereinafter the "total average error"). The total average error is obtained by dividing the sum of the errors by the number of verification data. As described later, the total average error is used, for example, to decide when to end the process of adding new feature items.
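The per-record error and the total average error can be sketched as below. The document does not state how the error is defined, so the absolute difference is an assumption. The actual and predicted values are hypothetical numbers chosen only so that the errors come out as "2" and "4", as in the FIG. 5 example.

```python
def absolute_errors(actual, predicted):
    """Map each past data ID to |actual - predicted| (assumed error definition)."""
    return {data_id: abs(actual[data_id] - predicted[data_id]) for data_id in actual}

def total_average_error(errors):
    """Sum of all errors divided by the number of verification data."""
    return sum(errors.values()) / len(errors)

# Hypothetical values, keyed by past data ID.
actual = {"20160401": 100, "20160402": 110}
predicted = {"20160401": 102, "20160402": 106}

errors = absolute_errors(actual, predicted)
overall = total_average_error(errors)
```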

The errors calculated by the error calculation unit 30 for the individual verification data are passed to the feature combination error calculation unit 34, each associated with the past data ID of its verification data.

The feature combination definition unit 32 defines a plurality of feature combinations based on the feature definition information 22. A feature combination consists of a plurality of feature items defined in the feature definition information 22 and the values for those feature items. FIG. 6 shows specific examples of feature combinations, defined on the basis of the feature definition information 22 shown in FIG. 3.

For example, the feature combination P1 shown in FIG. 6 combines the value "12" for the feature item "month" with the value "3" for the feature item "week". The feature combinations shown in FIG. 6 are only examples. In the present embodiment, the feature combination definition unit 32 defines every combination that can be formed from the feature items in the feature definition information 22 and their possible values. However, combinations within the same feature item are not defined. For example, the combination of the value "1" for the feature item "month" with the value "2" for the same feature item "month" is not defined. As shown in FIG. 6, each feature combination is assigned a combination ID that uniquely identifies it. The feature combinations defined by the feature combination definition unit 32 are passed to the feature combination error calculation unit 34.
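One way to enumerate such combinations is sketched below. It assumes the feature definition information can be modeled as a mapping from feature item to its possible values, and it restricts itself to pairs of two different feature items, which is a simplification chosen for the example. The function name and the "P1, P2, ..." numbering order are likewise illustrative.

```python
from itertools import combinations, product

def define_feature_combinations(feature_definition):
    """Enumerate every value pair drawn from two *different* feature items."""
    combos = []
    # combinations() never pairs an item with itself, so "month=1 and month=2"
    # style combinations are excluded by construction.
    for item_a, item_b in combinations(feature_definition, 2):
        for value_a, value_b in product(feature_definition[item_a],
                                        feature_definition[item_b]):
            combos.append({item_a: value_a, item_b: value_b})
    # Assign a combination ID to each feature combination.
    return {f"P{n}": combo for n, combo in enumerate(combos, start=1)}

definition = {"month": [11, 12], "week": [1, 2, 3], "max_temp_ge_15": [0, 1]}
feature_combinations = define_feature_combinations(definition)
```

The number of combinations grows with the product of the value counts, which is the processing cost the second embodiment sets out to reduce.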

For each of the feature combinations defined by the feature combination definition unit 32, the feature combination error calculation unit 34 calculates a representative value of the errors of those verification data, among the verification data group, that match the feature combination.

A verification data matches a feature combination when it has the values specified for the feature items in that combination. For example, if the feature combination combines the value "12" for the feature item "month" with the value "3" for the feature item "week", the matching verification data are those whose related item "month" has the value "December" and whose related item "week" has the value "3". In the present embodiment, that means the seven verification data for the third week of December 2016. In this way, the matching verification data are identified for each feature combination. The matching is performed using the features that the feature generation unit 26 generated from the verification data group.

Having identified the verification data that match each feature combination, the feature combination error calculation unit 34 computes, for each feature combination, a representative value of the errors of the matching verification data. In the present embodiment the representative value is the mean, but it may instead be, for example, the median. Specifically, the feature combination error calculation unit 34 selects a feature combination and, based on the features passed from the feature generation unit 26, identifies the past data IDs of the verification data that match it. It then extracts, from the errors passed from the error calculation unit 30, those associated with the identified past data IDs, and computes their mean. This mean is the average error of the feature combination (hereinafter the "average error"). In this way, the feature combination error calculation unit 34 calculates the average error of every feature combination defined by the feature combination definition unit 32.
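The matching and averaging steps just described might look like this in outline. The data layout (features carrying a `past_data_id`, errors keyed by that ID) and the decision to skip combinations with no matching verification data are assumptions for the sketch, and the numbers are invented.

```python
def matches(feature, combination):
    """True if the feature has every (item, value) pair of the combination."""
    return all(feature.get(item) == value for item, value in combination.items())

def combination_average_errors(combinations_by_id, verification_features, errors):
    """Average the errors of the verification data matching each combination."""
    averages = {}
    for combo_id, combo in combinations_by_id.items():
        ids = [f["past_data_id"] for f in verification_features if matches(f, combo)]
        if ids:  # assumption: combinations with no matching data are skipped
            averages[combo_id] = sum(errors[i] for i in ids) / len(ids)
    return averages

# Hypothetical verification features and errors.
verification_features = [
    {"past_data_id": "A", "month": 12, "week": 3},
    {"past_data_id": "B", "month": 12, "week": 3},
    {"past_data_id": "C", "month": 12, "week": 4},
]
errors = {"A": 10, "B": 20, "C": 3}
combos = {
    "P1": {"month": 12, "week": 3},
    "P2": {"month": 12, "week": 4},
    "P3": {"month": 11, "week": 1},
}
average_errors = combination_average_errors(combos, verification_features, errors)
```

Swapping the mean for `statistics.median` would give the median variant mentioned in the text.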

Based on the representative error value of each feature combination (the average error in the present embodiment) calculated by the feature combination error calculation unit 34, the new feature item addition unit 36 identifies a specific feature combination from among the feature combinations defined by the feature combination definition unit 32. The identification method is described below. Basically, a feature combination with a large average error is taken as the specific feature combination.

In the present embodiment, the new feature item addition unit 36 ranks the feature combinations in descending order of average error and takes the first-ranked feature combination as the specific feature combination. That is, the single feature combination with the largest average error is the specific feature combination. The new feature item addition unit 36 may also identify the specific feature combination by other methods. For example, after ranking the feature combinations in descending order of average error, it may take several top-ranked feature combinations (for example, the first to third) as specific feature combinations. Alternatively, an error threshold may be set in advance, and every feature combination whose average error is equal to or greater than the threshold may be identified as a specific feature combination. With these methods, more than one feature combination may be identified as a specific feature combination.
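The three selection rules above (largest only, top few, or threshold) reduce to a sort and a filter. The following sketch uses hypothetical function names and made-up average errors.

```python
def select_top(average_errors, k=1):
    """Return the IDs of the k combinations with the largest average error."""
    ranked = sorted(average_errors, key=average_errors.get, reverse=True)
    return ranked[:k]

def select_by_threshold(average_errors, threshold):
    """Return every combination whose average error is at least the threshold."""
    return [cid for cid, err in average_errors.items() if err >= threshold]

average_errors = {"P1": 15.0, "P2": 3.0, "P3": 8.0}

largest_only = select_top(average_errors)           # the embodiment's default rule
top_two = select_top(average_errors, k=2)           # "top few" variant
over_threshold = select_by_threshold(average_errors, 8.0)
```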

Next, the new feature item addition unit 36 generates a new feature item based on the specific feature combination. In the present embodiment, the new feature item indicates whether or not a data item matches the specific feature combination. For example, when the specific feature combination combines the value "12" for the feature item "month" with the value "4" for the feature item "week", the new feature item addition unit 36 generates the new feature item "is it the fourth week of December?" and generates "1 (yes)" and "0 (no)" as its possible values.

The new feature item addition unit 36 then adds the generated new feature item to the feature definition information 22, producing the updated feature definition information 22b. When a plurality of specific feature combinations have been identified, the new feature item addition unit 36 generates a new feature item, together with its possible values, for each of them and adds all of them to the feature definition information 22.
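A minimal sketch of turning a specific feature combination into a binary feature item and appending it to the definition follows. The naming scheme (`is_month12_week4`), the returned evaluator function, and the dictionary model of the feature definition information are all assumptions made for illustration.

```python
def new_feature_item(combination):
    """Build (name, possible values, evaluator) for 'does this match the combination?'."""
    name = "is_" + "_".join(f"{item}{value}" for item, value in combination.items())

    def evaluate(feature):
        # 1 (yes) if every item of the combination has the required value, else 0 (no).
        return 1 if all(feature.get(i) == v for i, v in combination.items()) else 0

    return name, [1, 0], evaluate

initial_definition = {"month": list(range(1, 13)), "week": [1, 2, 3, 4, 5]}

name, values, evaluate = new_feature_item({"month": 12, "week": 4})
# Build the updated definition as a new mapping; the initial one stays untouched.
updated_definition = {**initial_definition, name: values}
```

Because the new item is an ordinary entry in the definition, a later round can combine it with other feature items just like any original item.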

FIG. 7 shows an example of the updated feature definition information 22b to which a new feature item has been added. It is obtained by adding the one new feature item from the example above to the initial feature definition information 22a shown in FIG. 3.

In this way, the new feature item addition unit 36 adds to the feature definition information 22 a new feature item defined from a feature combination, that is, from a combination of feature items defined in the initial feature definition information 22a and their values. When the prediction unit 28 constructs a new prediction model based on the updated feature definition information 22b, the new prediction model can be expected to have better prediction accuracy than the old prediction model constructed from the initial feature definition information 22a.

In particular, because the new feature item is based on a specific feature combination that has a large average error among the many feature combinations that can be defined, the prediction accuracy of the new prediction model is all the more likely to exceed that of the old one. For example, when the initial feature definition information 22a includes the feature items "month" and "week", a prediction model based on it accounts for month-to-month and week-to-week sales variation separately, but not for a specific combination of month and week. As a result, a sudden change in sales in a particular week of a particular month, caused for example by an annual event, is not adequately reflected in the prediction model. According to the present embodiment, adding a new feature item based on, for example, the feature combination of the value "12" for the feature item "month" and the value "4" for the feature item "week" allows the prediction unit 28 to construct a prediction model that accounts for a sudden change in sales occurring only in the fourth week of December.

In addition, with the identification method of the present embodiment, one specific feature combination is identified, and therefore one new feature item is added, in each round of the new feature item addition process. This keeps the change to the feature definition information 22 per round, and hence the change to the prediction model per round, to a minimum. The new feature item addition process is assumed to be repeated so that the prediction model changes step by step, and a single addition does not always improve prediction accuracy. Given this, changing the prediction model gradually, in the smallest possible steps, in fact leads to an earlier improvement in prediction accuracy.

When the new feature item addition process is repeated, the second and subsequent rounds can define feature combinations that include a new feature item added in an earlier round. For example, if the first round produced the updated feature definition information 22b shown in FIG. 7, then in the second round the feature combination definition unit 32 can define a feature combination of the value "1" for the feature item "is it the fourth week of December?" with the value "1" for the feature item "is the maximum temperature 15 degrees or higher?".

As noted above, the prediction accuracy of a prediction model constructed from the updated feature definition information 22b is not necessarily better than that of a prediction model constructed from the initial feature definition information 22a. The new feature item addition unit 36 may therefore add the new feature item, defined from the specific feature combination, to the initial feature definition information 22a only after confirming that adding it improves the prediction accuracy of the prediction model. The specific processing flow is as follows.

First, by the same process as described above, the new feature item addition unit 36 identifies, from among the feature combinations defined by the feature combination definition unit 32, the specific feature combination with the largest average error, and generates a new feature item based on it. The new feature item addition unit 36 then provisionally adds the generated new feature item to the feature definition information 22, producing provisional feature definition information.

After that, the feature generation unit 26 and the subsequent units perform the same processing as described above, using the provisional feature definition information. Specifically, the feature generation unit 26 generates provisional features based on the provisional feature definition information and the past data group. The prediction unit 28 constructs a provisional prediction model based on the provisional features and actual sales values of the learning data group, and calculates a predicted value for each verification data based on the provisional prediction model and the provisional features of the verification data group. The error calculation unit 30 calculates the error between the actual sales value of each verification data and the sales value predicted for it by the provisional prediction model, and then calculates the average of these errors over all verification data (hereinafter the "provisional total average error").

The new feature item addition unit 36 then compares the provisional total average error with the "pre-processing total average error", that is, the average, calculated during the process of generating the new feature item, of the errors between the actual sales value of each verification data and the sales value predicted by the prediction model constructed from the initial feature definition information 22a. If the provisional total average error is smaller than the pre-processing total average error, adding the new feature item has improved the prediction accuracy of the prediction model, so the provisionally added new feature item is formally added to the feature definition information 22. If the provisional total average error is equal to or greater than the pre-processing total average error, adding the new feature item has not improved the prediction accuracy, so the new feature item addition unit 36 deletes the provisionally added new feature item. In other words, the new feature item is not added to the feature definition information 22.
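The provisional-add-then-verify logic can be sketched as follows. The callable `evaluate_total_average_error` is a hypothetical stand-in for the whole pipeline (feature generation, model construction, prediction, and error averaging), and the two lambda evaluators exist only to exercise both branches.

```python
def try_add_feature_item(definition, new_item, evaluate_total_average_error):
    """Keep the new feature item only if it lowers the total average error."""
    name, values = new_item
    error_before = evaluate_total_average_error(definition)   # pre-processing error
    provisional = {**definition, name: values}                 # provisional addition
    error_after = evaluate_total_average_error(provisional)    # provisional error
    if error_after < error_before:
        return provisional, True    # formally added
    return definition, False        # provisional item discarded

definition = {"month": list(range(1, 13)), "week": [1, 2, 3, 4, 5]}
new_item = ("is_month12_week4", [1, 0])

# Stand-in evaluators: one where every added item helps, one where nothing changes.
accepted_def, accepted = try_add_feature_item(definition, new_item, lambda d: 5.0 - len(d))
rejected_def, rejected = try_add_feature_item(definition, new_item, lambda d: 5.0)
```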

The processing flow of the information processing apparatus 10 according to the present embodiment is described below with reference to the flowchart shown in FIG. 8.

In step S10, the user stores the past data group in the past data DB 20 and stores the initial feature definition information 22a in the storage unit 12.

In step S12, the past data classification unit 24 classifies the past data group stored in the past data DB 20 into a learning data group and a verification data group.

In step S14, the feature generation unit 26 generates, based on the initial feature definition information 22a, a feature consisting of a plurality of feature items and their values for each past data stored in the past data DB 20 in step S10.

In step S16, the prediction unit 28 constructs a prediction model based on the features generated from the learning data group in step S14 and the actual sales value included in each learning data.

In step S18, the prediction unit 28 predicts a sales value for each verification data based on the prediction model constructed in step S16 and the features generated from the verification data group in step S14.

In step S20, the error calculation unit 30 calculates, for each verification data, the error between the actual sales value and the predicted sales value calculated in step S18.

In step S22, the error calculation unit 30 determines whether the average of the errors over all verification data calculated in the current step S20 (hereinafter the "current total average error") is smaller than the average of the errors over all verification data calculated in the previous step S20 (hereinafter the "previous total average error"), and whether the difference between the current total average error and the previous total average error is equal to or less than a threshold. If both conditions are satisfied, the process ends. Otherwise, the process proceeds to step S24. In the first round there is no previous total average error, so the process proceeds to step S24. Step S22 is described in detail later.

In step S24, the feature combination definition unit 32 defines, based on the initial feature definition information 22a, a plurality of feature combinations, each consisting of a plurality of feature items and the values for those feature items.

In step S26, for a feature combination defined in step S24, the feature combination error calculation unit 34 extracts, from the errors calculated for the individual verification data in step S20, the errors of the verification data that match the feature combination, and calculates their average (the average error). This is done for every feature combination defined in step S24, so that an average error is obtained for each feature combination.

In step S28, the new feature item addition unit 36 identifies, from among the feature combinations defined in step S24, the feature combination with the largest average error calculated in step S26. Based on the identified feature combination, it then generates a new feature item and the possible values for that new feature item.

In step S30, the new feature item addition unit 36 adds the new feature item generated in step S28 and its possible values to the feature definition information 22, producing the updated feature definition information 22b.

After step S30, the process returns to step S14, where the feature generation unit 26 generates a new feature for each past data based on the updated feature definition information 22b produced in step S30. Steps S16 through S22 are then performed again in the same way, using the new features.

In this way, in the present embodiment, steps S14 through S30 are repeated and new feature items are added to the feature definition information 22 one after another. The prediction accuracy of the prediction model constructed by the prediction unit 28 can thus be expected to improve progressively.

The condition in step S22 is the termination condition of this iterative process. As described above, the condition is that the current total average error is smaller than the previous total average error and that the difference between the two is equal to or less than a threshold. In other words, the iteration ends when the total average error has decreased relative to the previous round but by only a small amount.

This is based on the fact that the number of repetitions of the new feature item addition process and the total average error are generally related as shown in FIG. 9. As FIG. 9 shows, while the number of repetitions is still relatively small, each round of the new feature item addition process reduces the total average error by a relatively large amount. Once the number of repetitions becomes relatively large, each round changes the total average error very little. Therefore, when the difference between the current and previous total average errors falls to or below the threshold, it can be judged that the total average error will not decrease substantially any further, and the repetition of the new feature item addition process is ended.
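The step S22 termination test is small enough to write out directly. In this sketch the function name is hypothetical and `None` is used as an assumed marker for "no previous round".

```python
def should_stop(current_error, previous_error, threshold):
    """Step S22: stop when the error improved, but only by a small amount."""
    if previous_error is None:       # first round: nothing to compare against
        return False
    improved = current_error < previous_error
    small_change = (previous_error - current_error) <= threshold
    return improved and small_change

first_round = should_stop(10.0, None, 0.5)
small_gain = should_stop(9.9, 10.0, 0.5)
large_gain = should_stop(8.0, 10.0, 0.5)
got_worse = should_stop(10.5, 10.0, 0.5)
```

Note that a round in which the error increases does not end the loop under this condition; the process simply continues to step S24.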

Conditions other than the above may also be adopted as the termination condition for the repetition of the new feature item addition process. For example, the condition may be that the number of feature items included in the feature definition information 22 has reached a predetermined number, or that the number of repetitions of the new feature item addition process has reached a predetermined count.

<Second Embodiment>
The second embodiment is based on the first embodiment but differs from it in the processing performed by the feature combination definition unit 32. In the first embodiment, the feature combination definition unit 32 defines every combination that can be formed from the feature items in the feature definition information 22 and their possible values. This makes every feature combination a candidate for the specific feature combination without omission, but the number of defined feature combinations may become enormous, and the processing load on the feature combination definition unit 32, the feature combination error calculation unit 34, or the new feature item addition unit 36 may become large.

In view of this, in the second embodiment the feature combination definition unit 32 does not define feature combinations whose average error is unlikely to be large, which reduces the number of defined feature combinations. The processing load of the new feature item addition process is reduced accordingly. The processing of the feature combination definition unit 32 in the second embodiment is described below.

First, in the second embodiment, a hierarchical relationship among the feature items is defined in the feature definition information 22. The hierarchical relationship may be defined in advance by a user or the like.

FIG. 10 shows an example of the hierarchical relationship in the present embodiment. In the example of FIG. 10, the feature item "year" is defined as the first layer, the feature item "month" as the second layer, the feature item "week" as the third layer, and the feature item "day of the week" as the fourth layer. In this way, the hierarchical relationship among the feature items follows the actual concept that each feature item represents. The hierarchical relationship shown in FIG. 10 is of course only an example, and the hierarchical relationship among feature items can be defined in various ways.

The feature combination definition unit 32 defines feature combinations based on the hierarchical relationship defined in the feature definition information 22. Specifically, the feature combination definition unit 32 first selects one feature item to be included in the feature combination. It then selects, as the other feature item, a feature item belonging to a layer adjacent to the layer of the first feature item in the hierarchical relationship. It then defines a feature combination by combining the first feature item (and a value for it) with the other feature item (and a value for it). In other words, in the second embodiment the feature combination definition unit 32 does not define a feature combination of one feature item and a feature item belonging to a layer that is not adjacent to the layer of that feature item.

This is explained using the example of FIG. 10. The feature combination definition unit 32 defines, for example, a feature combination of the feature item "year" in the first layer and the feature item "month" in the second layer, but does not define a feature combination of the feature item "year" in the first layer and the feature item "week" in the third layer. Specifically, it defines the feature combination of the value "2015" for the feature item "year" and the value "3" for the feature item "month", but does not define the combination of the value "2015" for the feature item "year" and the value "4" for the feature item "week".
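The adjacent-layer rule can be sketched as follows. The dictionaries `layers` and `values`, the identifier `weekday` (standing in for "day of the week"), and the function name `adjacent_combinations` are illustrative assumptions; the specification does not prescribe a data structure.

```python
from itertools import product

# Hypothetical layer assignment following FIG. 10 (layer 1 is the shallowest).
layers = {"year": 1, "month": 2, "week": 3, "weekday": 4}

# Hypothetical possible values for each feature item.
values = {
    "year": [2015, 2016],
    "month": [3, 4],
    "week": [1, 4],
    "weekday": ["Mon", "Sun"],
}


def adjacent_combinations(layers, values):
    """Pair each feature item only with items in the next deeper layer."""
    combos = []
    items = sorted(layers, key=layers.get)
    for a in items:
        for b in items:
            if layers[b] - layers[a] == 1:  # adjacent layers only
                for va, vb in product(values[a], values[b]):
                    combos.append(((a, va), (b, vb)))
    return combos


combos = adjacent_combinations(layers, values)
print(len(combos))
```

Under these assumptions, ("year", 2015) with ("month", 3) is generated, while no pair of "year" and "week" is, which is the reduction the second embodiment aims at.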

For example, in the past data it is relatively rare for a combination of the related items "year" and "week", which are far apart in the hierarchy, to have a particular effect on sales (for example, sales fluctuating abruptly only in the fourth week of every month of 2015). The average error of a feature combination that includes feature items far apart in the hierarchy is therefore unlikely to be large. Accordingly, by not defining feature combinations that include feature items far apart in the hierarchy, the feature combination definition unit 32 can reduce the number of defined feature combinations while retaining those whose average error is likely to be large.

In the second embodiment, when a new feature item is added to the feature definition information 22, the layer of the added new feature item is the deeper of the layers of the two feature items included in the feature combination on which it is based. For example, when a new feature item "is it March 2015?" is defined from the feature combination of the value "2015" for the feature item "year" in the first layer and the value "3" for the feature item "month" in the second layer, the layer of the new feature item is the second layer, which is the deeper of the first and second layers.

As a result, a further new feature item addition process can form a feature combination of the feature item "is it March 2015?" in the second layer and the feature item "week" in the third layer. In this way, feature combinations involving three or more feature items can also be defined along the hierarchical relationship.
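The layer assignment for a new feature item can be sketched as follows. The names `add_new_feature` and `is_march_2015` and the layer dictionary are hypothetical and introduced only for illustration.

```python
# Hypothetical layer assignment following FIG. 10.
layers = {"year": 1, "month": 2, "week": 3, "weekday": 4}


def add_new_feature(layers, name, parents):
    """Register a new feature item at the deeper layer of its parent items."""
    updated = dict(layers)
    updated[name] = max(layers[p] for p in parents)
    return updated


layers2 = add_new_feature(layers, "is_march_2015", ("year", "month"))

# Feature items one layer deeper than the new item can now be combined with it.
deeper_neighbors = [
    f for f, depth in layers2.items() if depth == layers2["is_march_2015"] + 1
]
print(layers2["is_march_2015"], deeper_neighbors)
```

Because the new item sits in the second layer, it becomes adjacent to "week", so a later repetition can effectively combine year, month, and week.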

<Third Embodiment>
The third embodiment is based on the first embodiment but differs from the first and second embodiments in the processing performed by the feature combination definition unit 32. In the first and second embodiments, the values included in a feature combination are defined as values of the same past data (the past data of interest). For example, the feature combination P1 shown in FIG. 6 means that, for one item of past data of interest, the value for the feature item "month" is "12" and the value for the feature item "week" is "3".

When the past data group consists of a plurality of past data arranged in time series, prediction accuracy may improve if the predicted value is computed with the values of related items in more than one item of past data taken into account. When, as in the present embodiment, the past data group consists of past data accumulated day by day, the sales forecast for a prediction target day may be affected by, for example, whether the day before the target day is a holiday or a weekday, or whether the day after the target day is a holiday or a weekday. In such cases, constructing the prediction model so that it also considers whether the day before or after the target day is a holiday can further improve its prediction accuracy.

In view of this, in the third embodiment the feature combination definition unit 32 defines a feature combination that includes a value, in the past data of interest, for a feature item selected from the feature item group defined in the feature definition information 22, and a value, in past data other than the past data of interest, for a feature item likewise selected from the feature item group.

For example, in the third embodiment the feature combination definition unit 32 defines a feature combination consisting of the value "0 (weekday)" of the past data of interest (the current day) for the feature item "holiday or weekday" and the value "1 (holiday)" of the previous day for the feature item "holiday or weekday". The two feature items in a feature combination need not be the same; for example, a feature combination may consist of the current day's value "0 (weekday)" for the feature item "holiday or weekday" and the previous day's value "1 (Sunday)" for the feature item "day of the week". In addition, the values in a feature combination need not come from past data that are consecutive in the time series; they may be the value of the current day together with a value from, for example, two days earlier or one week earlier.

When such a feature combination is defined and the feature combination error calculation unit 34 finds that its average error is the largest, a new feature item based on that feature combination, such as "is the current day a weekday and the previous day a holiday?", is added to the feature definition information 22. The prediction unit 28 can then construct a prediction model with improved prediction accuracy that takes into account the values of related items in a plurality of past data.
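A cross-record feature item of this kind can be sketched as follows. The record layout, the field name `is_holiday`, and the function `weekday_after_holiday` are assumptions made for illustration, and the records are assumed to be consecutive days in chronological order.

```python
# Hypothetical daily records in chronological order (1 = holiday, 0 = weekday).
records = [
    {"date": "2015-03-07", "is_holiday": 1},  # Saturday
    {"date": "2015-03-08", "is_holiday": 1},  # Sunday
    {"date": "2015-03-09", "is_holiday": 0},  # Monday
    {"date": "2015-03-10", "is_holiday": 0},  # Tuesday
]


def weekday_after_holiday(records, lag=1):
    """Flag records that are weekdays whose record `lag` steps earlier is a holiday."""
    flags = []
    for i, rec in enumerate(records):
        if i < lag:
            flags.append(None)  # no earlier record to compare against
        else:
            earlier = records[i - lag]
            flags.append(int(rec["is_holiday"] == 0 and earlier["is_holiday"] == 1))
    return flags


flags = weekday_after_holiday(records)
print(flags)
```

Changing `lag` to 2 or 7 corresponds to combining the current day's value with the value from two days or one week earlier, as the text allows.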

<Fourth Embodiment>
The fourth embodiment is based on the first embodiment but differs from it in that processing is performed based on a contribution degree, which indicates, for each feature item defined in the feature definition information 22, the magnitude of its influence on the predicted value of the target item (sales in the present embodiment).

The contribution degree of each feature item is calculated by the prediction unit 28. The basic method of calculation is as follows. First, as described above, the prediction unit 28 constructs a prediction model based on the feature group for the learning data group generated by the feature generation unit 26 and the actual sales values included in the learning data group, and calculates a predicted sales value for each verification data based on the prediction model and the feature of that verification data. Next, in the feature of the verification data, the value for the feature item of interest is changed at random to generate a changed feature. The prediction unit 28 then calculates a predicted sales value for the verification data based on the prediction model and the changed feature. The larger the difference between the predicted sales value based on the changed feature and the predicted sales value calculated beforehand, the higher the contribution of the feature item of interest. The prediction unit 28 therefore calculates a higher contribution degree for the feature item of interest as this difference becomes larger.
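The perturbation-based calculation can be sketched as follows. The specification states only that a larger difference in prediction yields a higher contribution degree; averaging the absolute differences over the verification data is an assumption made here, as are the names `contribution` and `toy_model`, which stand in for the trained prediction model.

```python
import random


def contribution(model, rows, feature, candidates, rng):
    """Mean absolute change in prediction when `feature` is randomly changed."""
    total = 0.0
    for row in rows:
        baseline = model(row)
        changed = dict(row)
        # Pick a random value that differs from the current one.
        changed[feature] = rng.choice([v for v in candidates if v != row[feature]])
        total += abs(model(changed) - baseline)
    return total / len(rows)


def toy_model(row):
    """Hypothetical stand-in for a trained model: only is_holiday matters."""
    return 100.0 + 50.0 * row["is_holiday"] + 0.0 * row["week"]


rows = [{"is_holiday": 0, "week": 1}, {"is_holiday": 1, "week": 3}]
rng = random.Random(0)

c_holiday = contribution(toy_model, rows, "is_holiday", [0, 1], rng)
c_week = contribution(toy_model, rows, "week", [1, 2, 3, 4], rng)
print(c_holiday, c_week)
```

In this toy setting the feature the model ignores receives a contribution of zero, while the feature that drives the prediction receives a large one.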

Through this processing, information indicating the contribution degree of each feature item is attached to the feature definition information 22, and contribution-added feature definition information 22-2 is generated. FIG. 11 shows an example of the contribution-added feature definition information 22-2.

The feature combination definition unit 32 generates feature combinations based on the contribution degree of each feature item. Specifically, a contribution threshold is set in advance, and when defining feature combinations, feature items whose contribution degree is equal to or less than the contribution threshold are not selected. If the feature combination error calculation unit 34 were to calculate the average error for a feature combination that includes a feature item with a low contribution degree, that average error would be unlikely to be large. Therefore, by not defining feature combinations that include feature items at or below the contribution threshold, the number of defined feature combinations can be reduced while retaining those whose average error is likely to be large.
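The threshold-based selection can be sketched as follows. The contribution values, the threshold of 0.10, and the function name `eligible_items` are hypothetical and serve only to illustrate the filtering step.

```python
from itertools import combinations

# Hypothetical contribution degrees attached to each feature item.
contributions = {"year": 0.05, "month": 0.40, "week": 0.30, "weekday": 0.25}


def eligible_items(contributions, threshold):
    """Keep only feature items whose contribution exceeds the threshold."""
    return [name for name, c in contributions.items() if c > threshold]


eligible = eligible_items(contributions, threshold=0.10)

# Only eligible items are paired when defining feature combinations.
pairs = list(combinations(eligible, 2))
print(eligible, len(pairs))
```

Here dropping one low-contribution item reduces the number of item pairs from six to three before any values are enumerated.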

In the first embodiment, the new feature item addition unit 36 identifies the specific feature combination based on the average error calculated for each feature combination. In the fourth embodiment, it additionally bases the identification on the contribution degrees of the feature items included in each feature combination. For example, the product of the average error of a feature combination and the mean or sum of the contribution degrees of the feature items included in that combination is calculated as an index value for the combination, and the specific feature combination is identified based on the index values calculated for the feature combinations. As the method of identification, the feature combination with the largest index value may be taken as the specific feature combination; alternatively, the feature combinations may be ranked in descending order of index value and several top-ranked combinations (for example, the first to third) taken as specific feature combinations; or an index threshold may be set in advance and every feature combination whose index value is equal to or greater than the index threshold taken as a specific feature combination.
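The index value and the three selection strategies can be sketched as follows. All numbers, the combination labels, and the function name `index_value` are illustrative assumptions rather than values from the specification.

```python
from statistics import mean


def index_value(avg_error, item_contributions, use_sum=False):
    """Average error multiplied by the mean (or sum) of the item contributions."""
    agg = sum(item_contributions) if use_sum else mean(item_contributions)
    return avg_error * agg


# Hypothetical candidates: combination -> (average error, contributions of its items).
candidates = {
    "month=12 & week=3": (200.0, [0.40, 0.30]),
    "year=2015 & month=3": (300.0, [0.05, 0.40]),
    "week=4 & weekday=Sun": (150.0, [0.30, 0.25]),
}

scores = {name: index_value(err, contribs) for name, (err, contribs) in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

best = ranked[0]                                          # largest index value
top_k = ranked[:2]                                        # several top-ranked combinations
above = [name for name in ranked if scores[name] >= 50.0]  # index threshold
print(best, top_k, above)
```

Note that in this assumed data the combination with the largest average error is not ranked first, because one of its feature items has a very low contribution degree.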

Identifying the specific feature combination with the contribution degree taken into account makes it more likely that a new feature item generated from a feature combination containing high-contribution feature items is added to the feature definition information 22. This can further improve the prediction accuracy of the prediction model constructed by the prediction unit 28.

Embodiments of the present invention have been described above, but the present invention is not limited to these embodiments, and various modifications are possible without departing from the spirit of the invention.

DESCRIPTION OF SYMBOLS: 10 information processing apparatus, 12 storage unit, 14 control unit, 20 past data DB, 22 feature definition information, 24 past data classification unit, 26 feature generation unit, 28 prediction unit, 30 error calculation unit, 32 feature combination definition unit, 34 feature combination error calculation unit, 36 new feature item addition unit.

Claims

An information processing apparatus comprising:
a prediction unit that predicts, based on a learning data group that is a part of a past data group including past actual values for a target item to be predicted and for a related item group related to the target item, and on feature definition information in which a feature item group that is a candidate for items used for prediction among the related item group is defined, a predicted value for the target item of each verification data that is a part of the past data group and is to be verified;
an error calculation unit that calculates, for each verification data, an error between the predicted value predicted by the prediction unit and the actual value for the target item;
a feature combination error calculation unit that calculates, for each of a plurality of feature combinations each composed of a plurality of feature items selected from the feature item group included in the feature definition information and values for those feature items, a representative value of the errors of the plurality of verification data corresponding to the feature combination; and
a new feature item addition unit that adds, to the feature definition information, a new feature item defined by a specific feature combination identified from the plurality of feature combinations based on the representative value of the error corresponding to each feature combination.

The information processing apparatus according to claim 1, wherein the new feature item addition unit, after identifying the new feature item, adds the new feature item to the feature definition information when the average value of the errors for the verification data group calculated based on provisional feature definition information, to which the new feature item is provisionally added, is smaller than the average value of the errors for the verification data group calculated based on the feature definition information before the new feature item is provisionally added.

The information processing apparatus according to claim 1, wherein
the plurality of feature items included in the feature item group have a hierarchical relationship, and
the feature combination error calculation unit defines the feature combination by combining one feature item with another feature item belonging to a layer adjacent, in the hierarchical relationship, to the layer to which the one feature item belongs.

The information processing apparatus according to claim 1, wherein
the past data group is composed of a plurality of past data arranged in time series, and
the feature combination includes a value, in past data of interest, for a feature item selected from the feature item group and a value, in past data other than the past data of interest, for a feature item selected from the feature item group.

The information processing apparatus according to claim 1, wherein
the prediction unit calculates, for each feature item included in the feature item group, a contribution degree indicating the magnitude of the influence that the value for the feature item has on the value of the target item, and
the feature combination error calculation unit does not select a feature item whose contribution degree is equal to or less than a threshold when defining the feature combination.

The information processing apparatus according to claim 5, wherein the new feature item addition unit identifies the specific feature combination based on the representative value of the error corresponding to each feature combination and the contribution degrees of the plurality of feature items included in each feature combination.

An information processing program that causes a computer to function as:
a prediction unit that predicts, based on a learning data group that is a part of a past data group including past actual values for a target item to be predicted and for a related item group related to the target item, on each value for the related item group of each verification data included in a verification data group that is a part of the past data group, and on feature definition information in which a feature item group that is a candidate for items used for prediction is defined, a predicted value for the target item of each verification data that is a part of the past data group and is to be verified;
an error calculation unit that calculates, for each verification data, an error between the predicted value predicted by the prediction unit and the actual value for the target item;
a feature combination error calculation unit that calculates, for each of a plurality of feature combinations each composed of a plurality of feature items selected from the feature item group included in the feature definition information and values for those feature items, a representative value of the errors of the plurality of verification data corresponding to the feature combination; and
a new feature item addition unit that adds, to the feature definition information, a new feature item defined by a specific feature combination identified from the plurality of feature combinations based on the representative value of the error corresponding to each feature combination.