JP6972641B2

JP6972641B2 - Information processing apparatus and information processing program

Info

Publication number: JP6972641B2
Application number: JP2017089817A
Authority: JP
Inventors: 琢士田原; 軼謳王
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2017-04-28
Filing date: 2017-04-28
Publication date: 2021-11-24
Anticipated expiration: 2037-04-28
Also published as: JP2018190044A

Description

The present invention relates to an information processing apparatus and an information processing program.

Conventionally, machine learning has been used to predict a prediction target value. In one known example of such prediction processing, a learning process (so-called supervised learning) is performed on past data, that is, past actual values for a target item (objective variable) and for a plurality of related items (explanatory variables) related to the target item, to construct a prediction model. The prediction target value is then predicted based on the prediction model and the values of the related items associated with the prediction target value.

Past data may have a large number of related items, some of which have little influence on the value of the target item. Therefore, to improve the prediction accuracy of the prediction model, the model may be constructed using values for items (feature items) selected from the plurality of related items of the past data. Information consisting of such a plurality of feature items and their values is called a feature.

Techniques have been proposed for automatically selecting feature items from the plurality of related items of past data. For example, Patent Document 1 describes, in order to construct a more effective prediction model, extracting appropriate attributes (feature items) from a plurality of attributes (related items) based on the weights of a support vector machine and generating a feature function for constructing the prediction model. Patent Document 2 likewise describes a process of extracting, from a plurality of data items (related items), the combination of data items that gives the highest prediction accuracy.

Japanese Unexamined Patent Publication No. 2005-092681
Japanese Unexamined Patent Publication No. 2015-072644

As described above, a more accurate prediction model can be constructed by selecting appropriate feature items from the plurality of related items of the past data. An even more accurate prediction model may be obtained by using a new feature item defined by a combination of a plurality of related items. For example, when forecasting the sales of a store whose past data includes the related items "month" (which month) and "week" (which week of that month), using the new feature item "whether or not it is the fourth week of December" may yield a more accurate prediction model. Of course, it is essential that a new feature item be generated not from an arbitrary combination of the related items in the past data, but from an appropriate combination that improves the prediction accuracy of the prediction model.
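The "fourth week of December" example can be illustrated with a minimal sketch. The field names `month`, `week`, and `sales`, the helper `is_december_week4`, and the sample values are all assumptions made for illustration; they do not come from the patent.

```python
def is_december_week4(record: dict) -> int:
    """Return 1 if the record falls in the fourth week of December, else 0."""
    return int(record["month"] == 12 and record["week"] == 4)


# Made-up daily records; only "month" and "week" matter for the new feature item.
records = [
    {"month": 12, "week": 4, "sales": 980},
    {"month": 12, "week": 1, "sales": 610},
    {"month": 4, "week": 4, "sales": 590},
]

# The combined feature item is 1 only when both conditions hold at once,
# which neither "month" nor "week" can express on its own.
flags = [is_december_week4(r) for r in records]
```

A model that sees only `month` and `week` separately has to learn this interaction by itself, whereas the combined item exposes it directly.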

An object of the present invention is to automatically add, to the features used when predicting a prediction target value by machine learning, a new feature item based on a combination of a plurality of related items of the past data.

The invention according to claim 1 is an information processing apparatus comprising: a prediction unit that predicts a predicted value for the target item of each piece of verification data, the verification data being a part of a past data group and subject to verification, based on a learning data group that is a part of the past data group, the past data group including past actual values for a target item to be predicted and for a related item group related to the target item, and on feature definition information defining a feature item group that is a set of candidates, among the related item group, for items used for prediction; an error calculation unit that calculates, for each piece of verification data, the error between the predicted value predicted by the prediction unit and the actual value for the target item; a feature combination error calculation unit that calculates, for each of a plurality of feature combinations each consisting of a plurality of feature items selected from the feature item group included in the feature definition information and values for those feature items, a representative value of the errors of the pieces of verification data that match the feature combination; and a new feature item addition unit that generates a new feature item defined by a specific feature combination, the specific feature combination being identified from among the plurality of feature combinations based on the representative value of the error corresponding to each feature combination, and adds the new feature item to the feature definition information.
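The role of the feature combination error calculation unit in claim 1 can be sketched as follows. This is only an illustration under stated assumptions: the claim does not fix the representative value or the selection rule, so the mean error and "pick the combination with the largest mean error" are assumptions here, as are the function names `combination_errors` and `pick_combination` and the sample data.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean


def combination_errors(features, errors, items):
    """Group validation errors by pairs of (feature item, value) and average them.

    features: one dict per piece of verification data (feature item -> value)
    errors:   absolute errors |predicted - actual|, aligned with `features`
    items:    feature items eligible for combination
    """
    grouped = defaultdict(list)
    for feat, err in zip(features, errors):
        for a, b in combinations(items, 2):
            grouped[((a, feat[a]), (b, feat[b]))].append(err)
    # Assumption: the representative value is the mean of the matching errors.
    return {key: mean(errs) for key, errs in grouped.items()}


def pick_combination(rep_errors):
    """Assumed rule: choose the combination whose representative error is largest."""
    return max(rep_errors, key=rep_errors.get)


# Made-up verification data: the model is assumed to do badly in week 4 of December.
features = [
    {"month": 12, "week": 4},
    {"month": 12, "week": 4},
    {"month": 12, "week": 1},
    {"month": 4, "week": 4},
]
errors = [300.0, 260.0, 40.0, 30.0]

rep = combination_errors(features, errors, ["month", "week"])
specific = pick_combination(rep)


def new_feature_item(feat):
    """New feature item defined by the selected combination (1 if it matches)."""
    return int(all(feat[item] == value for item, value in specific))
```

The selected combination then becomes a single binary feature item that can be appended to the feature definition information.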

The invention according to claim 2 is the information processing apparatus according to claim 1, wherein, after identifying the new feature item, the new feature item addition unit adds the new feature item to the feature definition information when the average of the errors for the verification data group calculated based on tentative feature definition information, to which the new feature item has been tentatively added, is smaller than the average of the errors for the verification data group calculated based on the feature definition information before the tentative addition.
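The acceptance test of claim 2 amounts to a compare-then-commit step, sketched below. The `evaluate` callable stands in for "retrain the model and compute the mean error over the verification data group"; it, the function `try_add_feature`, and the error figures are hypothetical.

```python
def try_add_feature(definition, new_item, evaluate):
    """Keep `new_item` only if it lowers the mean validation error.

    definition: list of feature items currently defined
    evaluate:   callable returning the mean validation error for a definition
    """
    tentative = definition + [new_item]  # tentative feature definition information
    if evaluate(tentative) < evaluate(definition):
        return tentative
    return definition


# Made-up mean validation errors, keyed by the set of feature items in use.
fake_errors = {
    frozenset({"month", "week"}): 120.0,
    frozenset({"month", "week", "dec_week4"}): 95.0,
    frozenset({"month", "week", "noise"}): 130.0,
}


def evaluate(definition):
    return fake_errors[frozenset(definition)]


base = ["month", "week"]
accepted = try_add_feature(base, "dec_week4", evaluate)  # error drops, item kept
rejected = try_add_feature(base, "noise", evaluate)      # error rises, item dropped
```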

The invention according to claim 3 is the information processing apparatus according to claim 1, wherein the plurality of feature items included in the feature item group have a hierarchical relationship, and the feature combination error calculation unit defines each feature combination by combining one feature item with another feature item belonging to a layer adjacent, in the hierarchical relationship, to the layer to which the one feature item belongs.

The invention according to claim 4 is the information processing apparatus according to claim 1, wherein the past data group is composed of a plurality of pieces of past data arranged in time series, and each feature combination includes a value, for a feature item selected from the feature item group, relating to past data of interest, and a value, for a feature item selected from the feature item group, relating to past data other than the past data of interest.

The invention according to claim 5 is the information processing apparatus according to claim 1, wherein the prediction unit calculates, for each feature item included in the feature item group, a contribution indicating the magnitude of the influence that the value for that feature item has on the value of the target item, and the feature combination error calculation unit, in defining the feature combinations, does not select feature items whose contribution is equal to or less than a threshold.

The invention according to claim 6 is the information processing apparatus according to claim 5, wherein the new feature item addition unit identifies the specific feature combination based on the representative value of the error corresponding to each feature combination and on the contributions of the plurality of feature items included in each feature combination.

The invention according to claim 7 is an information processing program that causes a computer to function as: a prediction unit that predicts a predicted value for the target item of each piece of verification data, the verification data being a part of a past data group and subject to verification, based on a learning data group that is a part of the past data group, the past data group including past actual values for a target item to be predicted and for a related item group related to the target item, and on feature definition information defining a feature item group that is a set of candidates, among the related item group, for items used for prediction; an error calculation unit that calculates, for each piece of verification data, the error between the predicted value predicted by the prediction unit and the actual value for the target item; a feature combination error calculation unit that calculates, for each of a plurality of feature combinations each consisting of a plurality of feature items selected from the feature item group included in the feature definition information and values for those feature items, a representative value of the errors of the pieces of verification data that match the feature combination; and a new feature item addition unit that generates a new feature item defined by a specific feature combination, the specific feature combination being identified from among the plurality of feature combinations based on the representative value of the error corresponding to each feature combination, and adds the new feature item to the feature definition information.

According to the invention of claim 1 or 7, a new feature item based on a combination of a plurality of related items of the past data can be automatically added to the features used when predicting a prediction target value by machine learning.

According to the invention of claim 2, a new feature item can be added to the features after confirming that prediction accuracy improves when features including the new feature item are used.

According to the invention of claim 3 or 5, the number of feature combinations that are defined can be reduced compared with the case where combinations among all feature items are defined.

According to the invention of claim 4, a new feature item that can improve prediction accuracy further can be added, compared with the case where the feature combinations do not include values relating to past data other than the past data of interest.

According to the invention of claim 6, a new feature item that can improve prediction accuracy further can be added, compared with the case where the contribution is not taken into consideration.

A schematic configuration diagram of the information processing apparatus according to the present embodiment.
A diagram showing an example of the contents of the past data DB.
A diagram showing an example of the contents of the initial feature definition information.
A diagram showing an example of the contents of features.
A diagram showing an example of actual values and predicted values for each piece of verification data.
A diagram showing an example of feature combinations.
A diagram showing an example of the contents of the updated feature definition information.
A flowchart showing the processing flow of the information processing apparatus according to the first embodiment.
A graph showing the average error over repeated iterations of the new feature item addition process.
A diagram showing an example of the hierarchical relationship among feature items.
A diagram showing an example of the contribution of each feature item.

Hereinafter, embodiments of the present invention will be described.

<First Embodiment>
FIG. 1 shows a schematic configuration diagram of the information processing apparatus 10 according to the present embodiment. The information processing apparatus 10 may be a general-purpose computer, for example a server or a personal computer. As shown in FIG. 1, the information processing apparatus 10 includes a storage unit 12 and a control unit 14. Although not shown in FIG. 1, the information processing apparatus 10 may also include: a communication unit, composed of, for example, a network adapter, for communicating with other apparatuses via a communication line such as the Internet; a display unit, composed of, for example, a liquid crystal panel, for displaying the processing results of the information processing apparatus 10 (for example, the predicted values described later); and an input unit, composed of, for example, a mouse, a keyboard, or a touch panel, for receiving instructions from a user.

The information processing apparatus 10 is an apparatus that performs processing for predicting the future based on past actual data. This specification describes, as an example, processing in which the information processing apparatus 10 predicts the sales of a certain store (the prediction target store) on a prediction target date, but what the information processing apparatus 10 predicts is not limited to this.

The storage unit 12 is composed of, for example, a hard disk, a ROM (Read Only Memory), or a RAM (Random Access Memory). The storage unit 12 stores an information processing program for operating each unit of the information processing apparatus 10. The storage unit 12 also stores various control data and processing data. Further, as shown in FIG. 1, a past data DB 20 is defined in the storage unit 12, and feature definition information 22 is stored there.

The past data DB 20 accumulates past actual values for the target item to be predicted and past actual values for the related item group related to the target item. In the present embodiment, the target item is the sales of the prediction target store, and the past data DB 20 stores the actual values of that store's past sales. The related item group consists of various items related to sales (described in detail later), and the past data DB 20 stores the actual values for the related item group. These data may be stored in the past data DB 20 by the user, or may be collected and stored automatically.

FIG. 2 shows an example of the contents of the past data DB 20. Although FIG. 2 shows the past data DB 20 in table form, its data format is not limited to this. In the present embodiment, data (records) are accumulated in the past data DB 20 day by day. As shown in FIG. 2, the past data DB 20 stores the actual sales values of the prediction target store and, as related items related to sales, actual values for the year, month, day, weather, maximum temperature, minimum temperature, and so on. Of course, the related items may include various items regardless of how strongly they influence sales, for example the day of the week, whether the day is a holiday or a weekday, humidity, wind speed, traffic volume in front of the store, the average stock price, and the exchange rate.

In the present embodiment, one record in the past data DB 20 associates the actual sales value for a given day with that day's actual values for the related item group. In this specification, one record in the past data DB 20 is referred to as "past data". That is, as past data is accumulated one after another, the past data DB 20 comes to store a past data group. In the present embodiment, the past data DB 20 is assumed to hold three years of past data, from 2014 to 2016, for the prediction target store.

In the past data DB 20 of the present embodiment, each piece of past data is given a past data ID that uniquely identifies it.

As described in detail later, the information processing apparatus 10 predicts the sales value of the prediction target store on the prediction target date based on the past data group stored in the past data DB 20. Specifically, by learning the relationship between the actual value of each related item and the actual sales value in each piece of past data, a prediction model for predicting the sales value from the values of the related items is constructed, and the predicted sales value is obtained from this prediction model and the values of the related items for the prediction target date.

The feature definition information 22 defines a feature item group consisting of a plurality of feature items that are candidates for the items used to construct the prediction model. A feature item is an item based on the related item group included in the past data. As noted above, the related item group in the past data may include both items that are highly relevant to the target item, sales, and items that are only weakly relevant. In general, when a prediction model is constructed taking into account even the values of related items that are weakly relevant to the target item, its prediction accuracy tends to be poor. Conversely, constructing the prediction model from the values of appropriate related items can improve its prediction accuracy. In other words, defining an appropriate feature item group in the feature definition information 22 allows the prediction model to be constructed using appropriate items from the related item group, which improves the prediction accuracy of the model. The feature item group defined in the feature definition information 22 is therefore a factor that strongly affects the prediction accuracy of the prediction model and, by extension, of the prediction processing in the information processing apparatus 10. In some cases, all of the related items included in the past data may be defined as feature items.

As described in detail later, in the present embodiment new feature items are automatically added to the feature definition information 22 by the processing of the information processing apparatus 10. In this specification, feature definition information 22 to which the information processing apparatus 10 has not yet added any new feature item is referred to as "initial feature definition information 22a", and feature definition information 22 after new feature items have been added by the processing of the information processing apparatus 10 is referred to as "updated feature definition information 22b".

FIG. 3 shows an example of the contents of the initial feature definition information 22a in the present embodiment. The initial feature definition information 22a is created by the user. As shown in FIG. 3, the initial feature definition information 22a includes, as feature items, the year, month, day, week, day of the week, and whether the day is a holiday or a weekday. Each of these feature items is selected from the related item group of the past data group. The initial feature definition information 22a also defines the values that each feature item can take. For the feature items above, the number of values that can occur in the past data is small to begin with, so the values that can occur in the past data are defined directly as the values the feature items can take.

Further, the initial feature definition information 22a also includes feature items such as whether the maximum temperature is 15 degrees or higher, whether the minimum temperature is below 10 degrees, and whether it rained. In the past data group, the actual values for maximum and minimum temperature are continuous (that is, they can take many different values), and the actual values for weather can also vary widely, for example sunny, rainy, cloudy, or sunny then rainy. Using raw data that can take so many values directly when constructing the prediction model is not appropriate: the amount of processing can become enormous, while little improvement in prediction accuracy can be expected. Therefore, in the initial feature definition information 22a, for example, the feature item "whether the maximum temperature is 15 degrees or higher" is defined based on the related item "maximum temperature" of the past data, and two values are defined as the values this feature item can take: 1 (15 degrees or higher) and 0 (below 15 degrees). The maximum temperature threshold (15 degrees in this example) may be chosen by the user. In this way, for related items that can take many values, designing the feature items appropriately reduces the number of values each feature item can take, which simplifies the construction of the prediction model.

Each feature item defined in the feature definition information 22 is given a feature item ID that uniquely identifies it.

The control unit 14 is composed of, for example, a CPU (Central Processing Unit) or a microcontroller. The control unit 14 controls each unit of the information processing apparatus 10 based on the information processing program stored in the storage unit 12. As shown in FIG. 1, the control unit 14 also functions as a past data classification unit 24, a feature generation unit 26, a prediction unit 28, an error calculation unit 30, a feature combination definition unit 32, a feature combination error calculation unit 34, and a new feature item addition unit 36. Through these functions of the control unit 14, new feature items that can improve the prediction accuracy of the prediction model are added to the feature definition information 22. Each of these functions is described in detail below.

The past data classification unit 24 classifies the past data group stored in the past data DB 20 into a learning data group and a verification data group. The learning data group is used to construct the prediction model and consists of a plurality of pieces of past data that are a part of the past data group. The verification data group is used to verify the prediction accuracy of the constructed prediction model and likewise consists of a plurality of pieces of past data that are a part of the past data group. As described above, in the present embodiment the past data DB 20 stores three years of past data, from 2014 to 2016; the past data for the two years 2014 and 2015 is used as the learning data group, and the past data for 2016 is used as the verification data group. Of course, the method of dividing the data into the learning data group and the verification data group is not limited to this.
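The year-based split described above can be sketched as follows. The record layout (an `id` in YYYYMMDD form plus a `year` field) and the function name `split_by_year` are assumptions for illustration, and the sales figures are made up.

```python
def split_by_year(records, validation_year):
    """Split past data into a learning group and a verification group by year."""
    learning = [r for r in records if r["year"] < validation_year]
    verification = [r for r in records if r["year"] == validation_year]
    return learning, verification


# Made-up past data; real records would also carry the related item values.
past_data = [
    {"id": "20140401", "year": 2014, "sales": 520},
    {"id": "20150401", "year": 2015, "sales": 560},
    {"id": "20160401", "year": 2016, "sales": 610},
    {"id": "20160402", "year": 2016, "sales": 640},
]

learning_group, verification_group = split_by_year(past_data, validation_year=2016)
```

Splitting on the most recent year keeps the verification data strictly later than the learning data, which matches how the model would be used to predict future dates.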

The feature generation unit 26 generates, based on the past data stored in the past data DB 20 and the feature definition information 22, a feature consisting of a plurality of feature items and the value for each of those feature items. The feature generation unit 26 generates a plurality of features, one corresponding to each piece of past data included in the past data group, thereby producing a feature group. The feature items included in a feature need not be all of the feature items in the feature definition information 22; they may be a subset of the feature items defined there.

FIG. 4 shows an example of a feature group generated by the feature generation unit 26, based on the past data shown in FIG. 2 and the feature definition information 22 shown in FIG. 3.

For example, the feature I1 shown in FIG. 4 corresponds to the past data with past data ID "20140401" (see FIG. 2). The value for each feature item is determined from the values of the related items in that past data. In feature I1, the value "2014" is determined for the feature item "year", and the values for the other feature items are determined in the same way. The value "3" is determined for the feature item "day of the week" because that past data has the value "Tuesday" for the related item "day of the week". Also, since the value of the related item "maximum temperature" in that past data is "17.4 degrees", the value for the feature item "whether the maximum temperature is 15 degrees or higher" is determined to be "1", indicating 15 degrees or higher.

このようにして、素性生成部２６は、学習データ群及び検証データ群を含む全ての過去データ群について素性を生成する。なお、各素性には、素性と過去データとの関連を示すように過去データＩＤが付されている。生成された素性は、予測部２８及び素性組合せ誤差算出部３４に渡される。 In this way, the feature generation unit 26 generates features for all past data groups including the training data group and the verification data group. A past data ID is attached to each feature so as to indicate the relationship between the feature and the past data. The generated features are passed to the prediction unit 28 and the feature combination error calculation unit 34.

予測部２８は、まず、素性生成部２６が学習データ群に基づいて生成した各素性と、各学習データに含まれる対象項目の実績値（本実施形態では売上実績値）とに基づいて、予測モデルを構築する。 First, the prediction unit 28 builds a prediction model based on the features that the feature generation unit 26 generated from the learning data group and the actual value of the target item (the actual sales value in this embodiment) included in each piece of learning data.

予測モデルの構築には、種々の方法を用いることができる。例えば、各素性と実績値に基づいて生成される複数の弱識別器を組み合わせて予測モデルを構築するアンサンブル学習法であって、１つの弱識別器の学習結果を参考にして次の弱識別器を学習しつつ、予測値と実績値の誤差を定義した損失関数の勾配を考慮して予測モデルを構築する勾配ブースティング法を用いることができる。あるいは、学習データ群からサンプリングされた学習データに基づいて、非終端ノードにおいて識別（分類）に用いる素性項目をランダムに選択することで、相関の低い複数の決定木を作成し、当該複数の決定木を用いて予測モデルを構築するランダムフォレスト法を用いることができる。 Various methods can be used to build the prediction model. One example is the gradient boosting method, an ensemble learning method that builds a prediction model by combining a plurality of weak classifiers generated from the features and the actual values: each weak classifier is trained with reference to the learning result of the previous one, and the model is built taking into account the gradient of a loss function that defines the error between the predicted and actual values. Another example is the random forest method, which creates a plurality of weakly correlated decision trees from learning data sampled from the learning data group, randomly selecting the feature items used for discrimination (classification) at non-terminal nodes, and builds the prediction model from those decision trees.
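
As a rough illustration of the gradient boosting idea, the sketch below fits depth-1 regression stumps one after another, each to the residuals (the negative gradient of a squared-error loss) left by the previous ones. It is a minimal pure-Python sketch and not the embodiment's implementation: the squared-error loss, the stump learner, the learning rate, and all function names and data are assumptions.

```python
from statistics import mean


def fit_stump(rows, residuals):
    """Find the one-feature threshold split that minimises squared error."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [e for r, e in zip(rows, residuals) if r[j] <= t]
            right = [e for r, e in zip(rows, residuals) if r[j] > t]
            if not left or not right:
                continue
            lv, rv = mean(left), mean(right)
            sse = sum((e - lv) ** 2 for e in left) + sum((e - rv) ** 2 for e in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lv, rv)
    return best[1:] if best else None


def predict_stump(stump, row):
    j, t, lv, rv = stump
    return lv if row[j] <= t else rv


def fit_gradient_boosting(rows, targets, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals of the current predictions."""
    base = mean(targets)
    stumps, preds = [], [base] * len(rows)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(targets, preds)]
        stump = fit_stump(rows, residuals)
        if stump is None:
            break
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, r) for p, r in zip(preds, rows)]
    return base, stumps, learning_rate


def predict(model, row):
    base, stumps, lr = model
    return base + lr * sum(predict_stump(s, row) for s in stumps)


# Hypothetical learning data: (month, week) features and actual sales values.
rows = [(12, 4), (12, 1), (6, 4), (6, 1)]
targets = [100.0, 50.0, 60.0, 40.0]
model = fit_gradient_boosting(rows, targets)
```

Because each stump splits on a single feature item, a model of this kind sums per-item effects and does not express an interaction such as "month 12 and week 4" through any single split.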

次に、予測部２８は、構築した予測モデルと、素性生成部２６が検証データ群に基づいて生成した各素性とに基づいて、各検証データの対象項目の予測値（本実施形態では売上予測値）を予測する。ここで、各検証データは、売上実績値を既に有しているのであるが、予測部２８は、素性定義情報２２に新素性項目を追加するために、予測部２８は、既知である各検証データの売上予測値の予測を行う。 Next, the prediction unit 28 predicts the value of the target item (the predicted sales value in this embodiment) for each piece of verification data, based on the constructed prediction model and the features that the feature generation unit 26 generated from the verification data group. Each piece of verification data already has an actual sales value; the prediction unit 28 nevertheless predicts the sales value of each piece of verification data, whose actual value is known, so that a new feature item can be added to the feature definition information 22.

予測部２８により予測処理が行われると、各検証データに対応する複数の売上予測値が得られる。図５に、各検証データが有する売上実績値に対応する売上予測値の例が示されている。予測部２８により予測された当該複数の売上予測値は、対応する過去データを示す過去ＩＤと、当該過去データの売上実績値と関連付けられて誤差算出部３０に渡される。 When the forecasting process is performed by the forecasting unit 28, a plurality of sales forecast values corresponding to each verification data are obtained. FIG. 5 shows an example of a sales forecast value corresponding to the actual sales value of each verification data. The plurality of sales forecast values predicted by the forecast unit 28 are associated with a past ID indicating the corresponding past data and the actual sales value of the past data, and are passed to the error calculation unit 30.

誤差算出部３０は、検証データが有する対象項目の実績値（本実施形態では売上実績値）と、当該検証データについて予測部２８が予測した対象項目の予測値（本実施形態では売上予測値）との誤差を算出する。誤差算出部３０は、各検証データそれぞれについて誤差を算出する。これにより、各検証データに対応する複数の誤差が算出される。例えば、図５の例では、誤差算出部は、過去データＩＤ「２０１６０４０１」が示す検証データに対して誤差「２」を算出し、過去データＩＤ「２０１６０４０２」が示す検証データに対して誤差「４」を算出し、以下同様に各検証データについての誤差を算出する。 The error calculation unit 30 calculates the error between the actual value of the target item in a piece of verification data (the actual sales value in this embodiment) and the value of the target item that the prediction unit 28 predicted for that verification data (the predicted sales value in this embodiment). The error calculation unit 30 calculates this error for each piece of verification data, so that a plurality of errors, one per piece of verification data, are obtained. For example, in the example of FIG. 5, the error calculation unit 30 calculates an error of "2" for the verification data indicated by the past data ID "20160401", an error of "4" for the verification data indicated by the past data ID "20160402", and likewise an error for each remaining piece of verification data.

誤差算出部３０は、各検証データについての複数の誤差を算出すると、全検証データについての誤差の平均値（以下「全平均誤差」と記載する）を算出する。言うまでもないが、全平均誤差は、各誤差の合計値を検証データ数で割ることで算出される。後述のように、全平均誤差は、新素性項目を追加する処理を終えるときなどの判断材料となる。 When the error calculation unit 30 calculates a plurality of errors for each verification data, the error calculation unit 30 calculates the average value of the errors for all the verification data (hereinafter referred to as “total average error”). Needless to say, the total average error is calculated by dividing the total value of each error by the number of verification data. As will be described later, the total average error can be used as a judgment material when the process of adding a new feature item is completed.
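
The per-record errors and the total average error can be sketched as below. The text does not state how a single error is defined, so the absolute difference used here is an assumption, as are the function names and the sales figures (chosen only so that the errors come out as 2 and 4, as in the FIG. 5 example).

```python
def per_record_errors(actual, predicted):
    """Error per verification record, keyed by past data ID (absolute difference assumed)."""
    return {pid: abs(actual[pid] - predicted[pid]) for pid in actual}


def total_average_error(errors):
    """Sum of all errors divided by the number of verification records."""
    return sum(errors.values()) / len(errors)


# Hypothetical actual and predicted sales values.
actual = {"20160401": 100, "20160402": 120}
predicted = {"20160401": 102, "20160402": 116}

errors = per_record_errors(actual, predicted)
overall = total_average_error(errors)
```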

誤差算出部３０により算出された各検証データについての複数の誤差は、各検証データの過去データＩＤと関連付けられて素性組合せ誤差算出部３４に渡される。 A plurality of errors for each verification data calculated by the error calculation unit 30 are associated with the past data ID of each verification data and passed to the feature combination error calculation unit 34.

素性組合せ定義部３２は、素性定義情報２２に基づいて、複数の素性組合せを定義する。素性組合せとは、素性定義情報２２において定義されている複数の素性項目と、当該複数の素性項目に対する複数の値からなるものである。図６に、素性組合せの具体例が示されている。図６に示された素性組合せは、図３に示された素性定義情報２２に基づいて定義されたものである。 The feature combination definition unit 32 defines a plurality of feature combinations based on the feature definition information 22. The feature combination is composed of a plurality of feature items defined in the feature definition information 22 and a plurality of values for the plurality of feature items. FIG. 6 shows a specific example of the feature combination. The feature combination shown in FIG. 6 is defined based on the feature definition information 22 shown in FIG.

例えば、図６に示された素性組合せＰ１は、素性項目「月」に対する値「１２」と、素性項目「週」に対する値「３」との組み合わせとなっている。図６に示された素性組合せは一例であり、本実施形態においては、素性組合せ定義部３２は、素性定義情報２２が有する素性項目と取り得る値との間で実現可能な組み合わせの全てを定義する。ただし、同じ素性項目同士の組み合わせは定義しないものとする。例えば、素性項目「月」に対する値「１」と、同じ素性項目「月」に対する値「２」との組み合わせは定義しないものとする。なお、図６に示されるように、各素性組合せに対しては、素性組合せを一意に識別する組合せＩＤが付されている。素性組合せ定義部３２により定義された複数の素性組合せは素性組合せ誤差算出部３４に渡される。 For example, the feature combination P1 shown in FIG. 6 is a combination of the value "12" for the feature item "month" and the value "3" for the feature item "week". The feature combinations shown in FIG. 6 are examples; in this embodiment, the feature combination definition unit 32 defines every feasible combination of the feature items in the feature definition information 22 and their possible values. However, combinations within the same feature item are not defined. For example, the combination of the value "1" for the feature item "month" and the value "2" for the same feature item "month" is not defined. As shown in FIG. 6, each feature combination is given a combination ID that uniquely identifies it. The feature combinations defined by the feature combination definition unit 32 are passed to the feature combination error calculation unit 34.
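
One way to enumerate such combinations is sketched below. For brevity it is restricted to pairs of two different feature items, which is an assumption (the text does not limit the number of items in a combination); the feature definition, the value ranges, and the sequential combination IDs are likewise hypothetical and do not reproduce the IDs in FIG. 6.

```python
from itertools import combinations, product


def define_feature_combinations(feature_definition):
    """Enumerate every pair of (item, value) conditions over two different items."""
    combos = {}
    # combinations() never pairs an item with itself, so "month=1 and month=2"
    # style combinations are excluded by construction.
    for item_a, item_b in combinations(feature_definition, 2):
        for val_a, val_b in product(feature_definition[item_a], feature_definition[item_b]):
            combo_id = f"P{len(combos) + 1}"
            combos[combo_id] = {item_a: val_a, item_b: val_b}
    return combos


# Hypothetical feature definition: feature item -> possible values.
feature_definition = {
    "month": range(1, 13),
    "week": range(1, 6),
    "max_temp_ge_15": (0, 1),
}
combos = define_feature_combinations(feature_definition)
```

Even this small definition yields 94 pairs, which hints at why exhaustive enumeration can become expensive as feature items are added.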

素性組合せ誤差算出部３４は、素性組合せ定義部３２が定義した複数の素性組合せそれぞれについて、検証データ群のうち、当該素性組合せに該当する複数の検証データに関する複数の誤差の代表値を算出する。 The feature combination error calculation unit 34 calculates representative values of a plurality of errors related to a plurality of verification data corresponding to the feature combination in the verification data group for each of the plurality of feature combinations defined by the feature combination definition unit 32.

素性組合せに該当する検証データとは、当該素性組合せに含まれる素性項目に対する値を有する検証データである。例えば、素性組合せが、素性項目「月」に対する値「１２」と、素性項目「週」に対する値「３」との組み合わせである場合、当該素性組合せに該当する検証データとは、関連項目「月」に対する値が「１２月」であり、且つ、関連項目「週」に対する値が「３」の検証データである。すなわち、本実施形態では、２０１６年１２月の第３週に対応する７つの検証データが当該素性組合せに該当する複数の検証データということになる。このようにして、素性組合せ毎に、該当する複数の検証データが特定される。なお、素性組合せに該当する検証データの特定は、素性生成部２６が検証データ群に基づいて生成した各素性に基づいて行う。 Verification data corresponding to a feature combination is verification data that has the values specified for the feature items included in that feature combination. For example, when the feature combination is the combination of the value "12" for the feature item "month" and the value "3" for the feature item "week", the corresponding verification data is the verification data whose value for the related item "month" is "December" and whose value for the related item "week" is "3". That is, in this embodiment, the seven pieces of verification data for the third week of December 2016 are the verification data corresponding to this feature combination. In this way, the corresponding verification data are identified for each feature combination. The identification is performed using the features that the feature generation unit 26 generated from the verification data group.

素性組合せ誤差算出部３４は、各素性組合せに該当する複数の検証データを特定すると、素性組合せ毎に、該当する複数の検証データに関する誤差の代表値を演算する。本実施形態では、代表値として平均値が算出されるが、代表値としては例えば中央値などであってもよい。具体的には、素性組合せ誤差算出部３４は、ある素性組合せを選択し、素性生成部２６から渡された各素性に基づいて、当該素性組合せに該当する複数の検証データの複数の過去データＩＤを特定する。次いで、誤差算出部３０から渡された複数の誤差のうち、特定した複数の過去データＩＤに関連付けられた複数の誤差を抽出する。そして、抽出した複数の誤差の平均値を演算する。個の平均値が、当該素性組合せの誤差の平均値（以下「平均誤差」と記載する）となる。このようにして、素性組合せ誤差算出部３４は、素性組合せ定義部３２が定義した全ての素性組合せの平均誤差を算出する。 Having identified the verification data corresponding to each feature combination, the feature combination error calculation unit 34 computes, for each feature combination, a representative value of the errors of the corresponding verification data. In this embodiment the mean is used as the representative value, but another representative value such as the median may be used. Specifically, the feature combination error calculation unit 34 selects a feature combination and, based on the features passed from the feature generation unit 26, identifies the past data IDs of the verification data corresponding to that feature combination. Next, from the errors passed from the error calculation unit 30, it extracts the errors associated with the identified past data IDs. It then computes the mean of the extracted errors. This mean is the average of the errors of the feature combination (hereinafter the "average error"). In this way, the feature combination error calculation unit 34 calculates the average error of every feature combination defined by the feature combination definition unit 32.
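
The three steps just described (identify matching past data IDs, extract their errors, take the mean) might look like the following sketch. The data structures and names are hypothetical, and skipping combinations that match no verification data is an assumption, since the text does not say how that case is handled.

```python
from statistics import mean


def matches(feature, combo):
    """True when the feature has the combination's value for every one of its items."""
    return all(feature.get(item) == value for item, value in combo.items())


def combination_average_errors(combos, features, errors):
    """Average error per combination ID over the verification data that match it."""
    result = {}
    for combo_id, combo in combos.items():
        ids = [f["past_data_id"] for f in features if matches(f, combo)]
        if ids:  # assumption: combinations with no matching record are skipped
            result[combo_id] = mean([errors[i] for i in ids])
    return result


# Hypothetical verification features and per-record errors.
features = [
    {"past_data_id": "20161215", "month": 12, "week": 3},
    {"past_data_id": "20161216", "month": 12, "week": 3},
    {"past_data_id": "20161201", "month": 12, "week": 1},
]
errors = {"20161215": 6, "20161216": 10, "20161201": 2}
combos = {
    "P1": {"month": 12, "week": 3},
    "P2": {"month": 12, "week": 1},
    "P3": {"month": 1, "week": 1},
}
avg_errors = combination_average_errors(combos, features, errors)
```

Swapping `mean` for `statistics.median` would give the median variant mentioned in the text.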

新素性項目追加部３６は、素性組合せ誤差算出部３４により算出された、各素性組合せの誤差の代表値（本実施形態では平均誤差）に基づいて、素性組合せ定義部３２が定義した複数の素性組合せの中から特定素性組合せを特定する。特定素性組合せの特定方法は後述するが、基本的には平均誤差が大きい素性組合せが特定素性組合せとされる。 The new feature item addition unit 36 specifies a specific feature combination from among the feature combinations defined by the feature combination definition unit 32, based on the representative value of the errors of each feature combination (the average error in this embodiment) calculated by the feature combination error calculation unit 34. The method of specifying the specific feature combination is described later; basically, a feature combination with a large average error is taken as the specific feature combination.

本実施形態においては、新素性項目追加部３６は、各素性組合せの平均誤差が降順となるように複数の素性組合せに対して順位付けを行い、そのうち１位の素性組合せを特定素性組合せとする。すなわち、複数の素性組合せのうち、最も平均誤差が大きかった１つの素性組合せを特定素性組合せとする。なお、新素性項目追加部３６は、上記以外の特定方法により特定素性組合せを特定するようにしてもよい。例えば、各素性組合せの平均誤差が降順となるように複数の素性組合せに対して順位付けを行った上で、当該順位付けにおいて上位にいる複数の素性組合せ（例えば１〜３位など）を特定素性組合せとしてもよい。あるいは、予め誤差閾値を設けておき、当該誤差閾値以上の平均誤差が算出された全ての素性組合せを特定素性組合せとして特定するようにしてもよい。このような特定方法を採用した場合は、複数の素性組合せが特定素性組合せとして特定され得る。 In this embodiment, the new feature item addition unit 36 ranks the feature combinations in descending order of average error and takes the first-ranked feature combination as the specific feature combination. That is, the one feature combination with the largest average error is taken as the specific feature combination. The new feature item addition unit 36 may instead specify the specific feature combination by a method other than the above. For example, after ranking the feature combinations in descending order of average error, it may take several top-ranked feature combinations (for example, the 1st to 3rd) as specific feature combinations. Alternatively, an error threshold may be set in advance, and every feature combination whose average error is equal to or greater than the threshold may be specified as a specific feature combination. With these methods, a plurality of feature combinations can be specified as specific feature combinations.
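
The selection rules described here (largest average error, top few by rank, or everything at or above a threshold) can be sketched as two small helpers. The function names and the average-error values are hypothetical.

```python
def select_top_k(avg_errors, k=1):
    """Combination IDs ranked by average error, descending; keep the first k."""
    ranked = sorted(avg_errors, key=avg_errors.get, reverse=True)
    return ranked[:k]


def select_by_threshold(avg_errors, threshold):
    """All combination IDs whose average error is at or above the threshold."""
    return [cid for cid, err in avg_errors.items() if err >= threshold]


# Hypothetical average errors per combination ID.
avg_errors = {"P1": 8.0, "P2": 2.0, "P3": 5.5, "P4": 7.0}

first = select_top_k(avg_errors)           # the embodiment's default: rank 1 only
top_three = select_top_k(avg_errors, k=3)  # variant: ranks 1 to 3
over = select_by_threshold(avg_errors, 5.5)
```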

次いで、新素性項目追加部３６は、特定素性組合せに基づいて新素性項目を生成する。本実施形態では、特定素性組合せに該当するか否かという新素性項目が生成される。例えば、素性項目「月」に対する値「１２」と、素性項目「週」に対する値「４」との組み合わせからなる特定素性組合せが特定された場合、新素性項目追加部３６は、「１２月の第４週か」という新素性項目を生成する。そして、当該新素性項目に対する取り得る値として、「１（はい）」及び「０（いいえ）」を生成する。 Next, the new feature item addition unit 36 generates a new feature item based on the specific feature combination. In this embodiment, the new feature item expresses whether or not the specific feature combination applies. For example, when the specific feature combination consisting of the value "12" for the feature item "month" and the value "4" for the feature item "week" is specified, the new feature item addition unit 36 generates the new feature item "is it the fourth week of December?". It then generates "1 (yes)" and "0 (no)" as the possible values for that new feature item.
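
A new feature item of this kind is simply a binary indicator over the combination. A minimal sketch, with a hypothetical naming scheme and function names:

```python
def make_new_feature_item(combo):
    """Build a binary feature item meaning 'does this record match the combination?'."""
    name = "is_" + "_and_".join(f"{item}_{value}" for item, value in combo.items())

    def evaluate(feature):
        # 1 (yes) when every item has the combination's value, otherwise 0 (no).
        return 1 if all(feature.get(i) == v for i, v in combo.items()) else 0

    return name, evaluate


# Specific feature combination from the example: month = 12 and week = 4.
name, is_dec_week4 = make_new_feature_item({"month": 12, "week": 4})

in_week4 = is_dec_week4({"month": 12, "week": 4})
other_week = is_dec_week4({"month": 12, "week": 3})
```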

そして、新素性項目追加部３６は、生成した新素性項目を素性定義情報２２に追加する。これにより、更新素性定義情報２２ｂが生成される。なお、複数の特定素性組合せが特定された場合は、新素性項目追加部３６は、複数の特定素性組合せにそれぞれ対応する複数の新素性項目及びそれらに対する複数の取り得る値を生成し、素性定義情報２２に追加する。 The new feature item addition unit 36 then adds the generated new feature item to the feature definition information 22, which produces the updated feature definition information 22b. When a plurality of specific feature combinations have been specified, the new feature item addition unit 36 generates a new feature item and its possible values for each of the specific feature combinations and adds them to the feature definition information 22.

図７に、新素性項目が追加された更新素性定義情報２２ｂの例が示されている。図７に示された更新素性定義情報２２ｂは、図３に示す初期素性定義情報２２ａに対して、上述の例の１つの新素性項目が追加されたものである。 FIG. 7 shows an example of updated feature definition information 22b to which a new feature item is added. The updated feature definition information 22b shown in FIG. 7 is obtained by adding one new feature item in the above example to the initial feature definition information 22a shown in FIG.

このように、新素性項目追加部３６により、初期素性定義情報２２ａにおいて定義されていた複数の素性項目とそれらに対する値の組合せである素性組合せから定義される新素性項目が素性定義情報２２に追加される。更新素性定義情報２２ｂに基づいて、予測部２８により新予測モデルが構築された場合、当該新予測モデルは、初期素性定義情報２２ａに基づいて構築された旧予測モデルに対して予測精度の向上が期待されるものとなる。 In this way, the new feature item addition unit 36 adds to the feature definition information 22 a new feature item defined from a feature combination, that is, a combination of feature items defined in the initial feature definition information 22a and values for them. When the prediction unit 28 builds a new prediction model based on the updated feature definition information 22b, the new prediction model is expected to have better prediction accuracy than the old prediction model built from the initial feature definition information 22a.

特に、多数定義され得る素性組合せの中から、平均誤差が大きい特定素性組合せに基づく新素性項目が追加されるから、新予測モデルの予測精度が旧予測モデルの予測精度を上回ることがより期待できる。例えば、初期素性定義情報２２ａに素性項目「月」及び「週」が含まれている場合、初期素性定義情報２２ａに基づく予測モデルは、月別の売上の変動と、週別の売上の変動とが別個に考慮されて構築され、特定の月と特定の週の組合せまで考慮されて構築されない。この場合、例えば、毎年恒例の特定のイベントによる特定の月の特定の週における突発的な売上の変動が、当該予測モデルにおいて好適に反映されないことになる。本実施形態によれば、例えば、素性項目「月」に対する値「１２」及び素性項目「週」に対する値「４」の素性組合せに基づく新素性項目が追加されることで、予測部２８は、１２月の第４週にのみ発生する突発的な売り上げの変動を考慮した予測モデルを構築することができる。 In particular, because the new feature item that is added is based on a specific feature combination with a large average error, chosen from among the many feature combinations that can be defined, the prediction accuracy of the new prediction model can be expected all the more to exceed that of the old prediction model. For example, when the initial feature definition information 22a includes the feature items "month" and "week", a prediction model based on the initial feature definition information 22a is built with monthly sales fluctuations and weekly sales fluctuations considered separately; the combination of a particular month and a particular week is not considered. In that case, for example, a sudden sales fluctuation in a particular week of a particular month caused by an annual event is not suitably reflected in the prediction model. According to this embodiment, adding a new feature item based on, for example, the feature combination of the value "12" for the feature item "month" and the value "4" for the feature item "week" allows the prediction unit 28 to build a prediction model that takes into account a sudden sales fluctuation occurring only in the fourth week of December.

また、本実施形態における素性組合せの特定方法によれば、１回の新素性項目追加処理において１つの特定素性組合せが特定され、それにより１つの新素性項目が追加される。これにより、１回の新素性項目追加処理における素性定義情報２２の変化が最小限に抑えられ、すなわち１回の新素性項目追加処理における予測モデルの変動量が最小限に抑えられる。これは、新素性項目追加処理を繰り返し行って予測モデルを繰り返し変動させていくことを前提とすると、１回の新素性項目の追加処理によって予測モデルの予測精度が向上しない場合があることに鑑みると、予測モデルの変動量を最小に抑えて徐々に変化させていくことは、予測モデルの予測精度をかえってより早期に向上させることに繋がる。 Further, according to the method of specifying the feature combination in this embodiment, one specific feature combination is specified in one round of the new feature item addition process, and one new feature item is added accordingly. This minimizes the change to the feature definition information 22 in one round, and hence the amount by which the prediction model changes in one round. Assuming that the new feature item addition process is repeated so that the prediction model is changed repeatedly, and given that a single round of adding a new feature item may fail to improve the prediction accuracy, keeping each change to the prediction model small and changing it gradually in fact leads to improving its prediction accuracy sooner.

なお、新素性項目追加処理を繰り返し行う場合、２回目以降の処理においては、１回目の処理で追加された新素性項目を含む素性組わせを定義することが可能である。例えば、１回目の処理で図７に示すような更新素性定義情報２２ｂが生成された場合、２回目の処理において、素性組合せ定義部３２は、例えば、素性組合せ「１２月の第４週か」に対する値「１」と、素性項目「最高気温は１５度以上か」に対する値「１」との素性組合せを定義することが可能である。 When the new feature item addition process is repeated, the second and subsequent rounds can define feature combinations that include a new feature item added in an earlier round. For example, when the updated feature definition information 22b shown in FIG. 7 is generated in the first round, the feature combination definition unit 32 can, in the second round, define a feature combination of the value "1" for "is it the fourth week of December?" and the value "1" for the feature item "is the maximum temperature 15 degrees or higher?".

上述のように、更新素性定義情報２２ｂに基づいて構築された予測モデルの予測精度は、必ずしも、初期素性定義情報２２ａに基づいて構築された予測モデルの予測精度より向上するとは限らない。したがって、新素性項目追加部３６は、特定素性組合せに基づいて定義された新素性項目が初期素性定義情報２２ａに追加されることによって、予測モデルの予測精度が向上したことを確認した上で、当該新素性項目を初期素性定義情報２２ａに追加するようにしてもよい。具体的な処理の流れは以下の通りである。 As described above, the prediction accuracy of the prediction model constructed based on the updated feature definition information 22b is not always higher than the prediction accuracy of the prediction model constructed based on the initial feature definition information 22a. Therefore, the new feature item addition unit 36 confirms that the prediction accuracy of the prediction model is improved by adding the new feature item defined based on the specific feature combination to the initial feature definition information 22a. The new feature item may be added to the initial feature definition information 22a. The specific processing flow is as follows.

まず、新素性項目追加部３６は、上述と同様の処理によって、素性組合せ定義部３２が定義した複数の素性組合せの中から平均誤差が最大となる特定素性組合せを特定し、当該特定素性組合せに基づいて新素性項目を生成する。そして、新素性項目追加部３６は、生成した新素性項目を素性定義情報２２に仮追加する。これにより、新素性項目が仮追加された仮素性定義情報が生成される。 First, by the same processing as described above, the new feature item addition unit 36 specifies, from among the feature combinations defined by the feature combination definition unit 32, the specific feature combination with the largest average error, and generates a new feature item based on it. The new feature item addition unit 36 then provisionally adds the generated new feature item to the feature definition information 22. This produces provisional feature definition information to which the new feature item has been provisionally added.

その後、素性生成部２６以下の各部は、当該仮素性定義情報を用いて上述と同様の処理を行う。具体的には、素性生成部２６は、当該仮素性定義情報と過去データ群に基づいて仮素性を生成し、予測部２８は、学習データ群の仮素性と売上実績値に基づいて仮予測モデルを構築し、仮予測モデルと検証データ群の仮素性に基づいて各検証データに対する予測値を算出する。誤差算出部３０は、各検証データの売上実績値と、仮予測モデルを用いて予測された各売上予測値との複数の誤差を算出する。そして、誤差算出部３０は、全検証データに対応する複数の誤差の平均値（以下「仮全平均誤差」と記載する）を算出する。 After that, the feature generation unit 26 and the subsequent units perform the same processing as described above using the provisional feature definition information. Specifically, the feature generation unit 26 generates provisional features based on the provisional feature definition information and the past data group; the prediction unit 28 builds a provisional prediction model based on the provisional features of the learning data group and the actual sales values, and calculates a predicted value for each piece of verification data based on the provisional prediction model and the provisional features of the verification data group. The error calculation unit 30 calculates the errors between the actual sales value of each piece of verification data and the sales value predicted with the provisional prediction model. The error calculation unit 30 then calculates the average of the errors over all verification data (hereinafter the "provisional total average error").

ここで、新素性項目追加部３６は、新素性項目を生成する処理において算出された、各検証データの売上実績値と、初期素性定義情報２２ａに基づいて構築された予測モデルを用いて予測された各売上予測値との複数の誤差の平均値（以下「処理前全平均誤差」と記載する）と、仮全平均誤差とを比較する。仮全平均誤差が処理前全平均誤差よりも小さい場合は、新素性項目の追加により予測モデルの予測精度が向上したということだから、仮追加した当該新素性項目を正式に素性定義情報２２に追加する。一方、仮全平均誤差が処理前全平均誤差以上となった場合は、新素性項目の追加により予測モデルの予測精度が低下したということだから、新素性項目追加部３６は仮追加した新素性項目を削除する。つまり、当該新素性項目を素性定義情報２２に追加しない。 Here, the new feature item addition unit 36 compares the provisional total average error with the average of the errors, calculated in the process of generating the new feature item, between the actual sales value of each piece of verification data and the sales value predicted using the prediction model built from the initial feature definition information 22a (hereinafter the "pre-processing total average error"). If the provisional total average error is smaller than the pre-processing total average error, adding the new feature item has improved the prediction accuracy of the prediction model, so the provisionally added new feature item is formally added to the feature definition information 22. If, on the other hand, the provisional total average error is equal to or greater than the pre-processing total average error, adding the new feature item has lowered the prediction accuracy of the prediction model, so the new feature item addition unit 36 deletes the provisionally added new feature item; that is, the new feature item is not added to the feature definition information 22.
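
The provisional-add-then-confirm flow can be sketched as below. The `evaluate` callback is a hypothetical stand-in for the whole pipeline (generate features, build the model, predict the verification data, and return the total average error); representing the feature definition as a list of item names is also an assumption.

```python
def try_add_feature_item(definition, new_item, evaluate):
    """Keep new_item only if it lowers the total average error on the verification data."""
    error_before = evaluate(definition)        # pre-processing total average error
    provisional = definition + [new_item]      # provisional feature definition
    provisional_error = evaluate(provisional)  # provisional total average error
    if provisional_error < error_before:
        return provisional, provisional_error  # formally add the new item
    return definition, error_before            # discard the provisional item


# Stub evaluators with made-up error values, for illustration only.
def improves(definition):
    return 4.2 if "is_dec_week4" in definition else 5.0


def worsens(definition):
    return 5.3 if "is_dec_week4" in definition else 5.0


initial = ["month", "week"]
accepted, accepted_error = try_add_feature_item(initial, "is_dec_week4", improves)
rejected, rejected_error = try_add_feature_item(initial, "is_dec_week4", worsens)
```

Building `provisional` as a new list leaves the original definition untouched, so rejecting the item requires no explicit rollback.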

以下、図８に示すフローチャートに従って、本実施形態に係る情報処理装置１０の処理の流れを説明する。 Hereinafter, the processing flow of the information processing apparatus 10 according to the present embodiment will be described with reference to the flowchart shown in FIG.

ステップＳ１０において、ユーザは、過去データＤＢ２０に過去データ群を格納すると共に、初期素性定義情報２２ａを記憶部１２に記憶させる。 In step S10, the user stores the past data group in the past data DB 20 and stores the initial feature definition information 22a in the storage unit 12.

ステップＳ１２において、過去データ分類部２４は、過去データＤＢ２０に格納された過去データ群を学習データ群と検証データ群とに分類する。 In step S12, the past data classification unit 24 classifies the past data group stored in the past data DB 20 into a learning data group and a verification data group.

ステップＳ１４において、素性生成部２６は、初期素性定義情報２２ａに基づいて、ステップＳ１０で過去データＤＢ２０に格納された各過去データについて、複数の素性項目と各素性項目に対する複数の値からなる素性を生成する。 In step S14, based on the initial feature definition information 22a, the feature generation unit 26 generates, for each piece of past data stored in the past data DB 20 in step S10, a feature consisting of a plurality of feature items and the values for those feature items.

ステップＳ１６において、予測部２８は、ステップＳ１４で学習データ群に基づいて生成された各素性と、各学習データに含まれる売上実績値とに基づいて、予測モデルを構築する。 In step S16, the prediction unit 28 builds a prediction model based on each feature generated based on the learning data group in step S14 and the actual sales value included in each learning data.

ステップＳ１８において、予測部２８は、ステップＳ１６で構築した予測モデルと、ステップＳ１４で検証データ群に基づいて生成された各素性とに基づいて、各検証データに対する売上予測値を予測する。 In step S18, the prediction unit 28 predicts the sales forecast value for each verification data based on the prediction model constructed in step S16 and each feature generated based on the verification data group in step S14.

ステップＳ２０において、誤差算出部３０は、各検証データについて、売上実績値と、ステップＳ１８で算出された売上予測値との誤差を算出する。 In step S20, the error calculation unit 30 calculates an error between the actual sales value and the predicted sales value calculated in step S18 for each verification data.

ステップＳ２２において、誤差算出部３０は、今回のステップＳ２０で算出された全検証データについての複数の誤差の平均値（以下「今回全平均誤差」と記載する）が、前回のステップＳ２０で算出された全検証データについての複数の誤差の平均値（以下「前回全平均誤差」と記載する）よりも小さく、且つ、今回全平均誤差と前回全平均誤差との差分が閾値以下であるか否かを判断する。当該条件を満たす場合は処理を終了し、満たさない場合はステップＳ２４に進む。今回は、初回処理のため、前回平均誤差の値が存在しないことから、ステップＳ２４へ進む。ステップＳ２２の詳細については後述する。 In step S22, the error calculation unit 30 determines whether the average of the errors for all verification data calculated in the current step S20 (hereinafter the "current total average error") is smaller than the average of the errors for all verification data calculated in the previous step S20 (hereinafter the "previous total average error"), and whether the difference between the current total average error and the previous total average error is equal to or less than a threshold. If the condition is satisfied, the process ends; if not, the process proceeds to step S24. On this first pass no previous total average error exists, so the process proceeds to step S24. Step S22 is described in detail later.

ステップＳ２４において、素性組合せ定義部３２は、初期素性定義情報２２ａに基づいて、複数の素性項目と、当該複数の素性項目に対する複数の値からなる複数の素性組合せを定義する。 In step S24, the feature combination definition unit 32 defines a plurality of feature items and a plurality of feature combinations consisting of a plurality of values for the plurality of feature items based on the initial feature definition information 22a.

ステップＳ２６において、素性組合せ誤差算出部３４は、ステップＳ２４で定義された素性組合せについて、ステップＳ２０で算出された各検証データに対応する複数の誤差のうち、当該素性組合せに該当する複数の検証データについての複数の誤差を抽出し、抽出した複数の誤差の平均値（平均誤差）を算出する。これをステップＳ２４で定義された各素性組合せについて行い、各素性組合せに対応する平均誤差が算出される。 In step S26, the feature combination error calculation unit 34 has a plurality of verification data corresponding to the feature combination among the plurality of errors corresponding to each verification data calculated in step S20 for the feature combination defined in step S24. Multiple errors are extracted, and the average value (average error) of the extracted multiple errors is calculated. This is performed for each feature combination defined in step S24, and the average error corresponding to each feature combination is calculated.

ステップＳ２８において、新素性項目追加部３６は、ステップＳ２４において定義された複数の素性組合せのうち、ステップＳ２６で算出された平均誤差が最大である素性組合わせを特定する。そして、特定された素性組合わせに基づいて、新素性項目及び当該新素性項目に対する取り得る値を生成する。 In step S28, the new feature item addition unit 36 identifies the feature combination having the largest average error calculated in step S26 among the plurality of feature combinations defined in step S24. Then, based on the specified feature combination, a new feature item and a possible value for the new feature item are generated.

ステップＳ３０において、新素性項目追加部３６は、ステップＳ２８で生成された新素性項目及び当該新素性項目に対する取り得る値を素性定義情報２２に追加する。これにより、更新素性定義情報２２ｂが生成される。 In step S30, the new feature item addition unit 36 adds the new feature item generated in step S28 and the possible values for the new feature item to the feature definition information 22. As a result, the update feature definition information 22b is generated.

ステップＳ３０の処理後、再度ステップＳ１４へ戻り、再度のステップＳ１４において、素性生成部２６は、ステップＳ３０で生成された更新素性定義情報２２ｂに基づいて、各過去データについて新たな素性を生成する。以後、再度ステップＳ１６からステップＳ２２まで、新たな素性に基づいて、同様の処理が行われる。 After the processing of step S30, the process returns to step S14 again, and in step S14 again, the feature generation unit 26 generates new features for each past data based on the updated feature definition information 22b generated in step S30. After that, the same processing is performed again from step S16 to step S22 based on the new features.

このように、本実施形態では、ステップＳ１４からステップＳ３０の処理が繰り返され、素性定義情報２２に順次新素性項目が追加されていく。これにより、予測部２８が構築する予測モデルの予測精度が順次向上していくことが期待される。 As described above, in the present embodiment, the processes of steps S14 to S30 are repeated, and new feature items are sequentially added to the feature definition information 22. As a result, it is expected that the prediction accuracy of the prediction model constructed by the prediction unit 28 will be gradually improved.

ステップＳ２２に示されている条件が、当該繰り返し処理の終了条件となっている。上述の通り、ステップＳ２２の条件は、今回全平均誤差が前回全平均誤差よりも小さく、且つ、今回全平均誤差と前回全平均誤差との差分が閾値以下であるか否かという条件である。換言すれば、今回全平均誤差は前回全平均誤差よりも小さくなったが、全平均誤差が前回に対してあまり変わらなくなった場合に、繰り返し処理を終了する。 The condition shown in step S22 is the end condition of the iterative process. As described above, the condition of step S22 is whether or not the current total average error is smaller than the previous total average error and the difference between the current total average error and the previous total average error is equal to or less than the threshold value. In other words, the total average error this time is smaller than the previous total average error, but when the total average error does not change much from the previous time, the iterative process ends.

これは、新素性項目追加処理の繰り返し回数と全平均誤差との関係は、一般に図９に示す関係を有していることに基づくものである。図９に示すように、新素性項目追加処理の繰り返し回数が比較的少ないときは、１回の新素性項目追加処理によって全平均誤差が比較的大きく減少するが、新素性項目追加処理の繰り返し回数が比較的多くなってくると、１回の新素性項目追加処理によって全平均誤差があまり変わらなくなってくる。したがって、今回全平均誤差と前回全平均誤差との差分が閾値以下となったときは、全平均誤差がそれ以上劇的に減少しないと判断できることから、新素性項目追加処理の繰り返し処理を終了する。 This is based on the fact that the relationship between the number of repetitions of the new feature item addition process and the total average error generally follows the curve shown in FIG. 9. As shown in FIG. 9, while the number of repetitions is still small, one round of the new feature item addition process reduces the total average error by a relatively large amount, but as the number of repetitions grows, one round changes the total average error very little. Therefore, when the difference between the current total average error and the previous total average error falls to or below the threshold, it can be judged that the total average error will not decrease dramatically any further, and the repetition of the new feature item addition process is ended.
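
The end condition of step S22 can be written as a small predicate. This is a sketch with hypothetical names; representing the missing previous value on the first pass as `None` is an assumption.

```python
def should_stop(current_error, previous_error, threshold):
    """Step S22: stop when the error improved, but by no more than the threshold."""
    if previous_error is None:  # first pass: no previous total average error exists
        return False
    improved = current_error < previous_error
    return improved and (previous_error - current_error) <= threshold


first_pass = should_stop(5.0, None, 0.1)      # nothing to compare against yet
big_gain = should_stop(4.0, 5.0, 0.1)         # still improving a lot: continue
small_gain = should_stop(4.95, 5.0, 0.1)      # improvement has flattened out: stop
got_worse = should_stop(5.2, 5.0, 0.1)        # not smaller than before: continue
```

Note that under this condition a round that makes the error worse does not end the loop, which is one reason a cap on the number of rounds or feature items can be a useful additional end condition.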

Conditions other than the above may also be adopted as the termination condition for the repetition of the new feature item addition process. For example, the condition may be that the number of feature items included in the feature definition information 22 has reached a predetermined number, or that the number of repetitions of the new feature item addition process has reached a predetermined number.
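The termination conditions described above can be sketched as a single check. This is a minimal illustration, not the implementation of the embodiment: the function name `should_stop` and all parameter names are hypothetical, and combining the step S22 condition with the alternative conditions by a logical OR is an assumption made only for the sketch.

```python
def should_stop(prev_error, curr_error, threshold,
                n_features=None, max_features=None,
                n_iterations=None, max_iterations=None):
    """Return True when the repetition of the new feature item addition
    process should end (hypothetical helper, for illustration only)."""
    # Step S22: the total average error decreased, but by no more than the threshold.
    improved = curr_error < prev_error
    small_change = (prev_error - curr_error) <= threshold
    if improved and small_change:
        return True
    # Alternative condition: the number of feature items reached a predetermined number.
    if max_features is not None and n_features is not None:
        if n_features >= max_features:
            return True
    # Alternative condition: the number of repetitions reached a predetermined number.
    if max_iterations is not None and n_iterations is not None:
        if n_iterations >= max_iterations:
            return True
    return False
```

With a threshold of 0.5, a drop from 10.0 to 9.9 ends the repetition, while a drop from 10.0 to 8.0 does not, because the error is still decreasing substantially.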

<Second Embodiment>
The second embodiment is based on the first embodiment but differs in the processing performed by the feature combination definition unit 32. In the first embodiment, the feature combination definition unit 32 defines every combination that can be formed from the feature items in the feature definition information 22 and their possible values. This allows every feature combination to be a candidate for the specific feature combination without omission, but the number of defined feature combinations may become enormous, and the processing load on the feature combination definition unit 32, the feature combination error calculation unit 34, or the new feature item addition unit 36 may become heavy.

In view of this, in the second embodiment, the feature combination definition unit 32 does not define feature combinations whose average error is unlikely to be large, which reduces the number of defined feature combinations. This in turn reduces the amount of processing in the new feature item addition process. The processing of the feature combination definition unit 32 in the second embodiment is described below.

First, in the second embodiment, a hierarchical relationship among the plurality of feature items is defined in the feature definition information 22. The hierarchical relationship may be defined in advance by a user or the like.

FIG. 10 shows an example of the hierarchical relationship in this embodiment. In the example of FIG. 10, the feature item "year" is defined as the first layer, the feature item "month" as the second layer, the feature item "week" as the third layer, and the feature item "day of the week" as the fourth layer. In this way, the hierarchical relationship among the feature items follows the actual concept that each feature item represents. Of course, the hierarchical relationship shown in FIG. 10 is only an example, and the hierarchical relationship among feature items can be defined in various ways.

The feature combination definition unit 32 defines feature combinations based on the hierarchical relationship defined in the feature definition information 22. Specifically, the feature combination definition unit 32 first selects one feature item to be included in the feature combination. Next, it selects, as the other feature item, a feature item belonging to a layer adjacent in the hierarchical relationship to the layer to which the first feature item belongs. It then defines a feature combination by combining the first feature item (and a value for it) with the other feature item (and a value for it). In other words, in the second embodiment, the feature combination definition unit 32 does not define a feature combination of one feature item with a feature item belonging to a layer that is not adjacent to the layer of the first.

This is explained using the example of FIG. 10. The feature combination definition unit 32 defines, for example, feature combinations of the feature item "year" in the first layer with the feature item "month" in the second layer, but does not define feature combinations of the feature item "year" in the first layer with the feature item "week" in the third layer. Specifically, it defines the feature combination of the value "2015" for the feature item "year" and the value "3" for the feature item "month", but does not define the combination of the value "2015" for the feature item "year" and the value "4" for the feature item "week".
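The adjacent-layer rule described above can be sketched as follows. This is only an illustration under stated assumptions: the list `FEATURE_ITEMS`, its value ranges, and the function `define_adjacent_combinations` are hypothetical, and layers are assumed to be numbered from 1 (top) downward as in FIG. 10.

```python
from itertools import combinations, product

# Hypothetical feature items modelled on FIG. 10: (name, layer, possible values).
FEATURE_ITEMS = [
    ("year", 1, [2015, 2016]),
    ("month", 2, list(range(1, 13))),
    ("week", 3, [1, 2, 3, 4, 5]),
    ("day_of_week", 4, list(range(7))),
]

def define_adjacent_combinations(feature_items):
    """Define feature combinations only for feature items in adjacent layers."""
    combos = []
    for (name_a, layer_a, vals_a), (name_b, layer_b, vals_b) in combinations(feature_items, 2):
        if abs(layer_a - layer_b) != 1:
            continue  # skip pairs whose layers are not adjacent
        for value_a, value_b in product(vals_a, vals_b):
            combos.append(((name_a, value_a), (name_b, value_b)))
    return combos

combos = define_adjacent_combinations(FEATURE_ITEMS)
```

Here the pair ("year", 2015) and ("month", 3) is defined, whereas no pair that joins "year" with "week" is produced.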

For example, in the past data, it is relatively rare for a combination of the related items "year" and "week", which are far apart in the hierarchy, to have a particular effect on sales (for example, sales fluctuating abruptly only in the fourth week of each month of 2015). It can therefore be said that the average error of a feature combination containing feature items far apart in the hierarchy is unlikely to be large. Accordingly, by not defining feature combinations that include feature items far apart in the hierarchy, the feature combination definition unit 32 can reduce the number of defined feature combinations while retaining those whose average error is likely to be large.

In the second embodiment, when a new feature item is added to the feature definition information 22, the new feature item is placed in the deeper of the layers of the two feature items included in the feature combination on which it is based. For example, when the new feature item "Is it March 2015?" is defined from the feature combination of the value "2015" for the feature item "year" in the first layer and the value "3" for the feature item "month" in the second layer, the new feature item is placed in the second layer, which is the deeper of the first and second layers.

As a result, further new feature item addition processing can form a feature combination of the feature item "Is it March 2015?" in the second layer with the feature item "week" in the third layer. In this way, feature combinations that include three or more feature items can also be defined along the hierarchical relationship.
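The layer assignment for a new feature item can be sketched as below. The names `LAYERS`, `add_new_feature`, `can_combine`, and `is_march_2015` are hypothetical, and the sketch assumes that a larger layer number means a deeper layer.

```python
# Hypothetical layer table modelled on FIG. 10 (1 = top layer).
LAYERS = {"year": 1, "month": 2, "week": 3, "day_of_week": 4}

def add_new_feature(layers, new_name, item_a, item_b):
    """Return a copy of the layer table with the new feature item placed in
    the deeper of the layers of the two feature items it was built from."""
    updated = dict(layers)
    updated[new_name] = max(layers[item_a], layers[item_b])
    return updated

def can_combine(layers, item_a, item_b):
    """Two feature items may be combined only if their layers are adjacent."""
    return abs(layers[item_a] - layers[item_b]) == 1

layers = add_new_feature(LAYERS, "is_march_2015", "year", "month")
```

After this step `is_march_2015` sits in layer 2, so it can be combined with "week" (layer 3) but not with "day_of_week" (layer 4).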

<Third Embodiment>
The third embodiment is based on the first embodiment but differs from the first and second embodiments in the processing performed by the feature combination definition unit 32. In the first and second embodiments, the values included in a feature combination were all defined as values of the same past data (the past data of interest). For example, the feature combination P1 shown in FIG. 6 means that, for a single piece of past data of interest, the value for the feature item "month" is "12" and the value for the feature item "week" is "3".

When the past data group consists of a plurality of past data arranged in a time series, prediction accuracy may improve if the predicted value takes into account the values for related items across multiple pieces of past data. When the past data group consists of past data accumulated daily, as in this embodiment, the sales forecast for the prediction target day may be affected by, for example, whether the day before the prediction target day is a holiday or a weekday, or whether the day after it is a holiday or a weekday. In such a case, constructing the prediction model so that it also considers whether the day before or after the prediction target day is a holiday can further improve the prediction accuracy of the prediction model.

In view of this, in the third embodiment, the feature combination definition unit 32 defines feature combinations that include a value, taken from the past data of interest, for a feature item selected from the feature item group defined in the feature definition information 22, and a value, taken from past data other than the past data of interest, for a feature item likewise selected from the feature item group.

For example, in the third embodiment, the feature combination definition unit 32 defines a feature combination consisting of the value "0 (weekday)" of the past data of interest (the current day) for the feature item "holiday or weekday" and the value "1 (holiday)" of the previous day for the feature item "holiday or weekday". The two feature items included in a feature combination need not be the same; for example, a feature combination may be defined consisting of the current day's value "0 (weekday)" for the feature item "holiday or weekday" and the previous day's value "1 (Sunday)" for the feature item "day of the week". Furthermore, the values included in a feature combination need not come from consecutive past data in the time series; they may be the current day's value and a value from, for example, two days or one week earlier.

When such a feature combination is defined and the feature combination error calculation unit 34 finds that its average error is the largest, a new feature item based on it, such as "Is the current day a weekday and the previous day a holiday?", is added to the feature definition information 22. This allows the prediction unit 28 to construct a prediction model with improved prediction accuracy that takes into account values for related items across multiple pieces of past data.
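A feature combination spanning several days can be sketched as follows. The record layout, the `(item, lag, value)` encoding, and the function `matches` are assumptions introduced for illustration; a lag of 0 denotes the past data of interest, 1 the previous day, and a negative lag a later day. The sample records are made up.

```python
def matches(records, index, combination):
    """Check whether the record at `index` satisfies a feature combination
    whose values may refer to other days in the time series."""
    for item, lag, value in combination:
        j = index - lag
        if j < 0 or j >= len(records):
            return False  # the referenced day is outside the data
        if records[j][item] != value:
            return False
    return True

# Hypothetical daily records in time order (1 = holiday, 0 = weekday).
records = [
    {"date": "2015-03-07", "is_holiday": 1},
    {"date": "2015-03-08", "is_holiday": 1},
    {"date": "2015-03-09", "is_holiday": 0},
    {"date": "2015-03-10", "is_holiday": 0},
]

# "The current day is a weekday and the previous day is a holiday."
combination = [("is_holiday", 0, 0), ("is_holiday", 1, 1)]

# Values of the new feature item for each record.
new_column = [int(matches(records, i, combination)) for i in range(len(records))]
```

Only the third record (a weekday that follows a holiday) satisfies the combination, so `new_column` is `[0, 0, 1, 0]`.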

<Fourth Embodiment>
The fourth embodiment is based on the first embodiment but differs from it in that processing is performed based on a contribution degree, which indicates, for each feature item defined in the feature definition information 22, how strongly that feature item influences the predicted value of the target item (sales in this embodiment).

The contribution degree of each feature item is calculated by the prediction unit 28. The basic calculation method is as follows. First, as described above, the prediction unit 28 constructs a prediction model based on the feature group for the training data group generated by the feature generation unit 26 and the actual sales values included in the training data group, and calculates a sales forecast value for the verification data based on the prediction model and the feature for that verification data. Next, in the feature for the verification data, the value for the feature item of interest is changed randomly to generate a changed feature. The prediction unit 28 then calculates a sales forecast value for the verification data based on the prediction model and the changed feature. The larger the difference between the sales forecast value based on the changed feature and the previously calculated sales forecast value, the higher the contribution degree of the feature item of interest. The prediction unit 28 therefore assigns a higher contribution degree to the feature item of interest as that difference becomes larger.
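The contribution calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the method of the embodiment: the function `contribution`, the toy model `predict`, averaging the absolute difference over several random changes, and always drawing a value different from the current one are all choices made only for this sketch.

```python
import random

def contribution(predict, feature, item, candidate_values, n_trials=20, seed=0):
    """Estimate the contribution of `item` as the mean absolute change in the
    forecast when the value of `item` is randomly changed."""
    rng = random.Random(seed)
    baseline = predict(feature)
    # Assumption: "changing" the value means picking a different value.
    others = [v for v in candidate_values if v != feature[item]]
    if not others:
        return 0.0
    total = 0.0
    for _ in range(n_trials):
        changed = dict(feature)           # changed feature
        changed[item] = rng.choice(others)
        total += abs(predict(changed) - baseline)
    return total / n_trials

# Hypothetical stand-in for a trained prediction model:
# holidays add 50 to sales, the week number has no effect.
def predict(feature):
    return 100.0 + 50.0 * feature["is_holiday"] + 0.0 * feature["week"]

feature = {"is_holiday": 0, "week": 2}
holiday_contribution = contribution(predict, feature, "is_holiday", [0, 1])
week_contribution = contribution(predict, feature, "week", [1, 2, 3, 4, 5])
```

With this toy model the contribution of `is_holiday` is 50.0 and that of `week` is 0.0, reflecting that only the former changes the forecast.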

Through this processing, information indicating the contribution degree of each feature item is attached to the feature definition information 22, producing contribution-attached feature definition information 22-2. FIG. 11 shows an example of the contribution-attached feature definition information 22-2.

The feature combination definition unit 32 generates feature combinations based on the contribution degree of each feature item. Specifically, a contribution threshold is set in advance, and feature items whose contribution degree is equal to or less than the contribution threshold are not selected when defining feature combinations. If the feature combination error calculation unit 34 were to calculate the average error for a feature combination that includes a feature item with a low contribution degree, that average error would be unlikely to be large. Therefore, by not defining feature combinations that include feature items at or below the contribution threshold, the number of defined feature combinations can be reduced while retaining those whose average error is likely to be large.
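The contribution-based filtering can be sketched as below. The contribution values are made up for illustration (they are not taken from FIG. 11), and the function names are hypothetical.

```python
from itertools import combinations

def select_items(contributions, contribution_threshold):
    """Keep only feature items whose contribution exceeds the threshold."""
    return [item for item, c in contributions.items() if c > contribution_threshold]

def candidate_item_pairs(contributions, contribution_threshold):
    """Pairs of feature items that remain eligible for feature combinations."""
    return list(combinations(select_items(contributions, contribution_threshold), 2))

# Hypothetical contribution degrees per feature item.
contributions = {"month": 0.4, "week": 0.3, "day_of_week": 0.25, "year": 0.05}
pairs = candidate_item_pairs(contributions, 0.1)
```

With a threshold of 0.1, "year" is excluded, so only the three pairs among "month", "week", and "day_of_week" remain.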

In the first embodiment, the new feature item addition unit 36 specified the specific feature combination based on the average error calculated for each feature combination. In the fourth embodiment, it additionally uses the contribution degrees of the feature items included in each feature combination. For example, the product of the average error of a feature combination and the average or sum of the contribution degrees of the feature items included in that feature combination is calculated as the index value of the feature combination. The specific feature combination is then specified based on the index values calculated for the feature combinations. As methods of specifying the specific feature combination from the index values, for example, the feature combination with the largest index value may be chosen; or the feature combinations may be ranked in descending order of index value and several top-ranked feature combinations (for example, the first to third) chosen; or an index threshold may be set in advance and every feature combination whose index value is equal to or greater than the index threshold chosen.
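The index value and the three selection methods described above can be sketched as follows. The function names, the `mode` strings, and the sample numbers are hypothetical and serve only to illustrate the calculation.

```python
def index_value(average_error, item_contributions, use_sum=False):
    """Index value = average error x (mean or sum of the contributions)."""
    total = sum(item_contributions)
    aggregate = total if use_sum else total / len(item_contributions)
    return average_error * aggregate

def select_specific(index_values, mode="max", top_k=3, index_threshold=None):
    """Choose specific feature combinations from a {name: index value} mapping."""
    ranked = sorted(index_values.items(), key=lambda kv: kv[1], reverse=True)
    if mode == "max":          # the single largest index value
        return [ranked[0][0]]
    if mode == "top_k":        # the top-ranked combinations
        return [name for name, _ in ranked[:top_k]]
    if mode == "threshold":    # every combination at or above the index threshold
        return [name for name, value in ranked if value >= index_threshold]
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical index values for three feature combinations.
index_values = {"P1": 3.0, "P2": 1.0, "P3": 2.0}
```

For instance, an average error of 8.0 with contributions 0.5 and 0.25 gives an index value of 3.0 when the mean is used and 6.0 when the sum is used.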

Specifying the specific feature combination with the contribution degree taken into account makes it more likely that new feature items generated from feature combinations containing high-contribution feature items are added to the feature definition information 22. This can further improve the prediction accuracy of the prediction model constructed by the prediction unit 28.

Although embodiments of the present invention have been described above, the present invention is not limited to these embodiments, and various modifications can be made without departing from the spirit of the present invention.

10 information processing apparatus, 12 storage unit, 14 control unit, 20 past data DB, 22 feature definition information, 24 past data classification unit, 26 feature generation unit, 28 prediction unit, 30 error calculation unit, 32 feature combination definition unit, 34 feature combination error calculation unit, 36 new feature item addition unit.

Claims

An information processing apparatus comprising:
a prediction unit that, based on a training data group that is a part of a past data group including past actual values for a target item to be predicted and for a related item group related to the target item, and on feature definition information in which a feature item group consisting of candidates for items to be used for prediction among the related item group is defined, predicts a predicted value for the target item of each piece of verification data that is a part of the past data group and is subject to verification;
an error calculation unit that calculates, for each piece of the verification data, an error between the predicted value predicted by the prediction unit and the actual value for the target item;
a feature combination error calculation unit that calculates, for each of a plurality of feature combinations each consisting of a plurality of feature items selected from the feature item group included in the feature definition information and values for those feature items, a representative value of the errors for the plurality of pieces of verification data corresponding to that feature combination; and
a new feature item addition unit that generates a new feature item defined by a specific feature combination, which is specified from among the plurality of feature combinations based on the representative value of the errors corresponding to each feature combination, and adds the new feature item to the feature definition information.

After specifying the new feature item, the new feature item addition unit adds the new feature item to the feature definition information when the average value of the errors for the verification data group calculated based on pseudo-feature definition information, to which the new feature item has been provisionally added, is smaller than the average value of the errors for the verification data group calculated based on the feature definition information before the new feature item was provisionally added.
The information processing apparatus according to claim 1.

A plurality of feature items included in the feature item group have a hierarchical relationship.
The feature combination error calculation unit defines the feature combination by combining one feature item and the other feature item belonging to a layer adjacent to the layer to which the one feature item belongs in the hierarchical relationship.
The information processing apparatus according to claim 1.

The past data group is composed of a plurality of past data arranged in a time series.
The feature combination includes a value related to past data of interest for a feature item selected from the feature item group and a value related to past data other than the past data of interest for a feature item selected from the feature item group.
The information processing apparatus according to claim 1.

For each feature item included in the feature item group, the prediction unit calculates a contribution degree indicating how strongly the value for that feature item influences the value of the target item.
In defining the feature combination, the feature combination error calculation unit does not select a feature item whose contribution degree is equal to or less than a threshold value.
The information processing apparatus according to claim 1.

The new feature item addition unit identifies the specific feature combination based on the representative value of the error corresponding to each feature combination and the contribution of a plurality of feature items included in each feature combination.
The information processing apparatus according to claim 5.

An information processing program causing a computer to function as:
a prediction unit that, based on a training data group that is a part of a past data group including past actual values for a target item to be predicted and for a related item group related to the target item, the values for the related item group of each piece of verification data included in a verification data group that is a part of the past data group, and feature definition information in which a feature item group consisting of candidates for items to be used for prediction among the related item group is defined, predicts a predicted value for the target item of each piece of the verification data;
an error calculation unit that calculates, for each piece of the verification data, an error between the predicted value predicted by the prediction unit and the actual value for the target item;
a feature combination error calculation unit that calculates, for each of a plurality of feature combinations each consisting of a plurality of feature items selected from the feature item group included in the feature definition information and values for those feature items, a representative value of the errors for the plurality of pieces of verification data corresponding to that feature combination; and
a new feature item addition unit that generates a new feature item defined by a specific feature combination, which is specified from among the plurality of feature combinations based on the representative value of the errors corresponding to each feature combination, and adds the new feature item to the feature definition information.