JP2023183187A - Program, and information processing device

Info

Publication number: JP2023183187A
Application number: JP2022096679A
Authority: JP
Inventors: 俊亮広瀬; Toshiaki Hirose; 孝志森; Takashi Mori; 伊織三浦; Iori Miura; 青雲山根; Seiun YAMANE
Original assignee: Deloitte Touche Tohmatsu LLC
Current assignee: Deloitte Touche Tohmatsu LLC
Priority date: 2022-06-15
Filing date: 2022-06-15
Publication date: 2023-12-27
Anticipated expiration: 2042-06-15
Also published as: JP7143545B1

Abstract

To detect abnormal transactions in audit data on a per-transaction basis by unsupervised learning, and to present the cause of each abnormality. SOLUTION: First calculation means 113 calculates, for each of a plurality of transaction records included in data designated by a user, a first feature amount based on the correlation between a quantitative variable and a qualitative variable designated by a scenario selected by the user. Second calculation means 114 calculates, for each of the transaction records, a second feature amount based on the correlation between two qualitative variables designated by the selected scenario. Integrating means 115 calculates an integrated feature amount by integrating the first and second feature amounts calculated for each combination of variables indicated by the set scenario. Estimating means 116 estimates, based on the integrated feature amount, the probability that each transaction record is abnormal, and displays the result on a display unit 15. SELECTED DRAWING: Figure 6

Description

The present invention relates to a program and an information processing device used for audit work.

Audit work requires detecting abnormal transactions in journals and similar records, and identifying the variables that cause each abnormality. Doing this work manually is a heavy burden. Machine learning is therefore a candidate for detecting abnormal transactions in journals and similar records.

Supervised learning requires a considerable amount of labeled training data, but in most cases the individual transactions in audit data such as a journal are not labeled. The number of transactions in audit data is generally enormous, and labeling them to create the training data needed for machine learning is not realistic. Attempts have therefore been made to detect abnormal, possibly fraudulent transactions by unsupervised learning.

Data representing a transaction usually contains items on a ratio scale, such as the sales amount, together with items on a nominal scale, such as the business partner name and the name of the traded product, and items on an interval scale, such as dates. Under unsupervised learning, a nominal-scale item is not numeric and therefore cannot serve as a statistic as it is. One conceivable approach is to convert nominal-scale items into numbers, for example by one-hot encoding, and use them in the feature amount calculation.

Patent Document 1 describes a computer-implemented system and method that extracts attributes of data points, scales the attributes to numerical values, clusters the data points with a k-means clustering algorithm, and generates an outlier score for each data point to determine fraudulent data points.

Patent Document 2 describes a method in which a computer-implemented system detects fraudulent data points using locality-sensitive hashing and a local outlier factor algorithm. In this document, the computer-implemented system extracts attributes of the data points and scales them to numerical values, converts the scaled attributes into feature vectors, and computes the dot products of random vectors with the data points represented by the feature vectors to generate a locality-sensitive hash table.

Patent Document 1: Japanese Translation of PCT International Application Publication No. 2021-530013
Patent Document 2: Japanese Translation of PCT International Application Publication No. 2021-530017

However, even if qualitative variables (also called categorical variables) such as nominal-scale items are converted into numbers as in Patent Documents 1 and 2, the number of business partners and the like recorded in a journal is enormous, so adopting such a technique is not realistic.

Furthermore, these methods cannot handle multiple nominal-scale and ratio-scale items together while taking the correlations among them into account. As a result, even when these methods detect an abnormal transaction, it is difficult to estimate the cause, that is, which variable contributes to the abnormality.

One object of the present invention is to detect, by unsupervised learning, abnormal transactions in audit data such as sales data, journal slips, and journals on a per-transaction basis, and to present the cause of each abnormality.

In one aspect, the present invention provides a program for causing a computer to function as: setting means for setting combinations of variables of interest in data including a plurality of transaction records; first calculation means for calculating a first feature amount based on the correlation between a quantitative variable and a qualitative variable that are variables of interest included in each transaction record; second calculation means for calculating a second feature amount based on the correlation between two qualitative variables that are variables of interest included in each transaction record; and estimation means for estimating, for each transaction record, the possibility that the transaction record is abnormal, based on the first feature amount and the second feature amount calculated by the first calculation means and the second calculation means for each combination of variables set by the setting means.

In a preferred aspect, the estimation means integrates the first feature amounts and the second feature amounts calculated for the set combinations, using a coefficient determined for each, to calculate an integrated feature amount, and estimates the possibility based on the integrated feature amount. The coefficients are determined so that an index value obtained from the correlations between the integrated feature amount and each of the first and second feature amounts satisfies a predetermined criterion.

In a preferred aspect, the first calculation means classifies the data into a plurality of groups by qualitative variable according to the similarity of the tendencies of the quantitative variable included in the transaction records, and calculates the statistical rarity of each transaction record within its group as the first feature amount.

In a preferred aspect, the setting means sets, as the combination of variables of interest, a combination selected by the user from among a plurality of predetermined combinations.

In a preferred aspect, the program further causes the computer to function as acquisition means for acquiring the type of the data, and the setting means sets, as the combination of variables of interest, a combination corresponding to the type of the data from among a plurality of predetermined combinations.

In a preferred aspect, the qualitative variable is identification information of the department that conducted the transaction indicated by the transaction record.

In a preferred aspect, the quantitative variable is the amount of the transaction indicated by the transaction record.

In a preferred aspect, the first calculation means calculates, as the first feature amount, a quantity obtained from the proportion of the quantitative variable under the condition that the transaction record contains the qualitative variable.

In a preferred aspect, the second calculation means calculates the second feature amount using a lift value, which is the proportion of transaction records that contain both of the respective values of the two qualitative variables, divided by the product of the proportions of transaction records that contain each of those values.
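The lift value described here can be illustrated with a minimal Python sketch. The record layout and the field names `seller` and `product` are assumptions made for illustration, and the sketch computes only the lift itself, not the second feature amount that is derived from it.

```python
from collections import Counter

def lift_values(records, key_a, key_b):
    """Return {(a, b): lift} for every value pair that occurs in records.

    lift = P(a and b) / (P(a) * P(b)), with each proportion taken
    over all transaction records.
    """
    n = len(records)
    count_a = Counter(r[key_a] for r in records)
    count_b = Counter(r[key_b] for r in records)
    count_ab = Counter((r[key_a], r[key_b]) for r in records)
    return {
        (a, b): (c / n) / ((count_a[a] / n) * (count_b[b] / n))
        for (a, b), c in count_ab.items()
    }

# Hypothetical records: two qualitative variables per transaction.
records = [
    {"seller": "S1", "product": "P1"},
    {"seller": "S1", "product": "P1"},
    {"seller": "S2", "product": "P2"},
    {"seller": "S2", "product": "P1"},
]
lifts = lift_values(records, "seller", "product")
```

A lift below 1 means that the two values appear together less often than they would if they were independent, which is the kind of unusual pairing a feature built on the lift could pick up.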

In one aspect, the present invention provides an information processing device comprising: setting means for setting combinations of variables of interest in data including a plurality of transaction records; first calculation means for calculating a first feature amount based on the correlation between a quantitative variable and a qualitative variable that are variables of interest included in each transaction record; second calculation means for calculating a second feature amount based on the correlation between two qualitative variables that are variables of interest included in each transaction record; and estimation means for estimating, for each transaction record, the possibility that the transaction record is abnormal, based on the first feature amount and the second feature amount calculated by the first calculation means and the second calculation means for each combination of variables set by the setting means.

The present invention makes it possible to detect, by unsupervised learning, abnormal transactions in audit data such as sales data, journal slips, and journals on a per-transaction basis, and to present the cause of each abnormality.

FIG. 1 is a diagram showing an example of the configuration of the information processing device 1.
FIG. 2 is a diagram showing an example of the transaction DB 121.
FIG. 3 is a diagram showing an example of the transaction table 1212.
FIG. 4 is a diagram showing an example of the scenario DB 122.
FIG. 5 is a diagram showing an example of the setting DB 123.
FIG. 6 is a diagram showing an example of the functional configuration of the information processing device 1.
FIG. 7 is a diagram showing an example of the overall distribution of the quantitative variable of interest.
FIG. 8 is a diagram showing an example of the distribution of the quantitative variable of interest for each group.
FIG. 9 is a diagram showing an example of the numbers of simultaneous occurrences.
FIG. 10 is a diagram showing an example of lift values.
FIG. 11 is a flow diagram showing an example of the flow of operations for estimating the possibility that a transaction record is abnormal.
FIG. 12 is a flow diagram showing an example of the flow of operations for calculating the first feature amount.
FIG. 13 is a flow diagram showing an example of the flow of operations for calculating the second feature amount.

<Embodiment>
<Configuration of information processing device>
FIG. 1 is a diagram showing an example of the configuration of the information processing device 1. The information processing device 1 shown in FIG. 1 includes a processor 11, a memory 12, a communication unit 13, an operation unit 14, and a display unit 15. These components are communicably connected to one another, for example by a bus.

The processor 11 controls each part of the information processing device 1 by reading and executing a program stored in the memory 12. The processor 11 is, for example, a CPU (Central Processing Unit).

The operation unit 14 includes operators for issuing various instructions, such as operation buttons, a keyboard, a touch panel, and a mouse. It accepts an operation and sends a signal corresponding to the content of the operation to the processor 11. The operation is, for example, pressing a button or making a gesture on the touch panel.

The display unit 15 has a display screen such as a liquid crystal display, and displays images under the control of the processor 11. The transparent touch panel of the operation unit 14 may be overlaid on the display screen.

The communication unit 13 is a communication circuit that communicably connects the information processing device 1 to an external device or the like by wire or wirelessly.

The memory 12 is storage means for storing the operating system, various programs, data, and the like read by the processor 11. The memory 12 includes a RAM (Random Access Memory) and a ROM (Read Only Memory). The memory 12 may also include a solid state drive, a hard disk drive, or the like. The memory 12 stores a transaction DB 121, a scenario DB 122, and a setting DB 123.

<Configuration of transaction DB>
FIG. 2 is a diagram showing an example of the transaction DB 121. The transaction DB 121 is a database that stores tables describing a plurality of transactions, one table per piece of identification information of those transactions. The transaction DB 121 shown in FIG. 2 has a data ID list 1211 and transaction tables 1212.

The data ID list 1211 is a table that stores, in association with one another, a data ID identifying data that describes transactions, the name of that data, and a type ID indicating the type of that data. Each data ID listed in the data ID list 1211 is associated with one transaction table 1212.

The transaction table 1212 is a table that stores data including a plurality of transaction records. FIG. 3 is a diagram showing an example of the transaction table 1212. For example, the transaction table 1212 shown in FIG. 3(a) corresponds to the data ID "D1", and its items include "time", "place", "seller", "category", "product name", "unit price", "quantity", "amount", and so on. Each row in this table represents an item.

A transaction record is a record that has a value for each of these items. Each column in this table represents a transaction record. Each transaction record carries identification information, such as a serial number, that distinguishes it from other transaction records.

The values of the items are either quantitative variables or qualitative variables. A quantitative variable is expressed as a numerical value indicating a quantity, for example "time" or "amount". Quantitative variables are measured on an interval scale or a ratio scale.

A qualitative variable is not expressed as a numerical value indicating a quantity, for example "seller" or "product name". Qualitative variables are measured on a nominal scale or an ordinal scale.

<Configuration of scenario DB>
FIG. 4 is a diagram showing an example of the scenario DB 122. The scenario DB 122 is a database that stores predetermined scenarios. Here, a scenario is information that associates a combination of variables of interest in data including a plurality of transaction records with a procedure for calculating feature amounts from those variables. The scenarios are compiled in advance by auditors with a certain level of knowledge and experience.

The scenario DB 122 shown in FIG. 4 has the items "number", "scenario name", "feature description", and "variables used". Each scenario is also stored in association with a feature amount calculation procedure (not shown) that corresponds to the content of its feature description.

For example, in the scenario DB 122 shown in FIG. 4, scenario number 2 has the scenario name "The difference between the order date and the sales recording date is abnormally short". Its feature description is "A value representing the rarity of the day difference, conditioned on the variety". This scenario calculates the difference between the order date and the sales date included in the transaction record and uses that difference as a new quantitative variable. In other words, a variable used in a scenario may be a variable included in the transaction record as it is, or a variable generated from one or more variables included in the transaction record.
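As an illustration of a variable generated from two variables of the same record, the following sketch derives the day difference used by scenario number 2. The field names `order_date`, `sales_date`, `variety`, and `day_diff` are assumptions and are not taken from the figures.

```python
from datetime import date

def days_between(order_date, sales_date):
    """Number of days from the order date to the sales recording date."""
    return (sales_date - order_date).days

# Hypothetical transaction record with the two dates the scenario uses.
record = {
    "order_date": date(2022, 6, 1),
    "sales_date": date(2022, 6, 3),
    "variety": "V1",
}

# The derived value is treated as a new quantitative variable of the record.
record["day_diff"] = days_between(record["order_date"], record["sales_date"])
```

The derived `day_diff` can then be paired with the qualitative variable (here the variety) in the same way as any quantitative variable stored directly in the record.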

In a scenario, a variable used to evaluate a transaction record may be generated only from variables included in that record, or may be generated using variables included in other transaction records that have a specific relationship with it. For example, as with the ratio of the quantity to the previous year's average for the same product and the same customer, the variable may be generated using a value, such as an average, calculated from other transaction records that share a qualitative variable with the record and that were concluded within a period satisfying a certain condition relative to the record.
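The quantity ratio mentioned as an example could be computed along the following lines. This is a sketch under stated assumptions: the field names are hypothetical, and "the previous year" is taken to mean the calendar year before the record's own year.

```python
from statistics import mean

def quantity_ratio_to_prev_year(target, records):
    """Ratio of the target's quantity to the previous-year average quantity
    of records that share the same product and customer.

    Returns None when no comparable record exists or the average is zero.
    """
    peers = [
        r["quantity"]
        for r in records
        if r["product"] == target["product"]
        and r["customer"] == target["customer"]
        and r["year"] == target["year"] - 1
    ]
    if not peers:
        return None
    average = mean(peers)
    if average == 0:
        return None
    return target["quantity"] / average

# Hypothetical history; only the first two records are comparable.
history = [
    {"product": "P1", "customer": "C1", "year": 2021, "quantity": 10},
    {"product": "P1", "customer": "C1", "year": 2021, "quantity": 30},
    {"product": "P2", "customer": "C1", "year": 2021, "quantity": 500},
]
target = {"product": "P1", "customer": "C1", "year": 2022, "quantity": 60}
ratio = quantity_ratio_to_prev_year(target, history)
```

Returning `None` when there is no comparable history is a design choice of this sketch; the source does not say how such records are treated.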

<Configuration of setting DB>
FIG. 5 is a diagram showing an example of the setting DB 123. The setting DB 123 is a database that stores, for each data ID, the scenarios that the user has selected from among the scenarios stored in the scenario DB 122. The setting DB 123 shown in FIG. 5 has a data ID list 1231 and scenario number lists 1232.

The data ID list 1231 lists the data IDs that identify, among the data stored in the transaction DB 121, the data to be audited. Each data ID in the data ID list 1231 is associated with one scenario number list 1232.

The scenario number list 1232 lists the numbers of the scenarios that the user has set, via the operation unit 14, for the data identified by the data ID. For example, the setting DB 123 shown in FIG. 5 indicates that the user selected scenario numbers "2", "4", "7", ... for the data with the data ID "D1".

As shown in FIG. 5, the scenario number list 1232 may have a weighting coefficient column. This column stores the weighting coefficient by which each feature amount calculated in the scenario identified by the corresponding scenario number is multiplied. When one scenario generates a plurality of feature amounts, a weighting coefficient may be set for each of them.
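One way to picture the weighting coefficient column is a weighted sum over the feature amounts of the selected scenarios. The text here states only that each feature amount is multiplied by its coefficient, so summing the products is an assumption of this sketch, as are the coefficient and feature values.

```python
def integrate(features, weights):
    """Multiply each scenario's feature amount by its weighting coefficient
    and add the products (the summation is an assumption)."""
    return sum(weights[number] * value for number, value in features.items())

# Hypothetical settings for data ID "D1": scenario number -> coefficient.
weights = {2: 0.5, 4: 0.25, 7: 0.25}

# Hypothetical feature amounts of one transaction record, per scenario.
features = {2: 4.0, 4: 2.0, 7: 0.0}

integrated = integrate(features, weights)
```

With these values the record's integrated value is dominated by scenario 2, which shows how the coefficients let some scenarios count for more than others.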

<Functional configuration of information processing device>
FIG. 6 is a diagram showing an example of the functional configuration of the information processing device 1. The communication unit 13 of the information processing device 1 is omitted from FIG. 6.

By executing the program stored in the memory 12, the processor 11 of the information processing device 1 functions as acquisition means 111, setting means 112, first calculation means 113, second calculation means 114, integrating means 115, and estimation means 116.

The acquisition means 111 accepts, via the operation unit 14, an operation by which the user designates data. The acquisition means 111 then acquires the designated data from the transaction DB 121 stored in the memory 12. At this time, the processor 11 reads the list of scenarios from the scenario DB 122 and displays it on the display unit 15.

The user looks at the displayed list of scenarios and selects from it the scenarios to be used for the designated data. The setting means 112 accepts, via the operation unit 14, the operation by which the user selects the scenarios. The setting means 112 then stores the numbers or other identifiers of the selected scenarios in the setting DB 123.

In other words, the setting means 112 is an example of setting means that sets, as the combination of variables of interest, a combination selected by the user from among a plurality of combinations of variables specified by predetermined scenarios.

As a result, the combinations of variables of interest specified by the scenarios for the designated data are set in the setting DB 123. That is, the setting means 112 is an example of setting means that sets combinations of variables of interest in data including a plurality of transaction records.

For each of the plurality of transaction records included in the data designated by the user, the first calculation means 113 calculates, as the first feature amount, a feature amount based on the correlation between the quantitative variable and the qualitative variable specified by the scenario stored in the setting DB 123.

The set scenario specifies, for example, a pair consisting of a qualitative variable and a quantitative variable as the variables of interest. The reason is to estimate the degree of abnormality of the data not only from the rarity of the quantitative variable itself, but also from its rarity once the correlation with the qualitative variable is taken into account.

FIG. 7 is a diagram showing an example of the overall distribution of the quantitative variable of interest. The horizontal axis in FIG. 7 is the value of the quantitative variable listed as a variable of interest in the set scenario, and it is divided into a plurality of bins. The vertical axis is the number of data items whose value of the quantitative variable falls in the corresponding bin. The value indicated by the dash-dot line in FIG. 7 is the expected value of the quantitative variable, calculated, for example, by summing (value of the quantitative variable) × (number of data items having that value) / (total number of data items) over the values.
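The expected value described for FIG. 7 can be written out as a short sketch: each distinct value is weighted by the share of data items that have it. The sample amounts are made up for illustration.

```python
from collections import Counter

def expected_value(values):
    """Sum of value * (count of items with that value) / (total count)."""
    counts = Counter(values)
    total = len(values)
    return sum(value * count / total for value, count in counts.items())

# Hypothetical transaction amounts.
amounts = [100, 100, 200, 400]
ev = expected_value(amounts)
```

For raw, unbinned data this is the same as the arithmetic mean; the count-weighted form matches the histogram view of FIG. 7, where one value per bin would stand in for the items in that bin.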

Suppose that a certain transaction record in the data has the value indicated by the arrow in FIG. 7. This value is not far from the expected value in FIG. 7, so the record is unlikely to be judged abnormal when the data is viewed as a whole.

FIG. 8 is a diagram showing an example of the distribution of the quantitative variable of interest for each group. The horizontal and vertical axes in FIG. 8 are the same as in FIG. 7. The distribution of the quantitative variable shown in FIG. 7 is divided into (a), (b), and (c) according to the group to which the qualitative variable listed as a variable of interest in the scenario belongs.

For example, suppose the set scenario lists the department that conducted the transaction as the qualitative variable and the transaction amount as the quantitative variable. Based on this scenario, the information processing device 1 classifies the data into a plurality of groups by a statistical technique, for example according to the similarity of the tendencies of the quantitative variable. The statistical technique is, for example, a decision tree.
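The source names a decision tree only as one example and gives no details, so the following is a simplified stand-in rather than the described procedure. It performs a single regression-tree-style split: departments are ordered by mean amount, and the cut that minimizes the within-group sum of squared errors is chosen. The department names and amounts are hypothetical.

```python
from statistics import mean

def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(amounts_by_dept):
    """Split departments into two groups with similar amount tendencies.

    Departments are sorted by mean amount and every cut point in that
    order is tried; the cut with the smallest total SSE wins.
    """
    if len(amounts_by_dept) < 2:
        raise ValueError("need at least two departments to split")
    depts = sorted(amounts_by_dept, key=lambda d: mean(amounts_by_dept[d]))
    best = None
    for i in range(1, len(depts)):
        left, right = depts[:i], depts[i:]
        cost = (sse([v for d in left for v in amounts_by_dept[d]])
                + sse([v for d in right for v in amounts_by_dept[d]]))
        if best is None or cost < best[0]:
            best = (cost, left, right)
    return best[1], best[2]

# Hypothetical transaction amounts per department.
amounts_by_dept = {
    "A": [100, 110, 90],
    "B": [500, 520],
    "D": [105, 95],
}
low_group, high_group = best_split(amounts_by_dept)
```

Applying the same split again inside each resulting group would yield three or more groups, as in the grouping illustrated in FIG. 8.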

As described above, when the scenario lists the department that conducted the transaction as the qualitative variable, this is an example of a qualitative variable that is identification information of the department that conducted the transaction indicated by the transaction record. Likewise, when the scenario lists the transaction amount as the quantitative variable, this is an example of a quantitative variable that is the amount of the transaction indicated by the transaction record.

As a result, in FIG. 8, the transaction records of group 1, which includes departments A, D, and E, are classified into (a); those of group 2, which includes departments B and F, into (b); and those of group 3, which includes department C, into (c).

When the expected value is calculated for each group of classified transaction records, the expected values differ from group to group, as shown by the dash-dot lines in FIG. 8. The value indicated by the arrow in FIG. 7 is relatively far from the expected value within group 1, and may therefore be judged abnormal.
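The contrast between the overall view and the per-group view can be reproduced with a small numeric sketch. The group contents are invented for illustration; the value 280 plays the role of the record marked by the arrow and is assumed to belong to group 1.

```python
from statistics import mean

# Hypothetical transaction amounts after grouping by department.
groups = {
    "group1": [100, 110, 90, 105, 95, 280],
    "group2": [300, 310, 290],
    "group3": [500, 520, 480],
}

# Expected value over all records versus per group.
all_amounts = [v for values in groups.values() for v in values]
overall = mean(all_amounts)
by_group = {name: mean(values) for name, values in groups.items()}

value = 280  # the record under examination, a member of group 1
distance_overall = abs(value - overall)
distance_in_group = abs(value - by_group["group1"])
```

Here the record sits close to the overall expected value but far from the expected value of its own group, which is the situation the per-group distributions are meant to expose.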

The first calculation means 113 calculates an abnormality score indicating the degree of abnormality of each transaction record using the following equation (1), which includes the occurrence probability p of the quantitative variable given the occurrence of the qualitative variable. The abnormality score given by equation (1) is expressed as the logarithm of the reciprocal of the occurrence probability p, so that the smaller p is, the larger the score. This abnormality score is the first feature amount in the present invention.

This first feature amount contains information on the correlation between the qualitative variable and the quantitative variable specified by the scenario. In the example above, the first feature amount uses the conditional probability p, so it indicates the statistical rarity of the quantitative variable in a transaction record under the condition that the record contains the qualitative variable.
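Equation (1) is not reproduced in this text. Taking the description literally, the score has the form log(1/p), and the sketch below computes it within one group. Estimating p as the share of the group's records that fall in the same amount bin, the bin width, and the use of the natural logarithm are all assumptions of this sketch.

```python
import math
from collections import Counter

def anomaly_scores(amounts, bin_width):
    """Score each amount in one group as log(1 / p).

    p is estimated as the fraction of the group's records whose amount
    falls in the same bin (an assumption; the source only states that p
    is a conditional occurrence probability).
    """
    bins = [amount // bin_width for amount in amounts]
    counts = Counter(bins)
    n = len(amounts)
    return [math.log(n / counts[b]) for b in bins]

# Hypothetical amounts of the records in group 1.
group1 = [100, 110, 90, 105, 95, 280]
scores = anomaly_scores(group1, bin_width=50)
```

The record with amount 280 is alone in its bin, so its estimated p is the smallest in the group and its score is the largest, while the amounts clustered around 100 receive low scores.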

Therefore, when the quantitative variable is, for example, the transaction amount, the information processing device 1 calculates a relatively high abnormality score for a transaction record whose amount is not unusual in the data as a whole but is unusual in combination with a qualitative variable such as the person in charge of the transaction, the department, or the product handled. This detects abnormal combinations of qualitative and quantitative variables, such as "the transaction amount is too high for this department" or "this amount is too high for this product."

In this case, the first feature amount is obtained by dividing the data into groups and evaluating the degree of abnormality of each transaction record within its group. Consequently, even if the data as a whole is multimodal, the multimodality has relatively little effect on the detection of abnormality.

In other words, the first calculation means 113 is an example of a first calculation means that calculates a first feature amount based on the correlation between a quantitative variable and a qualitative variable that are the variables of interest included in each transaction record.

The first calculation means 113 is also an example of a first calculation means that classifies the data into a plurality of groups for each qualitative variable according to the similarity of the trends of the quantitative variable included in the transaction records, and calculates the statistical rarity of each transaction record within its group as the first feature amount.

The first calculation means 113 is also an example of a first calculation means that calculates, as the first feature amount, a quantity obtained from the proportion of the quantitative variable under the condition that the qualitative variable is included in the transaction record.

The second calculation means 114 shown in FIG. 6 calculates, as a second feature amount, a feature amount based on the correlation between two qualitative variables designated by the scenario stored in the setting DB 123, for each of the plurality of transaction records included in the data designated by the user.

The set scenario designates, for example, a pair of two qualitative variables as the variables of interest. The purpose is to determine how rare it is for the two qualitative values to appear together in one transaction record, and thereby to estimate the degree of abnormality of the data.

FIG. 9 is a diagram showing an example of the numbers of simultaneous occurrences. The table in FIG. 9 classifies all transaction records in a given data set by the combination of department and person in charge contained in each record, and counts the occurrences of each combination. For example, FIG. 9 shows that the combination of department A and person in charge α occurs as many as 8456 times in the data, whereas the combination of department B and person in charge γ occurs only once.

FIG. 10 is a diagram showing an example of lift values. The table in FIG. 10 shows, for each combination of department and person in charge in the table of FIG. 9, the probability of simultaneous occurrence and the lift value. The probability of simultaneous occurrence is the number of occurrences of the combination divided by the total number of occurrences. The lift value is the probability of simultaneous occurrence divided by the product of the probabilities that the department and the person in charge each occur on their own. The lift value is given, for example, by equation (2).

In equation (2), X and Y are both qualitative variables, and A and B are realized values of X and Y, respectively. p(X=A, Y=B) is the probability that X is A and Y is B; that is, the proportion of transaction records that contain both values of the two qualitative variables of interest. p(X=A) and p(Y=B) are the probability that X is A and the probability that Y is B, respectively. Therefore, p(X=A)p(Y=B) is the product of the proportions of transaction records that contain each of the two values. Lift(X=A, Y=B), the left side of equation (2), is the lift value.
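Written out from these definitions, the lift value takes the following form. This is a hedged reconstruction based only on the verbal description, since the equation itself does not appear in this text.

```latex
% Hedged reconstruction of the lift value in equation (2).
\mathrm{Lift}(X = A,\, Y = B) = \frac{p(X = A,\, Y = B)}{p(X = A)\, p(Y = B)}
\tag{2}
```

If X and Y were statistically independent, the numerator would equal the denominator and the lift value would be 1. A lift value below 1 means the two values appear together less often than independence would predict.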

The abnormality score shown in equation (2) is the logarithm of the reciprocal of Lift(X=A, Y=B), so that it grows as the lift value becomes smaller. The logarithm of the reciprocal of a number is minus one times the logarithm of that number. This abnormality score is the second feature amount in the present invention. The second calculation means 114 calculates the abnormality score obtained from the lift value as the second feature amount. Lift values are used, for example, on mail-order sites for recommendations that suggest products to customers. Recommendation is a process that, for a customer who has purchased a certain product, refers to the purchasing tendencies of the group of customers who purchased that product and recommends products that are likely to be purchased together with it. In recommendation, the higher the lift value of a combination of products, the more likely the products in that combination are purchased together.

In the present invention, by contrast, the lift value is used in the opposite way. The information processing device 1 exploits the fact that a lower lift value means a statistically rarer combination, and estimates that a transaction record in which such a combination occurs is more likely to be abnormal.

In other words, the second calculation means 114 is an example of a second calculation means that calculates a second feature amount based on the correlation between two qualitative variables that are the variables of interest included in each transaction record.

The second calculation means 114 is also an example of a second calculation means that calculates the second feature amount using a lift value obtained by dividing the proportion of transaction records that contain both values of the two qualitative variables of interest by the product of the proportions of transaction records that contain each of those values.

The integrating means 115 integrates the first and second feature amounts calculated for each combination of variables indicated by the set scenario, and calculates an integrated feature amount. The first and second feature amounts each indicate abnormality individually, but the results are easier to grasp when there is a reference index showing which transaction records deserve attention. The processor 11 of the information processing device 1 therefore functions as the integrating means 115 and calculates an integrated feature amount that combines the first and second feature amounts. The integrated feature amount is given by equation (3).

In equation (3), s is the integrated feature amount and x_n is the n-th transaction record. Therefore s(x_n), the left side of equation (3), is the integrated feature amount for the n-th transaction record.

In equation (3), f is each of the first or second feature amounts calculated on the basis of the scenario (hereinafter also simply called "feature amounts"), and w is the weighting coefficient by which f is multiplied. K is the total number of calculated feature amounts.

That is, the right side of equation (3) is the sum, over the K kinds of feature amounts for the n-th transaction record, of each feature amount multiplied by its weighting coefficient.
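The weighted sum described here can be written as follows. This is a hedged reconstruction from the verbal description; the subscript k on f is an indexing convention assumed for illustration, while w_k matches the notation used later in the text.

```latex
% Hedged reconstruction of equation (3).
% f_k(x_n) : k-th feature amount (first or second) of transaction record x_n
% w_k      : weighting coefficient applied to the k-th feature amount
s(x_n) = \sum_{k=1}^{K} w_k \, f_k(x_n)
\tag{3}
```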

The integrating means 115 performs ensemble learning, in an unsupervised setting, using the K kinds of feature amounts f obtained for the n-th transaction record. The K kinds of feature amounts and the K weighting coefficients corresponding to them are each expressed as a vector with K elements. The vector representation of the n-th transaction record and the corresponding feature vector, whose elements are the K kinds of feature amounts f, are given by equation (4). T in equation (4) denotes transposition.

The individual transaction records in the data are not labeled, so ordinary ensemble learning methods such as boosting and bagging cannot be used. The integrating means 115 therefore determines each weighting coefficient w so that the sum of the squares of the correlation coefficients between the integrated feature amount s and the individual feature amounts f is maximized. The weighting coefficients are determined as follows. First, the integrating means 115 defines the objective function E as in equation (5).

In equation (5), C denotes a correlation coefficient and T denotes transposition. The objective function E(w) in equation (5) is defined as a function of the weighting coefficient vector w that gives the sum of the squares of the correlation coefficients between the integrated feature amount and the individual feature amounts. This objective function E(w) is an example of an index value obtained from the respective correlations between the integrated feature amount and all the first and second feature amounts. Based on equation (5), the integrating means 115 determines the weighting coefficient vector w so that E(w) is maximized.

The correlation coefficient of two variables x and y is the covariance divided by the product of their standard deviations, as given by equation (6). In the equations, a variable enclosed between "<" and ">" denotes the expected value or mean of that variable.

The variances and covariances of the K kinds of calculated feature amounts are given by equation (7). N in equation (7) is the total number of transaction records in the data. In equation (7), F_ij is a variance when i = j and a covariance when i ≠ j.

By the linearity of the expected value, each part of the objective function E(w) in equation (5) is given by equation (8). The parts of E(w) are the variances of the individual feature amounts, the variance of the integrated feature amount, and the covariances between the individual feature amounts and the integrated feature amount.

Here, the weighting coefficient vector w does not depend on the serial number n of the transaction record; it is a K-dimensional vector whose elements are the weighting coefficients w_k corresponding to the K kinds of feature amounts. F is a square matrix of order K whose element in row i and column j is F_ij. That is, F is the variance-covariance matrix of the feature amounts f.

Substituting equation (8) into equation (5) yields equation (9). Λ is a square matrix of order K whose element in row i and column j is Λ_ij. δ_ij is the Kronecker delta, which is 1 when i = j and 0 when i ≠ j.

That is, the problem of finding the weighting coefficient vector w used for the integrated feature amount s becomes the maximization problem shown in equation (10).

The maximization problem shown in equation (10) is equivalent to the eigenvalue problem shown in equation (11).
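Equations (5) to (11) do not appear in this text. One derivation consistent with the verbal definitions is sketched below; it is a hedged reconstruction, and in particular the definition of Λ used here (the diagonal matrix of feature variances) is an assumption, since the source only states that Λ_ij involves the Kronecker delta.

```latex
% Hedged reconstruction; the exact form of equations (5) to (11) may differ.
\begin{align}
C(x, y) &= \frac{\langle xy \rangle - \langle x \rangle \langle y \rangle}
               {\sqrt{\langle x^2 \rangle - \langle x \rangle^2}\,
                \sqrt{\langle y^2 \rangle - \langle y \rangle^2}} \\
F_{ij} &= \frac{1}{N} \sum_{n=1}^{N}
          \bigl(f_i(x_n) - \langle f_i \rangle\bigr)
          \bigl(f_j(x_n) - \langle f_j \rangle\bigr) \\
\operatorname{Var}(f_k) &= F_{kk}, \qquad
\operatorname{Var}(s) = w^{T} F w, \qquad
\operatorname{Cov}(s, f_k) = (F w)_k \\
E(w) &= \sum_{k=1}^{K} C(s, f_k)^2
      = \sum_{k=1}^{K} \frac{(F w)_k^{2}}{F_{kk}\; w^{T} F w}
      = \frac{w^{T} F \Lambda^{-1} F w}{w^{T} F w},
      \qquad \Lambda_{ij} = F_{ii}\,\delta_{ij} \\
F \Lambda^{-1} F w &= \lambda\, F w
\end{align}
```

The last line is the stationarity condition of the ratio in E(w), and at a stationary point λ equals E(w), so the maximum of E(w) is the largest eigenvalue. When F is invertible, the condition simplifies to Λ^{-1} F w = λ w.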

The integrating means 115 therefore solves equation (11) to obtain the eigenvalues and eigenvectors, and calculates the integrated feature amount using the eigenvector corresponding to the largest eigenvalue as the weighting coefficient vector w. The weighting coefficients w_k (k = 1, 2, ..., K), which are the elements of the weighting coefficient vector w calculated here, are examples of coefficients determined so that the index value obtained from the respective correlations between the integrated feature amount and all the first and second feature amounts meets or exceeds a predetermined criterion.
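A minimal Python sketch of this weight determination follows. It assumes that the objective is the sum of squared correlation coefficients as described, that no feature is constant, and that F is invertible, in which case the maximizer is w_k = u_k / sqrt(F_kk), where u is the leading eigenvector of the correlation matrix of the features. Power iteration is used here as a dependency-free stand-in for a general eigen-solver; it is not stated in the source. The function names `covariance_matrix`, `ensemble_weights`, and `objective` are hypothetical.

```python
import math


def covariance_matrix(rows):
    """F[i][j] = (1/N) * sum_n (f_i - <f_i>)(f_j - <f_j>) over the N records."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[i] for r in rows) / n for i in range(k)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / n
             for j in range(k)] for i in range(k)]


def ensemble_weights(rows, iterations=500):
    """Weights maximizing the sum of squared correlations between s and each f_k."""
    F = covariance_matrix(rows)
    k = len(F)
    sd = [math.sqrt(F[i][i]) for i in range(k)]  # assumes no constant feature
    # Correlation matrix R = D^(-1/2) F D^(-1/2), with D = diag(F_kk).
    R = [[F[i][j] / (sd[i] * sd[j]) for j in range(k)] for i in range(k)]
    # Power iteration for the leading eigenvector of R (assumed technique).
    # The uniform start vector must not be orthogonal to that eigenvector.
    u = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        v = [sum(R[i][j] * u[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]
    # Map back from the standardized scale to the original feature scale.
    return [u[i] / sd[i] for i in range(k)]


def objective(rows, w):
    """E(w): sum over k of the squared correlation between s = w.f and f_k."""
    F = covariance_matrix(rows)
    k = len(F)
    Fw = [sum(F[i][j] * w[j] for j in range(k)) for i in range(k)]
    var_s = sum(w[i] * Fw[i] for i in range(k))
    return sum(Fw[i] ** 2 / (F[i][i] * var_s) for i in range(k))
```

An eigenvector is determined only up to sign; the sketch keeps whatever sign the iteration yields, and a practical implementation would need a convention so that a larger integrated feature amount means a more abnormal record.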

The estimating means 116 estimates the possibility that a transaction record is abnormal on the basis of the integrated feature amount calculated by the integrating means 115 from the first and second feature amounts. The estimating means 116 then presents this possibility to the user by displaying the estimation result on the display unit 15.

In other words, the estimating means 116 is an example of an estimating means that estimates the possibility that a transaction record is abnormal on the basis of the first and second feature amounts calculated, using the first calculation means and the second calculation means, for each transaction record and for each combination of variables set by the setting means.

The estimating means 116 may include the function of the integrating means 115. In that case, the estimating means 116 is an example of an estimating means that integrates the first and second feature amounts calculated for each set combination of variables, using a coefficient determined for each combination, to calculate an integrated feature amount, and estimates the possibility that a transaction record is abnormal on the basis of that integrated feature amount.

<Operation of information processing device>
<Overall operation>
FIG. 11 is a flow diagram showing an example of the flow of operations for estimating the possibility that a transaction record is abnormal. The processor 11 of the information processing device 1 receives from the user, via the operation unit 14, the designation of a data ID that identifies the data to be audited. The processor 11 then acquires the data identified by the designated data ID from the memory 12 (step S001).

The processor 11 also receives from the user, via the operation unit 14, an operation selecting the scenario to be applied to the data. The processor 11 sets the scenario indicated by the received operation (step S002).

Once the data to be audited has been acquired and the scenario to be applied to it has been set, the processor 11 calculates, for each transaction record in the data, the first feature amount according to the scenario (step S100). In parallel, the processor 11 calculates, for each transaction record in the data, the second feature amount according to the scenario (step S200). Steps S100 and S200 are described in detail later. Steps S100 and S200 may be performed in parallel as shown in FIG. 11, or they may be performed sequentially.

Once the first and second feature amounts have been calculated, the processor 11 calculates the integrated feature amount from them (step S003). The processor 11 then estimates the possibility that each transaction record is abnormal on the basis of the calculated integrated feature amount (step S004), and presents the estimation result (step S005).

<Operation of calculating first feature amount>
FIG. 12 is a flow diagram showing an example of the flow of operations for calculating the first feature amount. This operation is the processing of step S100 described above. The processor 11 classifies the data into groups for each qualitative variable using the statistical method described above (step S101).

Next, the processor 11 calculates the mean of the quantitative variable designated by the scenario within each group (step S102), and uses the conditional probability to calculate the degree of abnormality of each transaction record as the first feature amount (step S103).
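Steps S102 and S103 can be sketched in Python as follows. The sketch assumes that the grouping of step S101 is already available as a mapping `group_of`, and it assumes, purely for illustration, that the quantitative variable is normally distributed within each group, so that p is taken as the two-sided tail probability of the observed value. The source does not specify the distribution here, and the function name `first_feature_scores` is hypothetical.

```python
import math
from collections import defaultdict
from statistics import fmean, pstdev


def first_feature_scores(records, group_of):
    """Return -log p for each record.

    records  : list of (qualitative_value, amount) pairs, one per transaction
    group_of : dict mapping each qualitative value to its group label (S101 result)
    """
    amounts_by_group = defaultdict(list)
    for category, amount in records:
        amounts_by_group[group_of[category]].append(amount)

    # Step S102: mean (and spread) of the quantitative variable within each group.
    stats = {g: (fmean(v), pstdev(v)) for g, v in amounts_by_group.items()}

    scores = []
    for category, amount in records:
        mean, std = stats[group_of[category]]
        z = abs(amount - mean) / std if std > 0 else 0.0
        # Assumed model: two-sided normal tail probability, floored to avoid log(0).
        p = max(math.erfc(z / math.sqrt(2.0)), 1e-300)
        # Step S103: the score is the logarithm of the reciprocal of p.
        scores.append(-math.log(p))
    return scores
```

Because each record is compared only with its own group, an amount that is ordinary for one group does not mask or distort the score of a record in another group.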

<Operation of calculating second feature amount>
FIG. 13 is a flow diagram showing an example of the flow of operations for calculating the second feature amount. This operation is the processing of step S200 described above. The processor 11 counts the occurrences of each combination of the two qualitative variables designated by the scenario selected by the user (step S201). The processor 11 then calculates the ratio of the counted occurrences of each combination to the total number of records as the probability of simultaneous occurrence (step S202).

The processor 11 also calculates the occurrence probability of each of the two qualitative variables in the combination (step S203). The processor 11 then calculates the second feature amount using the lift value obtained by dividing the probability of simultaneous occurrence calculated in step S202 by the product of the occurrence probabilities of the two qualitative variables calculated in step S203 (step S204).
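Steps S201 to S204 can be sketched in Python as follows. The function name `second_feature_scores` and the input layout (one pair of qualitative values per transaction record) are assumptions made for illustration.

```python
import math
from collections import Counter


def second_feature_scores(pairs):
    """Return -log(lift) for each record.

    pairs : list of (x, y) qualitative values, one pair per transaction record,
            for example (department, person in charge)
    """
    n = len(pairs)
    joint_counts = Counter(pairs)               # step S201: count each combination
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)

    scores = []
    for x, y in pairs:
        p_joint = joint_counts[(x, y)] / n      # step S202: simultaneous occurrence
        p_x = x_counts[x] / n                   # step S203: individual probabilities
        p_y = y_counts[y] / n
        lift = p_joint / (p_x * p_y)            # step S204: lift value
        scores.append(-math.log(lift))          # logarithm of the reciprocal of lift
    return scores
```

A combination that occurs less often than independence would predict has a lift value below 1 and therefore a positive score, while a combination that occurs more often than predicted has a negative score.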

Through the operations described above, the information processing device 1 regards, among data containing the history of a plurality of transaction records, any history that deviates greatly from the normal pattern as abnormal, that is, as a candidate for fraud, and raises an alert. The information processing device 1 can thus identify a suspicious history, such as "shopping in Brazil two hours after shopping in Japan," together with its cause.

In short, the information processing device 1 uses the integrated feature amount, which combines the first and second feature amounts, to estimate the possibility that each transaction record in the data is abnormal, in association with a scenario. A transaction record estimated to be abnormal is associated with the scenario that gave rise to that estimate, and the scenario designates the qualitative and quantitative variables. Therefore, using unsupervised learning, the information processing device 1 can detect abnormal transactions (operations) in audit data such as sales data, journal entry slips (journals), and inventory data on a per-transaction basis, and can also present the cause of the abnormality.
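The association between a flagged record and its cause can be illustrated with a short Python sketch. It assumes that the integrated feature amount is the weighted sum of equation (3), and it assumes, as one possible reading that the source does not spell out, that the scenario with the largest weighted contribution is reported as the cause. The function name `rank_records` is hypothetical.

```python
def rank_records(features, weights, top=3):
    """Rank transaction records by integrated feature amount.

    features : list of per-record lists, each holding K feature amounts
    weights  : list of K weighting coefficients
    Returns up to `top` tuples (score, record_index, cause_index), highest first.
    """
    ranked = []
    for n, f in enumerate(features):
        contributions = [w * v for w, v in zip(weights, f)]
        score = sum(contributions)  # integrated feature amount s(x_n)
        # Assumed rule: the feature with the largest contribution is the cause.
        cause = max(range(len(f)), key=lambda k: contributions[k])
        ranked.append((score, n, cause))
    ranked.sort(reverse=True)
    return ranked[:top]
```

Since each feature amount is tied to one scenario, `cause_index` points back to the pair of variables that the scenario designates.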

<Modified examples>
The embodiment has been described above, but its content may be modified as follows. The following modifications may also be combined with one another.

<1>
In the embodiment described above, the information processing device 1 has the processor 11 configured as a CPU, but the control means that controls the information processing device 1 may have another configuration.

That is, besides a CPU, the information processing device 1 may have, as the processor 11, any of various processors such as a GPU (Graphics Processing Unit), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or a programmable logic device.

<2>
The operations of the processor in the embodiment described above need not be performed by a single processor; they may be performed by a plurality of processors at physically separate locations working together.

The order of the operations of the processor is not limited to the order described in the embodiment above and may be changed as appropriate.

<3>
In the embodiment described above, the scenario used to calculate the first and second feature amounts is selected by the user, but it may instead be set according to the type of data.

For example, the processor 11 of the information processing device 1 acquires, from the data ID list 1211 of the transaction DB 121, a type ID indicating the type of the data designated by the user. The processor 11 then sets, in the setting DB 123, the numbers or the like of one or more scenarios associated in advance with the acquired type ID. The scenarios indicated by the set numbers or the like then specify the combinations of variables of interest in the designated data.

In this case, the information processing device 1 is an example of an information processing device that has an acquisition means for acquiring the type of data and that sets, from among a plurality of predetermined combinations, a combination according to the type of data as the combination of variables of interest.

<4>
In the embodiment described above, the program executed by the processor 11 of the information processing device 1 is an example of a program for causing a computer to function as: a setting means for setting a combination of variables of interest in data including a plurality of transaction records; a first calculation means for calculating a first feature amount based on the correlation between a quantitative variable and a qualitative variable included in each transaction record; a second calculation means for calculating a second feature amount based on the correlation between two qualitative variables included in each transaction record; and an estimating means for estimating the possibility that a transaction record is abnormal on the basis of the first and second feature amounts calculated, using the first calculation means and the second calculation means, for each transaction record and for each combination of variables set by the setting means.

This program may be provided stored on a computer-readable recording medium such as a magnetic recording medium (for example, a magnetic tape or a magnetic disk), an optical recording medium (for example, an optical disk), a magneto-optical recording medium, or a semiconductor memory. The program may also be downloaded via a communication line such as the Internet.

DESCRIPTION OF SYMBOLS 1...information processing device, 11...processor, 111...acquisition means, 112...setting means, 113...first calculation means, 114...second calculation means, 115...integrating means, 116...estimating means, 12...memory, 121...transaction DB, 1211...data ID list, 1212...transaction table, 122...scenario DB, 123...setting DB, 1231...data ID list, 1232...scenario number list, 13...communication unit, 14...operation unit, 15...display unit.

Claims

A program for causing a computer to function as:
a setting means for setting a combination of variables of interest in data including a plurality of transaction records;
a first calculation means for calculating a first feature amount based on a correlation between a quantitative variable and a qualitative variable that are the variables of interest included in each transaction record;
a second calculation means for calculating a second feature amount based on a correlation between two qualitative variables that are the variables of interest included in each transaction record; and
an estimating means for estimating the possibility that the transaction record is abnormal on the basis of the first feature amount and the second feature amount calculated, using the first calculation means and the second calculation means, for each transaction record and for each combination of variables set by the setting means.

The program according to claim 1, wherein the estimating means calculates an integrated feature amount by integrating the first feature amount and the second feature amount calculated for each of the set combinations, using a coefficient determined for each combination, and estimates the possibility on the basis of the integrated feature amount, and
wherein the coefficient is determined such that an index value obtained from the respective correlations between the integrated feature amount and all of the first feature amounts and second feature amounts satisfies a predetermined criterion.

The program according to claim 1, wherein the first calculation means classifies the data into a plurality of groups for each of the qualitative variables according to the similarity of trends in the quantitative variable included in the transaction records, and calculates the statistical rarity of each transaction record within its group as the first feature amount.

The program according to claim 1, wherein the setting means sets, as the combination of variables of interest, a combination selected by a user from among a plurality of predetermined combinations.

The program according to claim 1, further causing the computer to function as an acquisition means for acquiring the type of the data,
wherein the setting means sets, as the combination of variables of interest, a combination corresponding to the type of the data from among a plurality of predetermined combinations.

The program according to claim 1, wherein the qualitative variable is identification information of a department that made the transaction indicated by the transaction record.

The program according to claim 1, wherein the quantitative variable is a transaction amount indicated by the transaction record.

The program according to claim 1, wherein the first calculation means calculates, as the first feature amount, an amount determined from the proportion of the quantitative variable under the condition that the qualitative variable is included in the transaction record.

The program according to claim 1, wherein the second calculation means calculates the second feature amount using a lift value obtained by dividing the proportion of transaction records that include the respective values of the two qualitative variables by the product of the proportions of transaction records that include each of those values.
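The lift value in this claim can be illustrated with a short sketch. It computes, for each record, the joint proportion of its two qualitative values divided by the product of their individual proportions. The function name `lift_values`, the field names `department` and `account`, and the sample records are hypothetical; how the lift is then turned into the second feature amount is not shown here.

```python
from collections import Counter


def lift_values(records, var_a, var_b):
    """Return one lift value per record for the variable pair (var_a, var_b)."""
    n = len(records)
    count_a = Counter(r[var_a] for r in records)
    count_b = Counter(r[var_b] for r in records)
    count_ab = Counter((r[var_a], r[var_b]) for r in records)

    lifts = []
    for r in records:
        a, b = r[var_a], r[var_b]
        p_ab = count_ab[(a, b)] / n  # proportion of records with both values
        p_a = count_a[a] / n         # proportion of records with value a
        p_b = count_b[b] / n         # proportion of records with value b
        lifts.append(p_ab / (p_a * p_b))
    return lifts


records = [
    {"department": "sales", "account": "revenue"},
    {"department": "sales", "account": "revenue"},
    {"department": "sales", "account": "revenue"},
    {"department": "hr", "account": "payroll"},
    {"department": "hr", "account": "payroll"},
    {"department": "hr", "account": "revenue"},
]
lifts = lift_values(records, "department", "account")
```

In this data the pair (hr, revenue) has a lift of about 0.5, meaning it occurs half as often as it would if the two variables were independent, while (hr, payroll) has a lift of about 2.0.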

An information processing device comprising:
a setting means for setting a combination of variables of interest in data including a plurality of transaction records;
a first calculation means for calculating a first feature amount based on a correlation between a quantitative variable and a qualitative variable, which are the variables of interest included in each transaction record;
a second calculation means for calculating a second feature amount based on a correlation between two qualitative variables, which are the variables of interest included in each transaction record; and
an estimating means for estimating, for each transaction record, the possibility that the transaction record is abnormal based on the first feature amount and the second feature amount, each calculated using the first calculation means and the second calculation means for each combination of variables set by the setting means.