JP7146218B1

JP7146218B1 - Information processing device, information processing method and program

Info

Publication number: JP7146218B1
Application number: JP2021210720A
Authority: JP
Inventors: 大介宮川; 哲平宇宿; 公介須崎; 聡近藤; 研吾白木; 貴央眞田; 優希柳岡
Original assignee: KPMG AZSA LLC
Current assignee: KPMG AZSA LLC
Priority date: 2021-12-24
Filing date: 2021-12-24
Publication date: 2022-10-04
Anticipated expiration: 2041-12-24
Also published as: JP2023095063A

Abstract

[Problem] To evaluate the risk of fraud involving multiple companies. [Solution] A clustering processing unit 1 reads learning data that includes each company's financial statements, attribute information, and information indicating whether the company has committed fraud, together with information indicating business relationships between companies, and clusters the data so that the number of nodes in any one cluster is at most a predetermined value, thereby obtaining a network structure. A cluster feature amount calculation unit 2 calculates the feature amount of each node belonging to each cluster in the clustered data and, from those node feature amounts, calculates the feature amount of each cluster. A fraud flag assigning unit 3 assigns a fraud flag to each cluster based on the information indicating whether the nodes belonging to that cluster have committed fraud. A model construction unit 4 obtains a trained model by supervised learning on the flagged data, with the feature amounts as explanatory variables and the fraud flag as the objective variable. [Selected drawing] FIG. 8

Description

The present invention relates to an information processing device, an information processing method, and a program.

Attempts to detect anomalies in accounting audits using AI technologies such as machine learning and deep learning are progressing. For example, methods have been proposed that focus on the account items of an individual company and detect anomalies in the account item values themselves. One proposed technique assumes a VAR (Vector Auto-Regression) structure for the fluctuations of multiple account items, applies sparsification by LASSO (Least Absolute Shrinkage and Selection Operator), and detects the journal entries that cause an abnormal account item (Patent Document 1). In the financial analysis device according to this technique, a first vector generation unit generates a first vector whose elements are the fluctuation values of the account items within a first period of the accounting data. An estimation unit estimates the fluctuation values of the account items within a first period based on the first vectors within a second period that contains a plurality of first periods. A residual detection unit detects the residual between each estimated fluctuation value and the actual fluctuation value. An abnormality candidate identification unit extracts the fluctuation value of a specific account item in a specific first period for which a value correlated with the residual exceeds a threshold. A journal entry limiting unit generates a second matrix in which second vectors, whose elements are the fluctuation values of the account items of each journal entry within the specific first period, are arranged in the row direction. A journal entry extraction unit extracts from the second matrix the journal entries that contain an account item for which the value correlated with the residual exceeds the threshold. An abnormal journal entry extraction unit detects anomalies in the extracted journal entries and extracts the journal entries in which an anomaly is detected.

JP 2019-67086 A

V. A. Traag, L. Waltman, N. J. van Eck, "From Louvain to Leiden: guaranteeing well-connected communities", 26 March 2019, Nature, Scientific Reports 9, Article Number 5233 (2019), retrieved September 29, 2021, <URL: https://www.nature.com/articles/s41598-019-41695-z.pdf>
Teppei Usuku, Satoshi Kondo, Kengo Shiraki, Daisuke Miyagawa, Yuki Yanagioka, "Predicting Fraudulent Accounting Using Machine Learning Techniques: A Study Using Unlisted Company Data", June 2021, Hitotsubashi University Graduate School, Financial Strategy and Management Finance Program, Working Paper Series, FS-2021-J-001, retrieved September 29, 2021, <URL: https://www.fs.hub.hit-u.ac.jp/inc/files/staff-research/workingpaper/FS-2021-J-001.pdf>

The technique of Patent Document 1 can detect anomalies in individual account items. However, it is in principle unable to detect abnormal behavior that involves multiple companies, for example transaction fraud such as circular transactions in which multiple companies take part.

Therefore, to detect transaction fraud involving multiple companies, a fraud detection method that also takes the business relationships between companies into account is needed.

The present invention has been made in view of the above circumstances, and its object is to evaluate the risk of transaction fraud involving multiple companies.

An information processing apparatus according to one embodiment includes: a clustering processing unit that reads learning data in which the information of each company is represented as a multidimensional vector and which includes a plurality of variables indicating the values of a plurality of account items contained in each company's financial statements, attribute information of each company, information indicating whether each company has committed fraud, and information indicating business relationships between companies, clusters the data so that the number of nodes in any one cluster is at most a predetermined value, and obtains a network structure composed of the nodes corresponding to each cluster and edges indicating the business relationships between nodes; a feature amount calculation unit that calculates the feature amount of each node belonging to each cluster in the clustered data and calculates the feature amount of each cluster from the calculated node feature amounts; a fraud flag assigning unit that assigns a fraud flag to each cluster based on the information indicating whether the nodes belonging to that cluster have committed fraud; and a model construction unit that obtains a trained model by supervised learning on the data to which the fraud flags have been assigned, with the feature amounts as explanatory variables and the fraud flag as the objective variable.

In an information processing method according to one embodiment, a clustering processing unit reads learning data in which the information of each company is represented as a multidimensional vector and which includes a plurality of variables indicating the values of a plurality of account items contained in each company's financial statements, attribute information of each company, information indicating whether each company has committed fraud, and information indicating business relationships between companies, clusters the data so that the number of nodes in any one cluster is at most a predetermined value, and obtains a network structure composed of the nodes corresponding to each cluster and edges indicating the business relationships between nodes; a feature amount calculation unit calculates the feature amount of each node belonging to each cluster in the clustered data and calculates the feature amount of each cluster from the calculated node feature amounts; a fraud flag assigning unit assigns a fraud flag to each cluster based on the information indicating whether the nodes belonging to that cluster have committed fraud; and a model construction unit obtains a trained model by supervised learning on the data to which the fraud flags have been assigned, with the feature amounts as explanatory variables and the fraud flag as the objective variable.

A program according to one embodiment causes a computer to execute: a process of reading learning data in which the information of each company is represented as a multidimensional vector and which includes a plurality of variables indicating the values of a plurality of account items contained in each company's financial statements, attribute information of each company, information indicating whether each company has committed fraud, and information indicating business relationships between companies, clustering the data so that the number of nodes in any one cluster is at most a predetermined value, and obtaining a network structure composed of the nodes corresponding to each cluster and edges indicating the business relationships between nodes; a process in which a feature amount calculation unit calculates the feature amount of each node belonging to each cluster in the clustered data and calculates the feature amount of each cluster from the calculated node feature amounts; a process in which a fraud flag assigning unit assigns a fraud flag to each cluster based on the information indicating whether the nodes belonging to that cluster have committed fraud; and a process in which a model construction unit obtains a trained model by supervised learning on the data to which the fraud flags have been assigned, with the feature amounts as explanatory variables and the fraud flag as the objective variable.

An information processing apparatus according to one embodiment includes: a model storage unit that holds a trained model obtained by a model construction device; and an estimation processing unit. The model construction device includes: a clustering processing unit that reads learning data in which the information of each company is represented as a multidimensional vector and which includes a plurality of variables indicating the values of a plurality of account items contained in each company's financial statements, attribute information of each company, information indicating whether each company has committed fraud, and information indicating business relationships between companies, clusters the data so that the number of nodes in any one cluster is at most a predetermined value, and obtains a network structure composed of the nodes corresponding to each cluster and edges indicating the business relationships between nodes; a feature amount calculation unit that calculates the feature amount of each node belonging to each cluster in the clustered data and calculates the feature amount of each cluster from the calculated node feature amounts; a fraud flag assigning unit that assigns a fraud flag to each cluster based on the information indicating whether the nodes belonging to that cluster have committed fraud; and a model construction unit that obtains the trained model by supervised learning on the data to which the fraud flags have been assigned, with the feature amounts as explanatory variables and the fraud flag as the objective variable. The estimation processing unit reads input data in which the information of each company is represented as a multidimensional vector and which includes a plurality of variables indicating the values of a plurality of account items contained in each company's financial statements, attribute information of each company, and information indicating business relationships between companies, inputs the input data into the trained model, and estimates whether the companies corresponding to the input data have committed fraud.

According to the present invention, the risk of transaction fraud involving multiple companies can be evaluated.

FIG. 1 is a diagram showing an example of a system configuration for realizing the accounting information processing apparatus according to Embodiment 1.
FIG. 2 is a diagram schematically showing the basic configuration of the company data according to the present embodiment.
FIG. 3 is a diagram showing the company data according to the present embodiment in tabular form.
FIG. 4 is a diagram showing a table of edge information and an example of nodes and edges based on the edge information.
FIG. 5 is a diagram showing a first example of fraud history information.
FIG. 6 is a diagram showing a second example of fraud history information.
FIG. 7 is a diagram showing a third example of fraud history information.
FIG. 8 is a diagram schematically showing the configuration of the information processing apparatus according to Embodiment 1.
FIG. 9 is a diagram schematically showing a modification of the information processing apparatus 100 according to Embodiment 1.
FIG. 10 is a flowchart of processing in the information processing apparatus according to Embodiment 1.
FIG. 11 is a diagram showing an example of the feature amounts of each node in tabular form.
FIG. 12 is a diagram showing an example of a network used for calculating feature amounts.
FIG. 13 is a diagram showing a network used to illustrate an adjacency matrix.
FIG. 14 is a diagram showing an example of a local network.
FIG. 15 is a diagram showing an example of the feature amounts of each cluster in tabular form.
FIG. 16 is a diagram showing an example of an ROC curve and AUC.
FIG. 17 is a diagram showing the adoption rate of each feature amount in evaluation case A.
FIG. 18 is a diagram showing the AUC obtained with the learning data and the AUC obtained with the test data in evaluation case A.
FIG. 19 is a diagram showing the adoption rate of each feature amount in evaluation case B.
FIG. 20 is a diagram showing the AUC obtained with the learning data and the AUC obtained with the test data in evaluation case B.
FIG. 21 is a diagram showing the adoption rate of each feature amount in evaluation case C.
FIG. 22 is a diagram showing the AUC obtained with the learning data and the AUC obtained with the test data in evaluation case C.
FIG. 23 is a diagram showing the relationship between the maximum number of nodes belonging to a cluster and the AUC.
FIG. 24 is a diagram schematically showing the configuration of the information processing apparatus according to Embodiment 2.
FIG. 25 is a diagram showing a modification of the information processing apparatus according to Embodiment 2.
FIG. 26 is a diagram showing a modification of the information processing apparatus according to Embodiment 2.

Embodiments of the present invention are described below with reference to the drawings. In each drawing, the same elements are denoted by the same reference numerals, and redundant description is omitted as necessary.

Embodiment 1
An information processing apparatus 100 according to Embodiment 1 will be described. The information processing apparatus 100 is configured to detect abnormal transactions (in other words, transaction fraud) carried out between multiple companies, using the account items contained in the financial statements of individual companies, transaction information, and the like.

FIG. 1 shows an example of a system configuration for realizing the information processing apparatus 100 according to Embodiment 1. The information processing apparatus 100 can be realized by a computer 110 such as a dedicated computer or a personal computer (PC). The computer need not be a single physical machine; multiple computers may be used when distributed processing is performed. As shown in FIG. 1, the computer 110 has a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, and a RAM (Random Access Memory) 13, which are interconnected via a bus 14. The OS software and the like needed to operate a computer are not described here, but the computer that constitutes this accounting information processing apparatus naturally has them as well.

An input/output interface 15 is connected to the bus 14. An input unit 16, an output unit 17, a communication unit 18, and a storage unit 19 are connected to the input/output interface 15.

The input unit 16 consists of, for example, a keyboard, a mouse, and sensors. The output unit 17 consists of, for example, a display device such as an LCD and audio output devices such as headphones and speakers. The communication unit 18 consists of, for example, a router or a terminal adapter. The storage unit 19 consists of a storage device such as a hard disk or flash memory.

The CPU 11 can perform various processes according to programs stored in the ROM 12 or programs loaded from the storage unit 19 into the RAM 13. In the present embodiment, the CPU 11 executes, for example, the processing of each unit of the information processing apparatus 100 described later. A GPU (Graphics Processing Unit) may be provided separately from the CPU 11 and, like the CPU 11, may perform various processes according to programs stored in the ROM 12 or programs loaded from the storage unit 19 into the RAM 13; in the present embodiment these include, for example, the processing of each unit of the information processing apparatus 100 described later. A GPU is suited to performing routine processing in parallel, so applying it to, for example, the neural network processing described later can improve processing speed compared with the CPU 11. The RAM 13 also stores, as appropriate, the data that the CPU 11 and the GPU need to execute the various processes.

The communication unit 18 can communicate bidirectionally with a server 40 via a network 30. The communication unit 18 can transmit data provided by the CPU 11 to the server 40 and can output data received from the server 40 to the CPU 11, the RAM 13, the storage unit 19, and so on. The communication unit 18 may communicate with other devices using analog or digital signals. The storage unit 19 can exchange data with the CPU 11 and stores and erases information.

A drive 20 may be connected to the input/output interface 15 as necessary. A storage medium such as a magnetic disk 21, an optical disk 22, a flexible disk 23, or a semiconductor memory 24 can be loaded into the drive 20 as appropriate. A computer program read from a storage medium may be installed in the storage unit 19 as necessary. In addition, data that the CPU 11 needs to execute the various processes, and data obtained as a result of processing by the CPU 11, may be stored on a storage medium as necessary.

Next, the format of the learning data used in the present embodiment will be described. The learning data DAT according to the present embodiment is configured as a tabular data set that associates various variables, including at least company identification information for identifying a company (for example, an ID number) and the data values of a plurality of account items. Data values here include both numerical data and character data (text data). FIG. 2 schematically shows the basic configuration of the learning data DAT according to the present embodiment. The learning data DAT is configured as a combination of company data CPR indicating the accounting information and the like of each company, edge information EDG indicating the business relationships between companies, and fraud history information UDH indicating whether each company has committed fraud.
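The combination of CPR, EDG, and UDH into the learning data DAT can be pictured with a minimal sketch. It assumes that the three sources are joined on the company ID, that a company with no fraud history entry is treated as non-fraudulent, and that edges pointing at companies missing from CPR are dropped; none of these rules is stated in the text. The function name `build_learning_data` and the field names `company_id`, `period`, `sales`, `receivables`, and `fraud` are hypothetical.

```python
def build_learning_data(cpr_rows, edg_rows, udh_rows):
    """Join CPR and UDH rows on company_id and attach the EDG edge list.

    Assumption: a company without a UDH entry gets fraud = 0.
    """
    fraud_by_company = {row["company_id"]: row["fraud"] for row in udh_rows}

    records = []
    for row in cpr_rows:
        merged = dict(row)  # copy so the caller's CPR rows stay untouched
        merged["fraud"] = fraud_by_company.get(row["company_id"], 0)
        records.append(merged)

    # Assumption: keep only edges whose two endpoints both appear in CPR.
    known = {row["company_id"] for row in cpr_rows}
    edges = [(src, dst) for src, dst in edg_rows if src in known and dst in known]
    return {"records": records, "edges": edges}


# Hypothetical toy inputs: one company may have several accounting periods.
cpr = [
    {"company_id": "C1", "period": "FY2020", "sales": 120.0, "receivables": 30.0},
    {"company_id": "C1", "period": "FY2021", "sales": 150.0, "receivables": 55.0},
    {"company_id": "C2", "period": "FY2021", "sales": 80.0, "receivables": 10.0},
]
edg = [("C1", "C2"), ("C2", "C3")]
udh = [{"company_id": "C1", "fraud": 1}]

dat = build_learning_data(cpr, edg, udh)
```

In this toy input the edge to `C3` is discarded because `C3` has no CPR record; a real data set might instead keep such edges and impute the missing company.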

The company data CPR will be described. The company data CPR is configured by linking, to the ID of each company, the information contained in its financial statements and information indicating its attributes (for example, its industry type or place of business). FIG. 3 shows an example of the company data CPR according to the present embodiment in tabular form. The fields associated with one record of the company data CPR, that is, the columns of the table, hold the company identification information (company ID), accounting identification information, and the data values of a plurality of account items. Naturally, multiple records are arranged in the row direction of the table. As shown in FIG. 3, one piece of company identification information can be combined with several pieces of accounting identification information, so the company data CPR can contain multiple records for a single company.

The company identification information may be text data such as a company name, or numerical data such as an identification number (company ID). FIGS. 2 and 3 use the company ID as the company identification information. The company identification information may also include other variables as necessary, such as a variable indicating the company's industry type.

The variables include the items of the balance sheet, the income statement, and the cash flow statement. The variables may also include information other than the items of these three statements.

In other words, the record of each company is described as a multidimensional vector whose components are the various data values, and the company data is configured to contain a plurality of nodes, each representing a company described by such a multidimensional vector.
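As a rough illustration of describing a record as a multidimensional vector, the sketch below maps one tabular record to a fixed-order list of numbers. The account item names in `FIELD_ORDER` and the rule of filling a missing item with 0.0 are assumptions made for the example, and text-valued fields are simply left out.

```python
# Hypothetical account items; the order fixes the meaning of each vector component.
FIELD_ORDER = ["sales", "receivables", "inventory"]


def record_to_vector(record, field_order=FIELD_ORDER):
    """Return the record's numeric data values as a fixed-order vector.

    Assumption: an account item missing from the record is encoded as 0.0.
    """
    return [float(record.get(name, 0.0)) for name in field_order]


vector = record_to_vector({"company_id": "C1", "sales": 120, "inventory": 5})
```

Every node then carries a vector of the same length, which is what lets node-level values be compared and aggregated later.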

Next, the edge information EDG will be described. Among the companies included in the company data CPR according to the present embodiment, any two companies that have a business relationship are connected by an edge. The learning data DAT used in the learning process described below therefore includes edge information EDG that indicates these edges.

An example of the edge information EDG will be described. FIG. 4 shows a table of edge information and an example of the nodes and edges based on it. In this example, there are edges from node N1 to nodes N2 to N4, edges from node N2 to nodes N1 and N3, and an edge from node N3 to node N4. From this table, the network can be constructed either as a directed graph that reflects the direction of each edge or as an undirected graph that ignores edge direction.
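The edge list just described can be turned into either kind of graph with a few lines of code. The following is a minimal sketch using adjacency sets; the edge table is the one given in the text, while the function names and the set-based representation are illustrative choices rather than anything specified by the patent.

```python
# Edge table from the example: (source node, destination node).
EDGE_TABLE = [
    ("N1", "N2"), ("N1", "N3"), ("N1", "N4"),
    ("N2", "N1"), ("N2", "N3"),
    ("N3", "N4"),
]


def directed_adjacency(edge_table):
    """Map each node to the set of nodes it points to."""
    adj = {}
    for src, dst in edge_table:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set())  # make sure nodes with no outgoing edge appear
    return adj


def undirected_adjacency(edge_table):
    """Map each node to the set of its neighbours, ignoring edge direction."""
    adj = {}
    for src, dst in edge_table:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set()).add(src)
    return adj


directed = directed_adjacency(EDGE_TABLE)
undirected = undirected_adjacency(EDGE_TABLE)
```

In the directed view N4 has no outgoing edges, while in the undirected view it is adjacent to N1 and N3; the pair N1 to N2 and N2 to N1 collapses into a single undirected link.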

In the present embodiment, the learning data DAT also includes fraud history information UDH indicating that fraud has occurred at a company. In the fraud history information UDH, the information indicating fraud may be aggregated into a specific column. The aggregated information may be expressed as text, such as the name of the fraud type or a sentence describing the fraud, or in various other forms, such as numerical data ("1" or "0") or a Boolean variable.

For example, the fraud history information UDH may indicate the presence or absence of fraud for each account item. FIG. 5 shows a first example of the fraud history information UDH. In the first example, the fraud history information UDH is configured by assigning to each account item a value of "1" if fraud occurred and "0" if it did not.

As another example, the fraud history information UDH may express fraud as text data. FIG. 6 shows a second example of the fraud history information UDH. In the second example, the fraud history information UDH is configured by storing text data indicating the type of fraud in a "fraud type" column. In this example, when no fraud occurred, the "fraud type" column is left blank or holds no data.

As a further example, a column may be provided for each fraud type name, and the presence or absence of fraud may be expressed in some identifiable form such as "1" or "0". FIG. 7 shows a third example of the fraud history information UDH. In the third example, the fraud history information UDH is configured by providing a column for each type of fraud and assigning to each column a value of "1" if the corresponding fraud occurred and "0" if it did not.
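The three layouts of the fraud history information can all be reduced to a single per-record indicator, which may be convenient when a later step only needs to know whether any fraud occurred. The sketch below is one way to do that under stated assumptions: the collapse to a single 0/1 value is an illustrative choice, and the column names (`sales_fraud`, `inventory_fraud`, `fraud_type`, `circular_transaction`, `fictitious_sales`) are hypothetical.

```python
def flag_from_indicator_columns(row, indicator_columns):
    """First and third layouts: 1 if any of the listed 0/1 columns is set."""
    return int(any(int(row.get(col, 0)) == 1 for col in indicator_columns))


def flag_from_text(row, text_column="fraud_type"):
    """Second layout: 1 if the text column holds a non-blank description."""
    value = row.get(text_column)
    return int(value is not None and str(value).strip() != "")


# First layout: one 0/1 column per account item.
first = {"company_id": "C1", "sales_fraud": 1, "inventory_fraud": 0}
# Second layout: a free-text fraud type, blank or missing when there is no fraud.
second_clean = {"company_id": "C2", "fraud_type": ""}
second_fraud = {"company_id": "C3", "fraud_type": "circular transaction"}
# Third layout: one 0/1 column per fraud type.
third = {"company_id": "C4", "circular_transaction": 0, "fictitious_sales": 0}

flags = [
    flag_from_indicator_columns(first, ["sales_fraud", "inventory_fraud"]),
    flag_from_text(second_clean),
    flag_from_text(second_fraud),
    flag_from_indicator_columns(third, ["circular_transaction", "fictitious_sales"]),
]
```

Collapsing discards which account item or fraud type was involved, so a pipeline that needs that detail would keep the original columns alongside the indicator.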

The learning data DAT, and the company data CPR, edge information EDG, and fraud history information UDH on which it is based, may be stored, for example, in the storage unit 19 of FIG. 1. These data may also be supplied from the server 40 via the network 30 and the communication unit 18, or from various storage media via the drive 20.

Next, the configuration and processing of the information processing apparatus 100 will be described. In the present embodiment, the information processing apparatus 100 is configured to use the learning data DAT described above to estimate whether transaction fraud involving two or more companies is present within a group (cluster) containing a plurality of companies.

図８に、実施の形態１にかかる情報処理装置１００の構成を模式的に示す。情報処理装置１００は、ハードウェア上では、各処理は実際にはソフトウェアと上記ＣＰＵ１１などのハードウェア資源とが協働することで実現される。情報処理装置１００は、取引不正の推定のための学習済みモデルを作成する処理を実現するために、少なくともクラスタリング処理部１、クラスタ特徴量算出部２、不正フラグ付与部３及びモデル構築部４を有する。 FIG. 8 schematically shows the configuration of the information processing apparatus 100 according to the first embodiment. In the information processing apparatus 100, each process is in practice realized by software working together with hardware resources such as the CPU 11. To create a trained model for estimating fraudulent transactions, the information processing apparatus 100 includes at least a clustering processing unit 1, a cluster feature amount calculation unit 2, a fraud flag assigning unit 3, and a model construction unit 4.

以下で説明するモデルの構築処理は、クラスタリング処理部１、クラスタ特徴量算出部２、不正フラグ付与部３及びモデル構築部４を有する情報処理装置１００で実行することができるが、構築したモデルを評価するための構成を付加してもよい。図９に、情報処理装置１００の変形例である、情報処理装置１０１の構成を模式的に示す。情報処理装置１０１は、情報処理装置１００にテスト処理部５を追加した構成を有する。 The model construction process described below can be executed by the information processing apparatus 100 having the clustering processing unit 1, the cluster feature amount calculation unit 2, the fraud flag assigning unit 3, and the model construction unit 4, but a configuration for evaluating the constructed model may be added. FIG. 9 schematically shows the configuration of an information processing apparatus 101, which is a modification of the information processing apparatus 100. The information processing apparatus 101 has a configuration in which a test processing unit 5 is added to the information processing apparatus 100.

以下、モデル構築処理とテスト処理について説明する。図１０に、実施の形態１にかかる情報処理装置１００における処理のフローチャートを示す。情報処理装置１００は、図１０に示すステップＳ１～Ｓ５の処理を実行することで、取引不正を検知する学習済みモデルの作成及び評価を行う。 The model construction process and the test process will be described below. FIG. 10 shows a flowchart of processing in the information processing apparatus 100 according to the first embodiment. By executing steps S1 to S5 shown in FIG. 10, the information processing apparatus 100 creates and evaluates a trained model for detecting fraudulent transactions.

ステップＳ１：クラスタリング処理
クラスタリング処理部１は、企業データを取り込み、教師なし学習によるクラスタリングを行う。ステップＳ１は、以下のステップＳ１１～Ｓ１３を含む。 Step S1: Clustering Processing The clustering processing section 1 takes in corporate data and performs clustering by unsupervised learning. Step S1 includes steps S11 to S13 below.

ステップＳ１１：学習用データＤＡＴの読み込み
クラスタリング処理部１は、学習用データＤＡＴを読み込む。学習用データＤＡＴは、情報処理装置１００のオペレータが入力手段または通信手段を介して与えてもよいし、記憶装置（例えば、図１の記憶部１９）に予め格納されていてもよい。 Step S11: Read data DAT for learning The clustering processing unit 1 reads data DAT for learning. The learning data DAT may be provided by an operator of the information processing apparatus 100 via input means or communication means, or may be stored in advance in a storage device (for example, the storage section 19 in FIG. 1).

ステップＳ１２：各クラスタのノード数の最大値の読み込み
本実施の形態では、クラスタリング処理部１は、クラスタに含まれるノード数の最大値を制限可能なクラスタリング手法を用いてクラスタリングを行う。そのため、クラスタリング処理部１は、クラスタに含まれるノード数の最大値を読み込む。クラスタに含まれるノード数の最大値は、情報処理装置１００のオペレータが必要に応じて与えてもよいし、記憶装置（例えば、図１の記憶部１９）に予め格納されていてもよい。 Step S12: Read the Maximum Number of Nodes in Each Cluster In the present embodiment, the clustering processing unit 1 performs clustering using a clustering method capable of limiting the maximum number of nodes included in a cluster. Therefore, the clustering processing unit 1 reads the maximum number of nodes included in the cluster. The maximum number of nodes included in the cluster may be given by the operator of the information processing apparatus 100 as needed, or may be stored in advance in the storage device (for example, the storage unit 19 in FIG. 1).

ステップＳ１３：クラスタリング
クラスタリング処理部１は、クラスタに含まれるノード数の最大値を参照して、クラスタリングを行う。クラスタに含まれるノード数の最大値を制限可能なクラスタリング手法としては、例えばLeiden Algorithm（非特許文献１）を用いることができる。但し、クラスタに含まれるノード数の最大値を制限可能であれば、Leiden Algorithm以外の手法を適宜用いることができるのは言うまでもない。なお、この手法では、ノードとエッジとで構成されるネットワーク構造を適宜クラスタリングすることが可能である。 Step S13: Clustering The clustering processing unit 1 performs clustering by referring to the maximum number of nodes included in the cluster. As a clustering method capable of limiting the maximum number of nodes included in a cluster, for example, the Leiden Algorithm (Non-Patent Document 1) can be used. However, it goes without saying that any method other than the Leiden Algorithm can be appropriately used as long as it is possible to limit the maximum number of nodes included in the cluster. In this method, it is possible to appropriately cluster a network structure composed of nodes and edges.

ステップＳ２：クラスタ特徴量算出
クラスタ特徴量算出部２は、クラスタリング処理後の各ノードについて、特徴量の算出を行う。ステップＳ２は、以下のステップＳ２１及びＳ２２を含む。 Step S2: Cluster Feature Amount Calculation The cluster feature amount calculator 2 calculates a feature amount for each node after the clustering process. Step S2 includes steps S21 and S22 below.

ステップＳ２１：各企業（ノード）の特徴量算出
クラスタ特徴量算出部２は、まず、学習用データＤＡＴに含まれる各企業のデータに基づいて、特徴量を算出する。本実施の形態では、以下で説明する９つの特徴量を用いる。図１１に、各ノードの特徴量の例を表形式で示す。具体的には、特徴量として、不正リスクスコアＲＳ、次数中心性ＤＣ、固有ベクトル中心性ＥＣ、次数中心性と不正リスクスコアとの積ＤＣ＊ＲＳ、固有ベクトル中心性と不正リスクスコアＲＳとの積ＥＣ＊ＲＳ、クラスタ係数ＣＣ、局所固有ベクトル中心性ＥＥＣ、隣接ノードの次数の総和ＣＴ及びグループ次数中心性ＧＤＣを用いるものとする。 Step S21: Feature Amount Calculation of Each Company (Node) The cluster feature amount calculator 2 first calculates a feature amount based on the data of each company included in the learning data DAT. In the present embodiment, nine feature amounts described below are used. FIG. 11 shows an example of the feature amount of each node in tabular form. Specifically, as features, fraud risk score RS, degree centrality DC, eigenvector centrality EC, product DC*RS of degree centrality and fraud risk score, product EC of eigenvector centrality and fraud risk score RS *RS, cluster coefficient CC, local eigenvector centrality EEC, sum of adjacent node degrees CT and group degree centrality GDC shall be used.

なお、特徴量の理解を容易にするために、図１２に、特徴量の算出に用いるネットワークの例を示す。図１２のネットワークは、１４個のノードＮａ１～Ｎａ１４を含み、ノードＮａ１～Ｎａ５がクラスタＣ１、ノードＮａ６～Ｎａ８がクラスタＣ２、ノードＮａ９～Ｎａ１４がクラスタＣ３に属している。 To facilitate understanding of the feature amount, FIG. 12 shows an example of a network used for calculating the feature amount. The network of FIG. 12 includes 14 nodes Na1 to Na14, with nodes Na1 to Na5 belonging to cluster C1, nodes Na6 to Na8 belonging to cluster C2, and nodes Na9 to Na14 belonging to cluster C3.

第１の特徴量：不正リスクスコアＲＳ First feature: fraud risk score RS
各ノードの不正リスクスコア、すなわち個々の企業の不正リスクスコアであり、個々の財務諸表などの企業のデータから、各種の取引不正のリスクを示す不正リスクスコアＲＳを算出する。取引不正の例としては、２社の間の相対で行われる取引不正（例えば、買戻し条件付きの押し込み販売）、数社が結託して行われる取引不正（例えば、架空循環取引）、多数の企業が関与した大規模な取引不正などが有る。不正リスクスコアＲＳの算出については、例えば非特許文献２にかかる手法を含む、各種の手法を用いることができる。 This is the fraud risk score of each node, that is, of each individual company. A fraud risk score RS indicating the risk of various kinds of fraudulent transactions is calculated from company data such as the individual financial statements. Examples of fraudulent transactions include fraud conducted bilaterally between two companies (for example, forced sales with a buy-back condition), fraud conducted by several colluding companies (for example, fictitious circular transactions), and large-scale fraud involving many companies. Various methods, including for example the method of Non-Patent Document 2, can be used to calculate the fraud risk score RS.

不正リスクスコアＲＳは、各企業に紐付けられたものであり、個々の企業が不正を行うリスクを示す指標である。しかし、取引不正のリスクを高い精度で検知するには、各企業と取引企業との関係も考慮する必要が有ると考え得る。そこで、本実施の形態では、以下で説明する特徴量を導入する。 The fraud risk score RS is associated with each company, and is an index indicating the risk of individual companies committing fraud. However, in order to detect the risk of fraudulent transactions with high accuracy, it may be necessary to consider the relationship between each company and the trading company. Therefore, in the present embodiment, the feature amount described below is introduced.

まず、着目した企業が、取引ネットワークにおいてどれほど中心的役割を担っているかを評価するため、以下の第２及び第３の特徴量を導入する。 First, the following second and third features are introduced in order to evaluate how much the focused company plays a central role in the transaction network.

第２の特徴量：次数中心性ＤＣ Second feature: degree centrality DC
各ノードの次数中心性（Degree Centrality）ＤＣは、各ノードの隣接するノードの数、すなわち、各ノードとエッジで接続されるノードの数に基づいて算出される。言うまでもないが、次数中心性ＤＣは、相対取引についての指標である。注目するノードに接続されたエッジの数をＥＤ、ネットワークに属するノードの総数をＮとすると、注目するノードの次数中心性ＤＣは、以下の式で表される。 The degree centrality DC of each node is calculated from the number of its adjacent nodes, that is, the number of nodes connected to it by an edge. The degree centrality DC is an index of bilateral transactions. Let ED be the number of edges connected to the node of interest and N be the total number of nodes in the network. The degree centrality DC of the node of interest is then expressed by the following equation.

$$DC = \frac{ED}{N - 1} \tag{1}$$

式［１］からわかるように、注目するノードに接続するエッジの数が多いほど、次数中心性ＤＣは大きな値となる。つまり、次数中心性ＤＣは、ネットワーク全体に含まれる企業に対して、着目した企業が直接取引をしている企業の割合を示す指標である。図１２のクラスタＣ２に属するノードＮａ６に注目すると、次数中心性は、４／（１４－１）＝０．３０７６．．．となる。 As can be seen from Equation [1], the more edges are connected to the node of interest, the larger the degree centrality DC. In other words, the degree centrality DC indicates the proportion of the companies in the entire network with which the company of interest trades directly. Focusing on node Na6 belonging to cluster C2 in FIG. 12, the degree centrality is 4/(14-1) = 0.3076....
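
The calculation of Equation [1] can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name and the edge list are assumptions, the latter chosen so that Na6 has the four neighbours (Na5, Na7, Na8, Na9) named in the text.

```python
def degree_centrality(node, edges, n_nodes):
    """DC = ED / (N - 1), where ED is the number of edges touching `node`."""
    ed = sum(1 for a, b in edges if node in (a, b))
    return ed / (n_nodes - 1)

# Assumed edges around node Na6 of FIG. 12 (only edges touching Na6 matter here).
edges = [("Na6", "Na5"), ("Na6", "Na7"), ("Na6", "Na8"), ("Na6", "Na9")]
dc_na6 = degree_centrality("Na6", edges, n_nodes=14)
print(round(dc_na6, 4))  # 0.3077
```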

第３の特徴量：固有ベクトル中心性ＥＣ Third feature: eigenvector centrality EC
各ノードの固有ベクトル中心性（Eigenvector Centrality）ＥＣは、以下で説明する隣接行列Ａについて最大の固有値と、最大の固有値に対応する固有ベクトルを求めることで算出される。言うまでもないが、固有ベクトル中心性ＥＣは、ネットワーク全体に含まれる企業に対して、着目した企業が、取引数の多い企業とどの程度取引があるかを示す指標である。隣接行列Ａは、ノード総数と同じ行数及び列数の正方行列であり、便宜上、行番号をｉ、列番号をｊ、ｉ行ｊ列の成分をＡ_ｉｊとする。言うまでもないが、行番号ｉ及び列番号ｊは、１以上Ｎ以下の整数である。 The eigenvector centrality EC of each node is calculated by finding the largest eigenvalue of the adjacency matrix A described below and the eigenvector corresponding to that eigenvalue. The eigenvector centrality EC is an index of the extent to which the company of interest trades with companies that themselves have many transactions, relative to the companies in the entire network. The adjacency matrix A is a square matrix whose numbers of rows and columns both equal the total number of nodes N. For convenience, let the row number be i, the column number be j, and the element in row i and column j be $A_{ij}$. The row number i and the column number j are integers from 1 to N.

$$A = \begin{pmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & \ddots & \vdots \\ A_{N1} & \cdots & A_{NN} \end{pmatrix} \tag{2}$$

成分Ａ_ｉｊの値は、隣接する２つのノードがエッジで直接的に接続されている場合に「１」、隣接する２つのノードがエッジで接続されていない場合に「０」となる。 The value of the element $A_{ij}$ is "1" when nodes i and j are directly connected by an edge, and "0" when they are not.

隣接行列の具体例を示す。図１３に、隣接行列の例示に用いるネットワークを示す。このネットワークでは、５つのノードが存在し、ノードＮｂ２がノードＮｂ１、Ｎｂ３及びＮｂ４とエッジで接続され、ノードＮｂ４がノードＮｂ５とエッジで接続されている。このときの隣接行列は、以下の式で表されることとなる。 A specific example of an adjacency matrix is shown. FIG. 13 shows the network used to illustrate the adjacency matrix. This network has five nodes: node Nb2 is connected by edges to nodes Nb1, Nb3 and Nb4, and node Nb4 is connected by an edge to node Nb5. The adjacency matrix in this case is expressed by the following equation.

$$A = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix} \tag{3}$$
次いで、以下の式を満たす、固有ベクトルｕと固有値λを算出する。 Next, an eigenvector u and an eigenvalue λ that satisfy the following equation are calculated.

$$A u = \lambda u \tag{4}$$

算出した固有値λから、最大の値を有する固有値λ_ＭＡＸを選択し、最大の固有値λ_ＭＡＸに対応する固有ベクトル（中心性ベクトルとも称する）ｕ_ＭＡＸの各成分ｕ_ｉを、ノードＮＤｉの特徴量である固有ベクトル中心性ＥＣｉとして算出する。例えば、式［３］の隣接行列Ａの最大の固有値λ_ＭＡＸは１．８４７７．．．となり、これに対応する固有ベクトルは、以下の式で表される。 From the calculated eigenvalues λ, the largest eigenvalue $\lambda_{MAX}$ is selected, and each component $u_i$ of the eigenvector $u_{MAX}$ corresponding to $\lambda_{MAX}$ (also referred to as the centrality vector) is taken as the eigenvector centrality ECi, the feature of node NDi. For example, the largest eigenvalue $\lambda_{MAX}$ of the adjacency matrix A in Equation [3] is 1.8477..., and the corresponding eigenvector (shown here normalized to unit length) is expressed by the following equation.

$$u_{MAX} \approx \begin{pmatrix} 0.3536 \\ 0.6533 \\ 0.3536 \\ 0.5000 \\ 0.2706 \end{pmatrix} \tag{5}$$
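
As a rough check of the example above, the largest eigenvalue and its eigenvector can be approximated with power iteration. This is only a sketch under stated assumptions: the text does not say which numerical method is used, the function name is hypothetical, and the iteration is run on A + I (a shift that leaves the eigenvectors unchanged) because plain power iteration would oscillate on this tree-shaped network.

```python
import math

def dominant_eigenpair(adj, iters=500):
    """Largest eigenvalue and unit-length eigenvector of a symmetric 0/1 matrix."""
    n = len(adj)
    u = [1.0] * n
    for _ in range(iters):
        # One step of power iteration on (A + I).
        v = [u[i] + sum(adj[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]
    au = [sum(adj[i][j] * u[j] for j in range(n)) for i in range(n)]
    lam = sum(a * b for a, b in zip(au, u))  # Rayleigh quotient; u has unit length
    return lam, u

# Adjacency matrix of the five-node network of FIG. 13 (order Nb1 to Nb5).
A = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
lam_max, ec = dominant_eigenpair(A)
print(round(lam_max, 4))  # 1.8478 (1.8477... before rounding)
```

The list `ec` then holds one eigenvector centrality value per node, in the order Nb1 to Nb5.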

次いで、着目した企業の不正リスクスコアと取引関係とを考慮した指標として、以下の第４及び第５の特徴量を導入する。 Next, the following fourth and fifth feature amounts are introduced as indexes considering the fraud risk score and business relationship of the company of interest.

第４の特徴量：次数中心性ＤＣと不正リスクスコアＲＳの積（ＤＣ＊ＲＳ） Fourth feature: product of degree centrality DC and fraud risk score RS (DC*RS)
各ノードについて、次数中心性ＤＣと不正リスクスコアＲＳの積ＤＣ＊ＲＳを算出し、これを各ノードの特徴量として用いる。次数中心性ＤＣと不正リスクスコアＲＳの積ＤＣ＊ＲＳは、着目した企業と直接取引している企業が多く、かつ、着目した企業自身の不正リスクが高いと、取引不正のリスクも高いことを示す指標である。 For each node, the product DC*RS of the degree centrality DC and the fraud risk score RS is calculated and used as a feature of that node. The product DC*RS is an index reflecting the idea that the risk of fraudulent transactions is high when the company of interest trades directly with many companies and its own fraud risk is high.

第５の特徴量：固有ベクトル中心性ＥＣと不正リスクスコアＲＳの積（ＥＣ＊ＲＳ） Fifth feature: product of eigenvector centrality EC and fraud risk score RS (EC*RS)
各ノードについて、固有ベクトル中心性ＥＣと不正リスクスコアＲＳの積ＥＣ＊ＲＳを算出し、これを各ノードの特徴量として用いる。固有ベクトル中心性ＥＣと不正リスクスコアＲＳの積ＥＣ＊ＲＳは、着目した企業が、取引数の多い企業と取引しており、かつ、着目した企業自身の不正リスクが高いと、取引不正のリスクも高いことを示す指標である。 For each node, the product EC*RS of the eigenvector centrality EC and the fraud risk score RS is calculated and used as a feature of that node. The product EC*RS is an index reflecting the idea that the risk of fraudulent transactions is high when the company of interest trades with companies that have many transactions and its own fraud risk is high.

次いで、着目した企業と、着目した企業と取引関係がある企業とが、どれほど網羅的な取引ネットワークを構成しているかを評価するため、以下の第６の特徴量であるクラスタ係数を導入する。 Next, in order to evaluate how comprehensive a transaction network the focused company and the companies that have business relationships with the focused company constitute, a cluster coefficient, which is the following sixth feature quantity, is introduced.

第６の特徴量：クラスタ係数（Clustering Coefficient）ＣＣ Sixth feature: cluster coefficient (Clustering Coefficient) CC
各ノードについて、クラスタ係数ＣＣを算出する。注目するノードを含む三角形の数をＴ（すなわち、各ノードとエッジで接続された２つの隣接ノード同士を接続するエッジの数）、注目するノードの次数（接続されたエッジの数）をＥＤとすると、注目するノードのクラスタ係数は、以下の式で表される。 A cluster coefficient CC is calculated for each node. Let T be the number of triangles containing the node of interest (that is, the number of edges that connect two adjacent nodes, each of which is connected to the node of interest by an edge), and let ED be the degree of the node of interest (the number of edges connected to it). The cluster coefficient of the node of interest is then expressed by the following equation.

$$CC = \frac{T}{ED(ED - 1)/2} = \frac{2T}{ED(ED - 1)} \tag{6}$$

式［６］において、分母は、各ノードとエッジで接続される２つの隣接ノード同士をエッジで接続する場合のエッジの最大数（各ノードとエッジで接続される隣接ノードで構成される組み合わせの数）である。したがって、クラスタ係数ＣＣは、２つの隣接ノード同士をエッジで接続する場合のエッジの最大数に対して、実際に２つの隣接ノード同士を接続しているエッジの数の比率を示している。換言すれば、クラスタ係数ＣＣは、各ノードに対応する企業と、これと取引関係を有する企業とが、どれほど密なネットワークを形成しているかを見積る指標として利用することができる。図１２のクラスタＣ２に属するノードＮａ６に注目すると、クラスタ係数は、１／（４＊３／２）＝０．１６６６．．．となる。 In Equation [6], the denominator is the maximum possible number of edges between pairs of nodes adjacent to the node of interest (the number of combinations of two nodes chosen from its adjacent nodes). The cluster coefficient CC therefore indicates the ratio of the number of edges that actually connect two adjacent nodes to that maximum. In other words, the cluster coefficient CC can be used as an index for estimating how dense a network is formed by the company corresponding to each node and the companies with which it trades. Focusing on node Na6 belonging to cluster C2 in FIG. 12, the cluster coefficient is 1/(4*3/2) = 0.1666....
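
The cluster coefficient of a single node can be sketched as below. The function name and the edge list are assumptions made for illustration. In particular, the text gives only T = 1 and ED = 4 for Na6, so the edge Na7-Na8 that closes the triangle is an inferred placement, not something the text states.

```python
from itertools import combinations

def clustering_coefficient(node, edges):
    """CC = T / (ED * (ED - 1) / 2) for an undirected edge list."""
    edge_set = {frozenset(e) for e in edges}
    neighbours = {n for e in edge_set if node in e for n in e if n != node}
    ed = len(neighbours)
    if ed < 2:
        return 0.0  # assumption: CC is taken as 0 when no neighbour pair exists
    # T: pairs of neighbours that are themselves connected by an edge.
    t = sum(1 for a, b in combinations(sorted(neighbours), 2)
            if frozenset((a, b)) in edge_set)
    return t / (ed * (ed - 1) / 2)

# Assumed edges: Na6 has four neighbours and Na7-Na8 closes one triangle.
edges = [(6, 5), (6, 7), (6, 8), (6, 9), (7, 8)]
cc_na6 = clustering_coefficient(6, edges)
print(round(cc_na6, 4))  # 0.1667
```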

次いで、数社が結託して行われる取引不正を検知するための指標を導入する。数社が結託して行われる取引不正としては、各企業の２社先の取引関係にかかる不正、例えば架空循環取引などが知られている。このような、２社先の取引関係にかかる不正を検知するための指標として、第７の特徴量である局所固有ベクトル中心性を導入する。 Next, an index for detecting fraudulent transactions conducted by several colluding companies is introduced. Known examples of such fraud include fraud involving trading relationships two steps away from each company, such as fictitious circular transactions. Local eigenvector centrality, the seventh feature, is introduced as an index for detecting fraud involving such two-step trading relationships.

第７の特徴量：局所固有ベクトル中心性ＥＥＣ
上述の固有ベクトル中心性ＥＣは１つ隣のノードまでのネットワークを対象として算出されるものであった。これに対し、ここでは、着目したノードと、１つ隣及び２つ隣のノードとで構成されるネットワークである局所ネットワーク（Egocentric Network: Egonet）を対象として、局所固有ベクトル中心性ＥＥＣ（Egonet Eigenvector Centrality）を算出する。 Seventh Feature Amount: Local Eigenvector Centrality EEC
The eigenvector centrality EC described above is calculated for the network up to the next node. On the other hand, here, for a local network (Egocentric Network: Egonet), which is a network composed of the node of interest and the one- and two-neighboring nodes, the local eigenvector centrality EEC (Egonet Eigenvector Centrality ) is calculated.

図１４に、局所ネットワークの例として、図１２の例においてノードＮａ６に着目した場合の局所ネットワークを示す。この例では、ノードＮａ６と、１つ隣のノードであるノードＮａ５、Ｎａ７、Ｎａ８及びＮａ９と、２つ隣のノードであるノードＮａ４、Ｎａ１０及びＮａ１３が局所ネットワークを構成する。この局所ネットワークに対して、ノードが隣接している場合に「１」、隣接していない場合に「０」となる成分からなる隣接行列Ｂを以下のように求める。なお、簡略化のため、行及び列に表示したノード番号は、「Ｎａ」を除く数字のみを表示している。 FIG. 14 shows, as an example of a local network, the local network obtained when focusing on node Na6 in the example of FIG. 12. In this example, the local network consists of node Na6, the nodes one step away (Na5, Na7, Na8 and Na9), and the nodes two steps away (Na4, Na10 and Na13). For this local network, an adjacency matrix B whose elements are "1" when two nodes are adjacent and "0" when they are not is obtained as follows. For simplicity, the node numbers shown for the rows and columns omit the prefix "Na".

$$B = \begin{array}{c|cccccccc} & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 13 \\ \hline 4 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 5 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 6 & 0 & 1 & 0 & 1 & 1 & 1 & 0 & 0 \\ 7 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\ 8 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\ 9 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 1 \\ 10 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 13 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \end{array} \tag{7}$$

次いで、以下の式を満たす固有値λを算出する。 Next, an eigenvalue λ that satisfies the following equation is calculated.

$$B u = \lambda u \tag{8}$$

算出した固有値λから最大の値を有する固有値λ_ＭＡＸを選択し、最大の固有値λ_ＭＡＸを局所固有ベクトル中心性ＥＥＣとして算出する。この例では、式［７］の隣接行列Ｂの最大の固有値λ_ＭＡＸである２．４６１４．．．が局所固有ベクトル中心性ＥＥＣとして算出される。 From the calculated eigenvalues λ, the largest eigenvalue $\lambda_{MAX}$ is selected and taken as the local eigenvector centrality EEC. In this example, the largest eigenvalue $\lambda_{MAX}$ of the adjacency matrix B in Equation [7], 2.4614..., is calculated as the local eigenvector centrality EEC.
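
The two-hop local network and its largest eigenvalue can be sketched as follows. Everything named here is an assumption for illustration: the function names, the use of shifted power iteration, and the edge list, which is inferred from the degrees and neighbours stated in the text rather than read from FIG. 12. The edges (4, 3) and (10, 11) are added only to show that nodes beyond two hops are left out.

```python
import math

def egonet(center, edges, radius=2):
    """Nodes within `radius` hops of `center` and the adjacency matrix among them."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    seen, frontier = {center}, {center}
    for _ in range(radius):
        frontier = {m for n in frontier for m in nbrs.get(n, ())} - seen
        seen |= frontier
    nodes = sorted(seen)
    adj = [[1 if b in nbrs.get(a, ()) else 0 for b in nodes] for a in nodes]
    return nodes, adj

def largest_eigenvalue(adj, iters=2000):
    """Shifted power iteration on (A + I), then the Rayleigh quotient of A."""
    n = len(adj)
    u = [1.0] * n
    for _ in range(iters):
        v = [u[i] + sum(adj[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]
    au = [sum(adj[i][j] * u[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(au, u))

# Assumed edges near Na6; (4, 3) and (10, 11) reach outside the two-hop egonet.
edges = [(6, 5), (6, 7), (6, 8), (6, 9), (7, 8), (5, 4), (9, 10), (9, 13),
         (4, 3), (10, 11)]
nodes, B = egonet(6, edges)
eec = largest_eigenvalue(B)
print(nodes)          # [4, 5, 6, 7, 8, 9, 10, 13]
print(round(eec, 4))  # 2.4615 (2.4614... before rounding)
```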

第８の特徴量：隣接ノードの次数の総和ＣＴ Eighth feature: sum CT of the degrees of adjacent nodes
隣接ノードの次数の総和ＣＴ、すなわち隣接ノードの取引関係数の総和（Co Transaction）を算出する。隣接ノードの次数の総和ＣＴは、着目した企業と直接取引している企業が、どれくらい他社と取引をしているかを示す指標である。図１２のクラスタＣ２に属するノードＮａ６に注目すると、隣接するノードＮａ５の次数が２、ノードＮａ７の次数が２、ノードＮａ８の次数が２、ノードＮａ９の次数が３なので、隣接ノードの次数の総和ＣＴは、２＋２＋２＋３＝９となる。 The sum CT of the degrees of the adjacent nodes, that is, the total number of trading relationships of the adjacent nodes (Co Transaction), is calculated. The sum CT indicates how extensively the companies that trade directly with the company of interest trade with other companies. Focusing on node Na6 belonging to cluster C2 in FIG. 12, the adjacent node Na5 has degree 2, node Na7 has degree 2, node Na8 has degree 2, and node Na9 has degree 3, so the sum CT of the degrees of the adjacent nodes is 2+2+2+3 = 9.
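
A short sketch of the CT calculation follows. The function name is hypothetical and the edge list is an assumption chosen to reproduce the degrees quoted in the text (Na5: 2, Na7: 2, Na8: 2, Na9: 3).

```python
def neighbour_degree_sum(node, edges):
    """CT: sum of the degrees of all nodes adjacent to `node`."""
    degree, neighbours = {}, set()
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
        if node in (a, b):
            neighbours.add(b if a == node else a)
    return sum(degree[n] for n in neighbours)

# Assumed edges consistent with the degrees stated for the neighbours of Na6.
edges = [(6, 5), (6, 7), (6, 8), (6, 9), (7, 8), (5, 4), (9, 10), (9, 13)]
ct_na6 = neighbour_degree_sum(6, edges)
print(ct_na6)  # 9
```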

ステップＳ２２：クラスタの特徴量算出
各クラスタについて、属するノードの各特徴量の平均値を算出し、算出した平均値を注目するクラスタの各特徴量とする。図１５に、各クラスタの特徴量の例を表形式で示す。なお、ここでは、特徴量として用いる統計量として平均値を採用したが、必要に応じて、最大値、最小値及び中央値などの他の統計量を用いてもよい。 Step S22: Calculation of Feature Amount of Cluster For each cluster, the average value of each feature amount of the node to which it belongs is calculated, and the calculated average value is used as each feature amount of the cluster of interest. FIG. 15 shows an example of the feature amount of each cluster in tabular form. Here, the mean value is used as the statistic used as the feature quantity, but other statistic such as maximum value, minimum value and median value may be used as necessary.

さらに、クラスタの特徴量として、第９の特徴量であるグループ次数中心性ＧＤＣを算出する。 Further, as the cluster feature amount, the group degree centrality GDC, which is the ninth feature amount, is calculated.

第９の特徴量：グループ次数中心性ＧＤＣ Ninth feature: group degree centrality GDC
クラスタリング処理によって生じたクラスタのそれぞれについて、グループ次数中心性ＧＤＣ（Group Degree Centrality）を算出し、注目するクラスタに含まれるノードの特徴量として用いる。注目するクラスタに含まれるノードと、注目するクラスタ以外のクラスタに含まれるノードとを接続するエッジの総数をＥＤ_ＥＸＴ、注目するクラスタ以外のクラスタに含まれるノードの総数をＮ_ＥＸＴとする。グループ次数中心性ＧＤＣは、以下の式を用いて算出する。 For each cluster produced by the clustering process, the group degree centrality GDC (Group Degree Centrality) is calculated and used as a feature of the nodes included in the cluster of interest. Let $ED_{EXT}$ be the total number of edges connecting a node in the cluster of interest to a node in any other cluster, and let $N_{EXT}$ be the total number of nodes in clusters other than the cluster of interest. The group degree centrality GDC is calculated using the following equation.

$$GDC = \frac{ED_{EXT}}{N_{EXT}} \tag{9}$$

図１２のクラスタＣ１に注目すると、クラスタＣ１のグループ次数中心性ＧＤＣは１／９＝０．１１１１．．．となる。 Focusing on cluster C1 in FIG. 12, the group degree centrality GDC of cluster C1 is 1/9 = 0.1111....
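
The group degree centrality can be sketched as a ratio of two counts. The function name is hypothetical. The edges inside cluster C1 are illustrative assumptions (FIG. 12 is not reproduced in the text); the only property used is that Na5-Na6 is the single edge leaving C1.

```python
def group_degree_centrality(cluster, edges, all_nodes):
    """GDC = ED_EXT / N_EXT for one cluster given as a set of node ids."""
    # An edge is external when exactly one of its endpoints lies in the cluster.
    ed_ext = sum(1 for a, b in edges if (a in cluster) != (b in cluster))
    n_ext = len(set(all_nodes) - set(cluster))
    return ed_ext / n_ext

all_nodes = range(1, 15)  # Na1 to Na14
c1 = {1, 2, 3, 4, 5}
# Assumed edges: intra-cluster edges are illustrative; (5, 6) is the only edge leaving C1.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (6, 8), (6, 9)]
gdc_c1 = group_degree_centrality(c1, edges, all_nodes)
print(round(gdc_c1, 4))  # 0.1111
```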

ステップＳ３ Step S3
不正フラグ付与部３は、過去に不正があったノードを含むクラスタに対してフラグを付与する。具体的には、図１５に示す様に、不正フラグ付与部３は、過去に不正があったノードを含むクラスタの不正フラグを「１」、不正が有ったノードを含まないクラスタの不正フラグを「０」に設定する。不正フラグは、学習用データＤＡＴに含まれる不正履歴情報を示す列を参照し、不正が有った場合に不正フラグとして「１」を付与し、不正が無かった場合に不正フラグとして「０」を付与するというように、数値データによって表されてもよい。不正履歴情報に対応する列が複数有る場合には、参照する列の全てが不正が有ったことを示している場合に不正フラグ「１」を付与してもよいし、参照する列のいずれかに不正が有ったことを示している場合に不正フラグ「１」を付与してもよい。なお、不正フラグのデータ形式は数値データに限られず、例えば、ブーリアン型の変数など、各種の形式で表現されてもよい。 The fraud flag assigning unit 3 assigns a flag to each cluster that includes a node with a past record of fraud. Specifically, as shown in FIG. 15, the fraud flag assigning unit 3 sets the fraud flag to "1" for a cluster that includes a node with past fraud and to "0" for a cluster that includes no such node. The fraud flag may be expressed as numerical data: the column holding the fraud history information in the learning data DAT is referenced, and "1" is assigned as the fraud flag when fraud occurred and "0" when it did not. When there are a plurality of columns corresponding to the fraud history information, the fraud flag "1" may be assigned when all of the referenced columns indicate fraud, or when any one of the referenced columns indicates fraud. The data format of the fraud flag is not limited to numerical data and may be any of various formats, such as a Boolean variable.
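
The flag assignment, including the two options for multiple fraud history columns, can be sketched as below. The function name, the `mode` parameter, and the column names are hypothetical and chosen only for illustration.

```python
def cluster_fraud_flag(member_rows, fraud_columns, mode="any"):
    """Return 1 if the cluster contains a node with past fraud, else 0.

    member_rows: one dict per node, mapping fraud column name -> 0 or 1.
    mode="any": a node counts as fraudulent if any referenced column is 1.
    mode="all": a node counts as fraudulent only if every referenced column is 1.
    """
    combine = any if mode == "any" else all
    return int(any(combine(row[c] == 1 for c in fraud_columns)
                   for row in member_rows))

# Hypothetical two-node cluster with two fraud history columns.
cluster = [
    {"fictitious_sales": 0, "circular_trading": 0},
    {"fictitious_sales": 1, "circular_trading": 0},
]
cols = ["fictitious_sales", "circular_trading"]
flag_any = cluster_fraud_flag(cluster, cols, mode="any")  # 1
flag_all = cluster_fraud_flag(cluster, cols, mode="all")  # 0
```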

ステップＳ４：モデル構築
モデル構築部４は、各クラスタの特徴量を説明変数、不正フラグを目的変数とする学習済みモデルの構築を行う。ここでは、学習済みモデルを示す処理ｆを構築するため、例えばロジスティック回帰を用いて学習済みモデルを構築する。なお、学習済みモデル構築には、ロジスティック回帰のみならず、ランダムフォレスト、サポートベクトル回帰など、各種の教師有り学習手法を適宜用いることができる。ステップＳ４は、以下のステップＳ４０～Ｓ４５を含む。 Step S4: Model Construction The model construction unit 4 constructs a learned model using the feature amount of each cluster as an explanatory variable and the fraud flag as an objective variable. Here, in order to construct the process f indicating the trained model, the trained model is constructed using logistic regression, for example. Note that not only logistic regression but also various supervised learning methods such as random forest and support vector regression can be appropriately used for building a trained model. Step S4 includes steps S40 to S45 below.

ステップＳ４０：処理回数初期値設定
処理回数ＮＵＭの初期値として、「１」を設定する。 Step S40: Initial value setting for the number of times of processing "1" is set as the initial value of the number of times of processing NUM.

ステップＳ４１：ノード抽出
不正フラグが「１」のクラスタから一定の割合のクラスタを抽出し、かつ、不正フラグが「０」のクラスタから一定の割合のクラスタを抽出して、抽出したクラスタを学習用データとして用いる。残りのクラスタはテスト用データとして用いる。本実施の形態では、例として、一定の割合を７割とする。但し、一定の割合の値は７割に限られるものではなく、任意の割合としてもよい。 Step S41: Node Extraction A certain percentage of clusters are extracted from clusters with fraud flags of “1” and a certain percentage of clusters are extracted from clusters with fraud flags of “0”, and the extracted clusters are used for learning. Used as data. The remaining clusters are used as test data. In this embodiment, as an example, the fixed ratio is 70%. However, the fixed ratio value is not limited to 70%, and may be any ratio.
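
The extraction in step S41 amounts to a split that samples the same proportion from each flag class. A minimal sketch follows; the function name, the fixed random seed, and the toy flag data are assumptions.

```python
import random

def stratified_split(flags, ratio=0.7, seed=0):
    """Split cluster ids into learning and test sets, taking `ratio` of each flag class."""
    rng = random.Random(seed)
    train, test = [], []
    for value in (0, 1):
        ids = [cid for cid, flag in flags.items() if flag == value]
        rng.shuffle(ids)
        cut = round(len(ids) * ratio)
        train += ids[:cut]   # learning data
        test += ids[cut:]    # remaining clusters become test data
    return train, test

# Toy data: cluster ids 0-9 carry fraud flag 1, ids 10-39 carry fraud flag 0.
flags = {cid: int(cid < 10) for cid in range(40)}
train_ids, test_ids = stratified_split(flags)
print(len(train_ids), len(test_ids))  # 28 12
```

Using a different `seed` on each repetition would give the re-extraction of learning and test clusters that the test loop relies on.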

ステップＳ４２：学習用データの特徴量標準化 Step S42: Standardization of the features of the learning data
学習用データについて、各クラスタの特徴量の標準化を行う。ここでは、各クラスタに含まれる特徴量の個数をＭ（上述の例ではＭ＝９）、クラスタに含まれる各特徴量をｘ_ｋ（ｋは、１以上Ｍ以下の整数）、抽出されたクラスタの特徴量ｘ_ｋの平均値をｘ_{ｋ＿ＡＶＥ}、抽出された特徴量ｘ_ｋの標準偏差をσ_ｘｋとする。このとき、標準化された特徴量ｘ_ｋｓは、以下の式で表される。 For the learning data, the features of each cluster are standardized. Here, let M be the number of features of each cluster (M = 9 in the above example), $x_k$ be each feature of a cluster (k is an integer from 1 to M), $x_{k\_AVE}$ be the mean of the feature $x_k$ over the extracted clusters, and $\sigma_{xk}$ be the standard deviation of the feature $x_k$ over the extracted clusters. The standardized feature $x_{ks}$ is then expressed by the following equation.

$$x_{ks} = \frac{x_k - x_{k\_AVE}}{\sigma_{xk}} \tag{10}$$

ステップＳ４３：学習用データの特徴量除外 Step S43: Exclusion of features from the learning data
標準化された特徴量ｘ_１ｓ～ｘ_９ｓから選んだ２つの特徴量の全ての組み合わせについて、相関ＣＲＲを算出する。そして、相関ＣＲＲが所定値ＴＨよりも大きい場合には、選んだ２つの特徴量の一方を学習用データから除外する。本実施の形態では、例えば所定値ＴＨを０．９とする。この場合に、標準化された特徴量ｘ_５ｓと標準化された特徴量ｘ_９ｓとの相関が０．９５である場合には、標準化された特徴量ｘ_９ｓを学習用データから除外する。 Correlations CRR are calculated for all pairs of features selected from the standardized features $x_{1s}$ to $x_{9s}$. When the correlation CRR of a pair is greater than a predetermined value TH, one of the two features is excluded from the learning data. In the present embodiment, the predetermined value TH is, for example, 0.9. In this case, if the correlation between the standardized feature $x_{5s}$ and the standardized feature $x_{9s}$ is 0.95, the standardized feature $x_{9s}$ is excluded from the learning data.
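
Standardization followed by correlation-based exclusion can be sketched as follows. Several points are assumptions not fixed by the text: the population standard deviation is used, the Pearson correlation is used as CRR, and of a correlated pair the later feature is the one dropped (which matches the x5/x9 example). The function names and the toy values are hypothetical, and a feature with zero variance would need separate handling.

```python
import math

def standardize(column):
    """Return (x - mean) / sd for one feature across the extracted clusters."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / sd for x in column]

def correlation(za, zb):
    """Pearson correlation of two already standardized columns."""
    return sum(a * b for a, b in zip(za, zb)) / len(za)

def drop_correlated(features, th=0.9):
    """Keep features in order; drop one whose correlation with a kept one exceeds th."""
    z = {name: standardize(col) for name, col in features.items()}
    kept = []
    for name in z:
        if all(correlation(z[name], z[k]) <= th for k in kept):
            kept.append(name)
    return kept

# Toy feature columns over four clusters; x9 is nearly collinear with x5.
features = {
    "x5": [1.0, 2.0, 3.0, 4.0],
    "x6": [2.0, 1.0, 4.0, 3.0],
    "x9": [1.1, 2.0, 3.2, 3.9],
}
kept = drop_correlated(features)
print(kept)  # ['x5', 'x6']
```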

ステップＳ４４：重み算出 Step S44: Weight calculation
次のステップＳ４５にて重み付けロジスティック回帰によって学習処理を行うために、データに付与する重みを算出する。ここで、抽出したクラスタ総数をｎ、不正フラグが「１」のクラスタの数をｎ_１、不正フラグが「０」のクラスタの数をｎ_０としたとき、不正フラグが「１」のクラスタの標準化された特徴量に対する重みとして、ｎ／２ｎ_１、不正フラグが「０」のクラスタの標準化された特徴量に対する重みとしてｎ／２ｎ_０を算出する。 Weights to be applied to the data are calculated so that the learning process in the next step S45 can be performed by weighted logistic regression. Let n be the total number of extracted clusters, $n_1$ the number of clusters whose fraud flag is "1", and $n_0$ the number of clusters whose fraud flag is "0". Then $n/(2n_1)$ is calculated as the weight for the standardized features of clusters whose fraud flag is "1", and $n/(2n_0)$ as the weight for those of clusters whose fraud flag is "0".
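
The weight calculation can be written directly from the definitions of n, n_1 and n_0. The function name and the toy flags are assumptions. With these weights the two classes contribute equally in total, which is the usual purpose of such class balancing.

```python
def class_weights(flags):
    """Per-cluster weights n/(2*n1) for flag 1 and n/(2*n0) for flag 0."""
    n = len(flags)
    n1 = sum(flags)
    n0 = n - n1
    w1, w0 = n / (2 * n1), n / (2 * n0)
    return [w1 if y == 1 else w0 for y in flags]

flags = [1, 0, 0, 0]            # toy data: one flagged cluster out of four
weights = class_weights(flags)
print(weights[0])               # 2.0
```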

ステップＳ４５：学習処理 Step S45: Learning process
上述したように、学習用データを用いて、例えばロジスティック回帰によって学習済みモデルを構築する。ここでは、ｎ個のクラスタに含まれる各クラスタの不正フラグをｙ_ｉ（ｉは、１からｎまでの整数）、各クラスタにおいて不正フラグが「１」となる確率をｐ_ｉ（０＜ｐ_ｉ＜１）とする。各クラスタにおいて不正フラグが付与される事象が独立であるとすると、ステップＳ３においてｎ個のクラスタにそれぞれ付与したｎ個の不正フラグからなる順列が得られる確率である尤度Ｌは、以下の式で表される。 As described above, a trained model is constructed from the learning data by, for example, logistic regression. Let $y_i$ be the fraud flag of each of the n clusters (i is an integer from 1 to n), and let $p_i$ ($0 < p_i < 1$) be the probability that the fraud flag of each cluster is "1". Assuming that the events of a fraud flag being assigned to each cluster are independent, the likelihood L, which is the probability of obtaining the sequence of n fraud flags assigned to the n clusters in step S3, is expressed by the following equation.

$$L = \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{1 - y_i} \tag{11}$$

ここでは、式［１１］から、式［１２］に示す対数尤度ｌｏｇＬを用いる。 Here, the log-likelihood log L shown in Equation [12], derived from Equation [11], is used.

$$\log L = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] \tag{12}$$

ここで、式［１２］の右辺の各項にステップＳ４４で算出した式［１３］に示す重みｗ_ｉを乗じて、式［１２］を式［１４］に変換する。 Here, each term on the right-hand side of Equation [12] is multiplied by the weight $w_i$ shown in Equation [13], calculated in step S44, to transform Equation [12] into Equation [14].

$$w_i = \begin{cases} \dfrac{n}{2 n_1} & (y_i = 1) \\[2ex] \dfrac{n}{2 n_0} & (y_i = 0) \end{cases} \tag{13}$$

$$\log L = \sum_{i=1}^{n} w_i \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] \tag{14}$$

上述の通り、本実施の形態ではロジスティック回帰を行うので、不正フラグが「１」となる確率ｐ_ｉは、以下の式で表される。 As described above, since logistic regression is performed in the present embodiment, the probability $p_i$ that the fraud flag is "1" is expressed by the following equation.

$$p_i = \frac{1}{1 + \exp\left\{ -\left( \beta_0 + \sum_{k} \beta_k x_{i\_k} \right) \right\}} \tag{15}$$

式［１５］において、特徴量ｘ_ｉ＿ｋは、ｉ番目のクラスタのｋ番目の特徴量を示している。 In Equation [15], the feature $x_{i\_k}$ denotes the k-th feature of the i-th cluster.

以上の条件の下で回帰分析を行い、対数尤度ｌｏｇＬを最大にするβ_０～β_ｎを求めることで、学習済みモデルを構築することができる。 A trained model can be constructed by performing regression analysis under the above conditions and finding the coefficients $\beta_0$ to $\beta_n$ that maximize the log-likelihood log L.
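
The maximization of the weighted log-likelihood can be sketched with plain gradient ascent. This is one possible optimizer, chosen here for brevity. The text does not specify the solver, and the function names, learning rate, epoch count, and toy data are all assumptions.

```python
import math

def fit_weighted_logistic(X, y, w, lr=0.1, epochs=2000):
    """Maximise sum_i w_i [y_i log p_i + (1 - y_i) log(1 - p_i)] by gradient ascent."""
    m = len(X[0])
    beta = [0.0] * (m + 1)  # beta[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (m + 1)
        for xi, yi, wi in zip(X, y, w):
            z = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = wi * (yi - p)  # derivative of the weighted term with respect to z
            grad[0] += err
            for k, x in enumerate(xi, start=1):
                grad[k] += err * x
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

def predict(beta, xi):
    """Probability that the fraud flag is 1 for one standardized feature vector."""
    z = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: one standardized feature, six clusters, two of them flagged.
X = [[-1.0], [-0.5], [0.0], [0.5], [1.0], [1.5]]
y = [0, 0, 0, 1, 0, 1]
n, n1 = len(y), sum(y)
w = [n / (2 * n1) if yi == 1 else n / (2 * (n - n1)) for yi in y]
beta = fit_weighted_logistic(X, y, w)
```

After fitting, `predict(beta, [1.5])` is larger than `predict(beta, [-1.0])`, reflecting that the flagged clusters in the toy data have larger feature values.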

ステップＳ５：テスト
テスト処理部５は、以下の手順で、学習済みモデルのテストを行う。ここでは、以下の手順で、テストを１００回繰り返すものとする。 Step S5: Test The test processing unit 5 tests the trained model according to the following procedure. Here, it is assumed that the test is repeated 100 times in the following procedure.

ステップＳ５１：テスト用データの特徴量除外
テスト用データから、ステップＳ４３で除外された特徴量を同様に除外する。 Step S51: Exclusion of feature amount from test data The feature amount excluded in step S43 is similarly excluded from the test data.

ステップＳ５２：テスト用データの特徴量標準化
テスト用データについて、学習用データの場合と同様に、各クラスタの特徴量の標準化を行う。ここでは、除外処理後の各クラスタに含まれる特徴量の個数をｍ（ｍは、Ｍ以下の整数）、各クラスタに含まれる各特徴量をｙ_ｋ（ｋは、１以上ｍ以下の整数）とする。このとき、標準化された特徴量ｙ_ｋｓは、以下の式で表される。

Step S52: Standardization of Feature Quantity of Test Data For the test data, the feature quantity of each cluster is standardized as in the case of the learning data. Here, the number of feature amounts included in each cluster after exclusion processing is m (m is an integer of M or less), and each feature amount included in each cluster is y _k (k is an integer of 1 or more and m or less) and At this time, the standardized feature amount y _ks is represented by the following formula.

ステップＳ５３：結果出力
学習済みモデルに、テスト用データの標準化された特徴量を有するクラスタを投入し、結果を取得する。 Step S53: Result output A cluster having the standardized feature amount of the test data is input to the trained model, and the result is obtained.

ステップＳ５４：ＡＵＣ算出
出力結果と、投入したノードに付された実際の不正フラグと、を比較してＲＯＣ（Receiver Operating Characteristic）曲線を取得し、ＲＯＣ曲線を用いてＡＵＣ（Area Under Curve）を算出する。 Step S54: AUC calculation Output results are compared with the actual fraud flag attached to the input node to obtain an ROC (Receiver Operating Characteristic) curve, and AUC (Area Under Curve) is calculated using the ROC curve. do.

ＲＯＣ曲線とＡＵＣについて簡潔に説明する。図１６に、ＲＯＣ曲線とＡＵＣの一例を示す。ＲＯＣ曲線は、真陽性の割合と偽陽性の割合として定義される点が描く軌跡に対応する曲線である。ＲＯＣ曲線の縦軸は真陽性の割合（True Positive Rate）であり、検出結果の横軸上に設定した閾値以上の範囲におけるpositiveを示す実線Ｐと横軸とに囲まれる部分の面積に対応する。ＲＯＣ曲線の横軸は偽陽性の割合（False Positive Rate）であり、予測結果の横軸上に設定した閾値以上の範囲におけるnegativeを示す破線Ｎと横軸とに囲まれる部分の面積に対応する。 The ROC curve and the AUC are briefly described. FIG. 16 shows an example of the ROC curve and the AUC. The ROC curve is the locus traced by points defined by the true positive rate and the false positive rate. The vertical axis of the ROC curve is the true positive rate, which corresponds to the area of the region enclosed by the horizontal axis and the solid line P representing positives, in the range at or above a threshold set on the horizontal axis of the detection results. The horizontal axis of the ROC curve is the false positive rate, which corresponds to the area of the region enclosed by the horizontal axis and the dashed line N representing negatives, in the range at or above the threshold set on the horizontal axis of the prediction results.

例として、ＲＯＣ曲線の横軸上に閾値ＴＨを設定し、閾値ＴＨに対応するＲＯＣ曲線上の点Ｐを示した。点Ｐにおける真陽性の割合（True Positive Rate）ＴＰＲ１は、検出結果の横軸上に設定した閾値ＴＨ以上の範囲におけるpositiveを示す実線Ｐと横軸とに囲まれる部分（細線ハッチングが施された部分）の面積に対応する。点Ｐにおける偽陽性の割合（False Positive Rate）ＦＰＲ１は、検出結果の横軸上に設定した閾値ＴＨ以上の範囲におけるnegativeを示す破線Ｎと横軸とに囲まれる部分（太線ハッチングが施された部分）の面積に対応する。 As an example, a threshold TH is set on the horizontal axis of the ROC curve, and a point P on the ROC curve corresponding to the threshold TH is shown. The true positive rate (True Positive Rate) TPR1 at the point P is the portion surrounded by the solid line P and the horizontal axis indicating positive in the range above the threshold TH set on the horizontal axis of the detection result (thin hatched part). The false positive rate (False Positive Rate) FPR1 at the point P is the portion surrounded by the dashed line N indicating negative in the range above the threshold TH set on the horizontal axis of the detection result and the horizontal axis (thick hatched part).

The AUC is the area of the portion below the ROC curve (the hatched portion). In general, the AUC is 0.5 when events occur at random, and it approaches 1 as the accuracy of predicting the occurrence and non-occurrence of events increases.
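The relationship between the threshold TH, the two rates, and the AUC can be illustrated with a minimal sketch. It assumes that each sample has a score and a 0/1 label and that a sample is treated as positive when its score is at or above TH. The function names and the sample data are hypothetical and are not taken from the specification.

```python
from typing import List, Tuple

def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by sweeping the threshold TH over the scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # TH above every score: nothing is flagged
    for th in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score is at or above TH.
        tp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Illustrative scores and labels only (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
pts = roc_points(scores, labels)
print(auc(pts))  # 0.8125
```

A scorer that ranks every positive above every negative yields an AUC of 1, and a scorer with no ranking ability yields about 0.5, which matches the general behavior described above.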

Step S55: Update test count
The test count NUM is incremented by one.

Step S56: Termination determination
If the test count NUM is less than 100, the process returns to step S41. If the test count NUM is 100, the process ends.

By repeating steps S41 to S45 and S51 to S56, the clusters included in the learning data and the clusters included in the test data are re-extracted for each test, and the test is repeated. This averages out the variation caused by cluster extraction, so that the model can be evaluated more accurately.

Note that the number of tests is merely an example; it may be fewer than 100 or more than 100.
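The repeated test loop can be sketched as follows. This is only an illustration under stated assumptions: each cluster is reduced to a single hypothetical feature `x` and a fraud flag, `fit` and `evaluate` are toy stand-ins for the model construction and test processing, and the 0.7 extraction ratio is an assumed value for the predetermined ratio.

```python
import random
from statistics import mean

def stratified_split(clusters, train_ratio, rng):
    """Randomly split flagged and unflagged clusters at the same ratio."""
    train, test = [], []
    for flag in (1, 0):
        group = [c for c in clusters if c["flag"] == flag]
        rng.shuffle(group)
        k = round(len(group) * train_ratio)
        train.extend(group[:k])   # extracted clusters: learning data
        test.extend(group[k:])    # clusters not extracted: test data
    return train, test

def fit(train):
    """Toy stand-in for model construction: learn the direction of feature x."""
    pos = [c["x"] for c in train if c["flag"] == 1]
    neg = [c["x"] for c in train if c["flag"] == 0]
    return 1.0 if mean(pos) >= mean(neg) else -1.0

def evaluate(model, test):
    """Rank-based AUC: share of (flagged, unflagged) pairs ordered correctly."""
    pos = [model * c["x"] for c in test if c["flag"] == 1]
    neg = [model * c["x"] for c in test if c["flag"] == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def repeated_tests(clusters, num_tests=100, train_ratio=0.7, seed=0):
    rng = random.Random(seed)
    aucs = []
    num = 0  # test count NUM
    while num < num_tests:  # S56: stop once NUM reaches num_tests
        train, test = stratified_split(clusters, train_ratio, rng)  # re-extract each test
        model = fit(train)
        aucs.append(evaluate(model, test))
        num += 1  # S55: update test count
    return aucs

# Synthetic clusters for demonstration only.
clusters = ([{"x": 1.5 + 0.1 * i, "flag": 1} for i in range(40)]
            + [{"x": 0.1 * i, "flag": 0} for i in range(40)])
aucs = repeated_tests(clusters)
print(len(aucs), round(mean(aucs), 3))
```

Averaging the per-test AUC values is what smooths out the variation introduced by any single random extraction.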

Next, an example of the performance evaluation will be described. In this embodiment, the maximum number of nodes belonging to a cluster in the clustering of step S1 was varied over three levels, 1000, 500, and 100, and the performance evaluation results were compared. In the following, any cluster that ended up with only one node as a result of the clustering is excluded.
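The exclusion of single-node clusters can be sketched as a simple filter. The mapping from a cluster identifier to its node list is an assumed data layout, and the function names are hypothetical.

```python
def drop_singleton_clusters(clusters):
    """Keep only clusters that contain two or more nodes."""
    return {cid: nodes for cid, nodes in clusters.items() if len(nodes) > 1}

def largest_cluster_size(clusters):
    """Size of the largest cluster, or 0 when there are no clusters."""
    return max((len(nodes) for nodes in clusters.values()), default=0)

# Illustrative clustering result: cluster id -> company nodes.
clusters = {"c1": ["A", "B", "C"], "c2": ["D"], "c3": ["E", "F"]}
kept = drop_singleton_clusters(clusters)
print(sorted(kept))  # ['c1', 'c3']

# The clustering step is expected to respect the chosen maximum (1000, 500 or 100).
assert largest_cluster_size(kept) <= 1000
```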

Evaluation case A
In evaluation case A, the maximum number of nodes belonging to a cluster in the clustering of step S1 was set to 1000. Under this condition, there were 2522 clusters with the fraud flag "1" and 47 clusters with the fraud flag "0". The test was performed 100 times on this data, and the AUC was calculated. For comparison with the results obtained with the test data, the AUC was also calculated with the learning data input to the constructed model.

FIG. 17 shows the adoption rate of each feature amount in evaluation case A. In this example, of the nine feature amounts, the group degree centrality GDC was never adopted, and the eigenvector centrality EC was occasionally not adopted (with a probability of a few percent). The other seven feature amounts were adopted 100% of the time.
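An adoption rate of this kind can be computed by counting, over all tests, how often each feature amount was selected by the model. The sketch below assumes that each test yields the set of selected feature names. All abbreviations other than GDC and EC are hypothetical, and the selection pattern is toy data shaped like the description above, not the measurements of FIG. 17.

```python
from collections import Counter

# Nine feature names; only GDC and EC are abbreviations used in the text.
FEATURES = ["RS", "DC", "EC", "RSxDC", "ECxDC", "CC", "LEC", "SND", "GDC"]

def adoption_rates(selected_per_run, features):
    """Percentage of runs in which each feature was selected."""
    counts = Counter(f for selected in selected_per_run for f in set(selected))
    n_runs = len(selected_per_run)
    return {f: 100.0 * counts[f] / n_runs for f in features}

# Toy input: 100 runs, GDC never selected, EC dropped in 3 runs.
runs = []
for i in range(100):
    selected = set(FEATURES) - {"GDC"}
    if i % 40 == 0:  # runs 0, 40 and 80
        selected.discard("EC")
    runs.append(selected)

rates = adoption_rates(runs, FEATURES)
print(rates["GDC"], rates["EC"], rates["DC"])  # 0.0 97.0 100.0
```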

FIG. 18 shows the AUC obtained with the learning data and the AUC obtained with the test data in evaluation case A. In this example, the average AUC was 0.798 with the learning data and 0.772 in the tests with the test data. The AUC for the test data stayed roughly between 0.7 and 0.8, confirming that transaction fraud within a cluster can be detected with high accuracy.

Evaluation case B
In evaluation case B, the maximum number of nodes belonging to a cluster in the clustering of step S1 was set to 500. Under this condition, there were 4695 clusters with the fraud flag "1" and 49 clusters with the fraud flag "0". As in evaluation case A, the test was performed 100 times on this data using the learning data and the test data, and the AUC was calculated.

FIG. 19 shows the adoption rate of each feature amount in evaluation case B. In this example, of the nine feature amounts, the group degree centrality GDC was never adopted, and the other eight feature amounts were adopted 100% of the time.

FIG. 20 shows the AUC obtained with the learning data and the AUC obtained with the test data in evaluation case B. In this example, the average AUC was 0.778 with the learning data and 0.730 in the tests with the test data. The AUC for the test data stayed roughly between 0.7 and 0.8, confirming that transaction fraud within a cluster can be detected with high accuracy in evaluation case B as well.

Evaluation case C
In evaluation case C, the maximum number of nodes belonging to a cluster in the clustering of step S1 was set to 100. Under this condition, there were 20296 clusters with the fraud flag "1" and 55 clusters with the fraud flag "0". As in evaluation cases A and B, the test was performed 100 times on this data using the learning data and the test data, and the AUC was calculated.

FIG. 21 shows the adoption rate of each feature amount in evaluation case C. In this example, of the nine feature amounts, the group degree centrality GDC was never adopted, and the other eight feature amounts were adopted 100% of the time.

FIG. 22 shows the AUC obtained with the learning data and the AUC obtained with the test data in evaluation case C. In this example, the average AUC was 0.714 with the learning data and 0.761 in the tests with the test data. The AUC for the test data stayed roughly between 0.6 and 0.7, confirming that the transaction fraud detection accuracy is lower than in evaluation cases A and B.

Comparing evaluation cases A to C shows that making the maximum number of nodes belonging to a cluster too small lowers the transaction fraud detection accuracy. From the above test results, it can be inferred that good detection accuracy can be achieved by setting the maximum number of nodes belonging to a cluster to at least several hundred, preferably 500 or more.

Therefore, to examine what maximum number of nodes per cluster is suitable for achieving good transaction fraud detection accuracy, the variation of the AUC was observed while the maximum number of nodes belonging to a cluster was changed. FIG. 23 shows the relationship between the maximum number of nodes belonging to a cluster and the AUC. Here, the maximum number was varied in the range of 100 to 25000. Specifically, it was set to 100, 500, and 1000, and then varied in steps of 1000 in the range of 1000 to 25000.

As a result, when the maximum number of nodes belonging to a cluster was in the range of 5000 to 20000, the AUC tended to stabilize at a high value of about 0.87 to 0.90.
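A sweep over the maximum cluster size can be sketched as follows. The candidate grid follows the description (100, 500, then 1000 to 25000 in steps of 1000). The function `placeholder_auc` is a made-up curve used only so that the sketch runs; it does not reproduce the measurements of FIG. 23. The 0.03 tolerance used to pick a stable range is likewise an assumption.

```python
def max_node_grid():
    """Candidate maxima: 100, 500, then 1000 to 25000 in steps of 1000."""
    return [100, 500] + list(range(1000, 25001, 1000))

def sweep(evaluate_auc, grid):
    """Return {max_nodes: AUC} for every candidate value."""
    return {m: evaluate_auc(m) for m in grid}

def stable_range(auc_by_size, tolerance=0.03):
    """Candidates whose AUC is within `tolerance` of the best AUC."""
    best = max(auc_by_size.values())
    return [m for m, a in sorted(auc_by_size.items()) if best - a <= tolerance]

def placeholder_auc(max_nodes):
    # Made-up curve for demonstration only; NOT the measurements in FIG. 23.
    return 0.9 - abs(max_nodes - 12000) / 100000.0

grid = max_node_grid()
result = sweep(placeholder_auc, grid)
best_size = max(result, key=result.get)
print(len(grid), best_size)  # 27 12000
```

In practice `evaluate_auc` would rerun the clustering, feature calculation, flag assignment, model construction, and testing for each candidate value and return the resulting evaluation index.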

As described above, this configuration makes it possible to estimate the risk of transaction fraud involving multiple nodes (companies) in a cluster, which is difficult to do by estimating the fraud risk of individual companies alone. This makes it possible to suitably estimate the risk of transaction fraud that involves multiple companies, such as circular transactions.

Embodiment 2
An information processing apparatus according to Embodiment 2 will be described. Embodiment 1 described the construction of a trained model for detecting transaction fraud using the account items included in the financial statements of individual companies, transaction information, and the like. This embodiment describes a configuration that uses the constructed trained model to estimate the risk that a company under analysis commits fraud, by inputting input data indicating information on that company into the trained model.

FIG. 24 schematically shows the configuration of an information processing apparatus 200 according to Embodiment 2. The information processing apparatus 200 has a configuration in which the test processing unit 5 of the information processing apparatus 100 according to Embodiment 1 is replaced with an estimation processing unit 6.

As described in Embodiment 1, the trained model MD is constructed by the model construction unit. The trained model MD is then passed to the estimation processing unit 6. The trained model MD may be stored, for example, in a storage unit provided in the estimation processing unit 6, or it may be stored in a storage unit provided separately from the estimation processing unit 6 and read out by the estimation processing unit 6 as needed. Any suitable storage means, such as the storage unit 19 shown in FIG. 1, can be used as these storage units.

Input data IN indicating information on the company whose fraud risk is to be analyzed is input to the estimation processing unit 6. The input data IN consists of, for example, the company data and the edge information described above. By inputting the input data IN to the trained model MD, the estimation processing unit 6 outputs output data OUT, which is information indicating the risk that the target company commits fraud, for example the probability that the target company commits fraud. By referring to the output data OUT, a user of the information processing apparatus 200 can recognize the risk that the target company commits fraud.
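The flow from the input data IN to the output data OUT can be sketched as follows. Everything model-specific here is an assumption made for illustration: the trained model MD is represented as a small logistic model, per-company risk scores are assumed to have already been derived from the company data, and only two cluster features are used. The function names are hypothetical.

```python
import math

def cluster_features(risk_scores, edges):
    """Hypothetical featurizer: mean risk score and mean degree centrality."""
    degree = {name: 0 for name in risk_scores}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    n = len(risk_scores)
    return {
        "mean_risk_score": sum(risk_scores.values()) / n,
        # Degree centrality of a node = degree / (n - 1).
        "mean_degree_centrality": sum(d / (n - 1) for d in degree.values()) / n,
    }

def estimate(model, features):
    """Feed the features to the trained model MD and return a probability (OUT)."""
    z = model["intercept"] + sum(w * features[k] for k, w in model["weights"].items())
    return 1.0 / (1.0 + math.exp(-z))

# Assumed model parameters and input data, for illustration only.
MD = {"intercept": -2.0,
      "weights": {"mean_risk_score": 3.0, "mean_degree_centrality": 1.0}}
IN = {"risk_scores": {"A": 0.2, "B": 0.4, "C": 0.6},
      "edges": [("A", "B"), ("B", "C"), ("C", "A")]}  # three companies trading in a cycle

OUT = estimate(MD, cluster_features(IN["risk_scores"], IN["edges"]))
print(round(OUT, 3))  # 0.55
```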

Note that the information processing apparatus according to this embodiment may also include the test processing unit 5 of the information processing apparatus 101. FIG. 25 schematically shows the configuration of an information processing apparatus 201, which is a modification of the information processing apparatus 200. Providing the test processing unit 5 makes it possible to perform the estimation process using a trained model that has been verified as described in Embodiment 1.

Note that the estimation process described above can also be performed using a trained model MD constructed by an information processing apparatus different from the one that performs the estimation process. FIG. 26 schematically shows the configuration of an information processing apparatus 210, which is a modification of the information processing apparatus according to Embodiment 2. The information processing apparatus 210 has a model storage unit 7 and an estimation processing unit 6.

The model storage unit 7 stores a trained model MD constructed by an apparatus other than the information processing apparatus 210, for example by the information processing apparatus 100. The trained model MD is provided to the information processing apparatus 210 from a different, external information processing apparatus via a communication line, a storage medium, or the like, and is stored in the model storage unit 7. Any suitable storage means, such as the storage unit 19 shown in FIG. 1, can be used as the model storage unit 7.
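Handing a trained model from the constructing apparatus to the model storage unit of another apparatus can be sketched as a serialize-and-load round trip. The JSON file format, the dictionary representation of the model, and the `ModelStore` class are assumptions for illustration; the specification does not prescribe a transfer format.

```python
import json
import os
import tempfile

def export_model(model, path):
    """On the constructing apparatus: write the trained model to a file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(model, f)

class ModelStore:
    """Hypothetical stand-in for the model storage unit 7."""

    def __init__(self):
        self._model = None

    def load(self, path):
        """Read a model that was provided via a file or storage medium."""
        with open(path, "r", encoding="utf-8") as f:
            self._model = json.load(f)

    def get(self):
        """Return the stored model for use by the estimation processing."""
        if self._model is None:
            raise RuntimeError("no trained model has been stored yet")
        return self._model

# Assumed model parameters, for illustration only.
MD = {"intercept": -2.0, "weights": {"mean_risk_score": 3.0}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_md.json")
    export_model(MD, path)   # apparatus that built the model
    store = ModelStore()
    store.load(path)         # apparatus that performs the estimation

restored = store.get()
print(restored == MD)  # True
```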

As in the information processing apparatus 200, input data IN indicating information on the company whose fraud risk is to be analyzed is input to the estimation processing unit 6. By inputting the input data IN to the trained model MD, the estimation processing unit 6 can likewise output the output data OUT, which is information indicating the risk that the target company commits fraud, for example the probability that the target company commits fraud.

According to the information processing apparatus 210, the estimation process can be performed using, as appropriate, a trained model constructed by another information processing apparatus that serves as a model construction apparatus. Accordingly, when there are multiple other information processing apparatuses, a trained model suited to the estimation process can be obtained from an appropriate one of them. In addition, the estimation process can be performed at any desired location, regardless of where the information processing apparatus that constructs the model is installed.

Other embodiments
The present invention is not limited to the above embodiments and can be modified as appropriate without departing from its spirit. For example, it goes without saying that the contents and items (account items) of the financial statements and the items of attribute information (for example, industry and business location) included in the learning data may be all of the items obtained, or only a subset selected as needed.

The processing executed by the information processing apparatus according to the above embodiments may be implemented using a semiconductor processing device including an ASIC (Application Specific Integrated Circuit). The processing may also be implemented by causing a computer system including at least one processor (e.g., a microprocessor, CPU, GPU, MPU, or DSP (Digital Signal Processor)) to execute a program. Specifically, one or more programs containing a group of instructions for causing the computer system to carry out the algorithms relating to these transmission signal processing or reception signal processing operations may be created and supplied to the computer.

These programs can be stored on various types of non-transitory computer readable media and supplied to a computer. Non-transitory computer readable media include various types of tangible storage media. Examples of non-transitory computer readable media include magnetic recording media (for example, flexible disks, magnetic tapes, and hard disk drives), magneto-optical recording media (for example, magneto-optical disks), CD-ROM (Read Only Memory), CD-R, CD-R/W, and semiconductor memories (for example, mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM), flash ROM, and RAM (Random Access Memory)). The programs may also be supplied to a computer by various types of transitory computer readable media. Examples of transitory computer readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer readable medium can supply the programs to a computer via a wired communication path, such as an electric wire or an optical fiber, or via a wireless communication path.

Reference Signs List
1 clustering processing unit
2 cluster feature amount calculation unit
3 fraud flag assigning unit
4 model construction unit
5 test processing unit
6 estimation processing unit
7 model storage unit
11 CPU
12 ROM
13 RAM
14 bus
15 input/output interface
16 input unit
17 output unit
18 communication unit
19 storage unit
20 drive
21 magnetic disk
22 optical disk
23 flexible disk
24 semiconductor memory
30 network
40 server
100, 101, 200, 201, 210 information processing apparatus
110 computer
CPR company data
EDG edge information
DAT learning data
IN input data
MD trained model
UDH fraud history information

Claims

An information processing apparatus comprising:
a clustering processing unit that reads learning data including a plurality of variables that indicate the values of a plurality of account items included in the financial statements of each company and by which the information of each company is represented as a multidimensional vector, attribute information of each company, information indicating whether each company has committed fraud, and information indicating business relationships between companies, and that performs a clustering process of clustering such that the number of nodes included in one cluster is equal to or less than a predetermined value and acquiring a network structure composed of the nodes corresponding to each cluster and edges indicating business relationships between the nodes;
a feature amount calculation unit that calculates the feature amount of each node belonging to each cluster included in the data after the clustering process, and calculates the feature amount of each cluster based on the calculated feature amounts;
a fraud flag assigning unit that assigns a fraud flag to each cluster based on the information indicating whether the nodes belonging to that cluster have committed fraud; and
a model construction unit that acquires a trained model by performing supervised learning on the data to which the fraud flags have been assigned, with the feature amounts as explanatory variables and the fraud flag as the objective variable.

The information processing apparatus according to claim 1, wherein
the feature amount of each cluster is calculated based on a statistic of the feature amounts within that cluster.

The information processing apparatus according to claim 2, wherein
the statistic is a mean, a maximum, a minimum, or a median.

The information processing apparatus according to any one of claims 1 to 3, wherein
the feature amounts of each node are:
the risk score of the company corresponding to the node;
the degree centrality of the node;
the eigenvector centrality of the node;
the product of the risk score and the degree centrality of the node;
the product of the eigenvector centrality and the degree centrality of the node;
the cluster coefficient of the node;
a local eigenvector centrality, which is the eigenvector centrality calculated in the network extending from the node to the nodes up to two edges away from it; and
the sum of the degrees of the nodes adjacent to the node.

The information processing apparatus according to claim 4, wherein
the feature amounts of each cluster include:
values calculated based on the respective statistics of the risk score, the degree centrality, the eigenvector centrality, the product of the risk score and the degree centrality, the product of the eigenvector centrality and the degree centrality, the cluster coefficient, the local eigenvector centrality, and the sum of the degrees of each node; and
a group degree centrality, which is the value obtained by dividing the total number of edges connecting nodes included in a cluster of interest to nodes included in clusters other than the cluster of interest by the total number of nodes included in the clusters other than the cluster of interest.

The information processing apparatus according to any one of claims 1 to 5, wherein
the model construction unit randomly extracts, from the clusters included in the data to which the fraud flags have been assigned, clusters to which the fraud flag is attached at a predetermined ratio and clusters to which the fraud flag is not attached at the predetermined ratio, and learns the extracted clusters,
the information processing apparatus further comprising a test processing unit that inputs, to the trained model, test data consisting of the clusters that were not extracted from the clusters included in the data to which the fraud flags have been assigned, and calculates an evaluation index of the transaction fraud detection accuracy.

The information processing apparatus according to claim 6, wherein
a plurality of first evaluation indices are calculated by performing the processing by the clustering processing unit, the feature amount calculation unit, the fraud flag assigning unit, the model construction unit, and the test processing unit a plurality of times while changing the predetermined value, and
the test processing unit determines the predetermined value based on the plurality of first evaluation indices.

The information processing apparatus according to claim 6, wherein
a plurality of second evaluation indices are calculated by repeating the processing by the model construction unit and the test processing unit a plurality of times for one predetermined value, and
the test processing unit calculates a statistic of the plurality of second evaluation indices as the second evaluation index corresponding to that one predetermined value.

The information processing apparatus according to claim 8, wherein
the statistic of the plurality of second evaluation indices is an average, a maximum, a minimum, or a median.

The information processing apparatus according to any one of claims 1 to 9, further comprising
an estimation processing unit that reads input data including a plurality of variables that indicate the values of a plurality of account items included in the financial statements of each company and by which the information of each company is represented as a multidimensional vector, attribute information of each company, and information indicating business relationships between companies, inputs the input data to the trained model, and estimates whether the company corresponding to the input data has committed fraud.

An information processing method, wherein:
a clustering processing unit reads learning data including a plurality of variables that indicate the values of a plurality of account items included in the financial statements of each company and by which the information of each company is represented as a multidimensional vector, attribute information of each company, information indicating whether each company has committed fraud, and information indicating business relationships between companies, and performs a clustering process of clustering such that the number of nodes included in one cluster is equal to or less than a predetermined value and acquiring a network structure composed of the corresponding nodes and edges indicating business relationships between the nodes;
a feature amount calculation unit calculates the feature amount of each node belonging to each cluster included in the data after the clustering process, and calculates the feature amount of each cluster based on the calculated feature amounts;
a fraud flag assigning unit assigns a fraud flag to each cluster based on the information indicating whether the nodes belonging to that cluster have committed fraud; and
a model construction unit acquires a trained model by performing supervised learning using the data to which the fraud flags have been assigned as learning data, with the feature amounts as explanatory variables and the fraud flag as the objective variable.

A program causing a computer to execute:
a process of reading input data that includes company data, which contains the multidimensional vectors corresponding to a plurality of companies, the information of each company being represented by a multidimensional vector whose components are a plurality of variables indicating the values of a plurality of account items included in the financial statements of that company, and information indicating whether each company has committed fraud, and that also includes information indicating the business relationships of the plurality of companies, and performing a clustering process of clustering such that the number of nodes included in one cluster is equal to or less than a predetermined value and acquiring a network structure composed of the nodes corresponding to each cluster and edges indicating business relationships between the nodes;
a process in which a feature amount calculation unit calculates the feature amount of each node belonging to each cluster included in the data after the clustering process, and calculates the feature amount of each cluster based on the calculated feature amounts;
a process in which a fraud flag assigning unit assigns a fraud flag to each cluster based on the information indicating whether the nodes belonging to that cluster have committed fraud; and
a process in which a model construction unit acquires a trained model by performing supervised learning using the data to which the fraud flags have been assigned as learning data, with the feature amounts as explanatory variables and the fraud flag as the objective variable.

An information processing apparatus comprising:
a model storage unit that holds a trained model acquired by a model construction apparatus, the model construction apparatus including: a clustering processing unit that reads learning data including a plurality of variables that indicate the values of a plurality of account items included in the financial statements of each company and by which the information of each company is represented as a multidimensional vector, attribute information of each company, information indicating whether each company has committed fraud, and information indicating business relationships between companies, and that performs a clustering process of clustering such that the number of nodes included in one cluster is equal to or less than a predetermined value and acquiring a network structure composed of the nodes corresponding to each cluster and edges indicating business relationships between the nodes; a feature amount calculation unit that calculates the feature amount of each node belonging to each cluster included in the data after the clustering process, and calculates the feature amount of each cluster based on the calculated feature amounts; a fraud flag assigning unit that assigns a fraud flag to each cluster based on the information indicating whether the nodes belonging to that cluster have committed fraud; and a model construction unit that acquires the trained model by performing supervised learning on the data to which the fraud flags have been assigned, with the feature amounts as explanatory variables and the fraud flag as the objective variable; and
an estimation processing unit that reads input data including a plurality of variables that indicate the values of a plurality of account items included in the financial statements of each company and by which the information of each company is represented as a multidimensional vector, attribute information of each company, and information indicating business relationships between companies, inputs the input data to the trained model, and estimates whether the company corresponding to the input data has committed fraud.