JP2020119201A

JP2020119201A - Determination device, determination method and determination program

Info

Publication number: JP2020119201A
Application number: JP2019009273A
Authority: JP
Inventors: Seira Hidano; Shinsaku Kiyomoto
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2019-01-23
Filing date: 2019-01-23
Publication date: 2020-08-06
Anticipated expiration: 2039-01-23
Also published as: JP7075362B2

Abstract

PROBLEM TO BE SOLVED: To provide a determination device, a determination method, and a determination program that can identify malicious data regardless of the distribution of evaluation data. SOLUTION: A determination device 1 includes: a calculation unit 11 that calculates, for an evaluation matrix storing each user's evaluation values for a plurality of items, a cumulative probability distribution for a sample of the evaluation values; and a determination unit 12 that determines that the populations of two different evaluation matrices are not the same when an index, based on the difference between the cumulative probability distributions of the two matrices and on the number of users in each matrix, satisfies a condition based on a predetermined significance level. SELECTED DRAWING: Figure 1

Description

The present invention relates to a device for detecting data poisoning attacks.

Collaborative filtering, as used in recommendation systems and the like, predicts evaluation values for unrated items from the evaluation data that users of the same system have given to a plurality of items. One attack against collaborative filtering is the data poisoning attack, in which an attacker poses as a legitimate user, rates items dishonestly, and thereby mixes malicious data into the evaluation data. Data poisoning attacks aim to degrade prediction performance or to raise or lower the popularity of specific items.

One countermeasure against data poisoning attacks on collaborative filtering is the t-test-based method proposed in Non-Patent Document 1. This method assumes that the evaluation values are normally distributed and uses a t-test to detect a difference between distributions, thereby determining whether additionally supplied evaluation data is malicious.

Non-Patent Document 1: B. Li, Y. Wang, A. Singh, and Y. Vorobeychik: Data Poisoning Attacks on Factorization-Based Collaborative Filtering, Proceedings of the 3rd Neural Information Processing Systems (NIPS 2016), pp. 1-13, 2016.

The above method assumes that the evaluation values of both legitimate and malicious users are normally distributed, but evaluation values do not necessarily follow a normal distribution. As a result, additionally supplied malicious data often could not be detected correctly.

An object of the present invention is to provide a determination device, a determination method, and a determination program that can identify malicious data regardless of the distribution of the evaluation data.

A determination device according to the present invention includes: a calculation unit that calculates, for an evaluation matrix storing each user's evaluation values for a plurality of items, a cumulative probability distribution for a sample of the evaluation values; and a determination unit that determines that the populations of two mutually different evaluation matrices are not the same when an index based on the difference between the cumulative probability distributions of the two evaluation matrices and on the number of users in each of the two evaluation matrices satisfies a condition based on a predetermined significance level.

The calculation unit may take the set of evaluation values stored in the evaluation matrix as the sample and calculate the cumulative probability distribution representing a histogram of the evaluation values.

The calculation unit may take mutually corresponding portions of the two evaluation matrices as the samples and calculate the cumulative probability distribution representing a histogram of the presence or absence of evaluation values in each sample.

The calculation unit may take a per-item submatrix of the evaluation matrix as the sample.

The determination unit may determine that the populations of the two evaluation matrices are not the same when the number of items satisfying the condition in the two evaluation matrices is equal to or greater than a predetermined number.

In a determination method according to the present invention, a computer executes: a calculation step of calculating, for an evaluation matrix storing each user's evaluation values for a plurality of items, a cumulative probability distribution for a sample of the evaluation values; and a determination step of determining that the populations of two mutually different evaluation matrices are not the same when an index based on the difference between the cumulative probability distributions of the two evaluation matrices and on the number of users in each of the two evaluation matrices satisfies a condition based on a predetermined significance level.

A determination program according to the present invention causes a computer to execute: a calculation step of calculating, for an evaluation matrix storing each user's evaluation values for a plurality of items, a cumulative probability distribution for a sample of the evaluation values; and a determination step of determining that the populations of two mutually different evaluation matrices are not the same when an index based on the difference between the cumulative probability distributions of the two evaluation matrices and on the number of users in each of the two evaluation matrices satisfies a condition based on a predetermined significance level.

According to the present invention, malicious data can be identified regardless of the distribution of the evaluation data.

FIG. 1 is a block diagram showing the functional configuration of the determination device according to the embodiment.
FIG. 2 is a flowchart showing the procedure of the first determination method according to the embodiment.
FIG. 3 is a flowchart showing the procedure of the second determination method according to the embodiment.

An example of an embodiment of the present invention is described below.
In the method for determining malicious data according to the present embodiment, the Kolmogorov-Smirnov test is used to determine whether an evaluation matrix supplied in addition to the legitimate evaluation matrix used for collaborative filtering is malicious data.
Unlike the t-test, which assumes a normal distribution, the Kolmogorov-Smirnov test is a nonparametric test that does not depend on the distribution. It therefore distinguishes an evaluation matrix produced by legitimate users from one produced by malicious users without assuming any shape for the distribution of the evaluation values.

Here, let M be the evaluation matrix of m legitimate users for n items. $M_{i,j}$ denotes the evaluation in the i-th row (user) and j-th column (item) of M. M is a sparse matrix and contains elements whose evaluations are unobserved.
Collaborative filtering estimates the values of the unobserved elements of M by analyzing M.

In a data poisoning attack on collaborative filtering, the attacker impersonates a legitimate user, rates items dishonestly according to the purpose of the attack, and adds malicious evaluation data to the learning system. Let M′ be the evaluation matrix of m′ malicious users for n items.
When an evaluation matrix M′ is supplied in addition to the evaluation matrix M of the legitimate users, the determination device 1 of the present embodiment determines whether M′ comes from legitimate users or was injected by malicious users.

FIG. 1 is a block diagram showing the functional configuration of the determination device 1 according to the present embodiment.
The determination device 1 is an information processing device (computer) such as a server device or a personal computer. In addition to a control unit 10 and a storage unit 20, it includes input/output devices for various data, a communication device, and the like.

The control unit 10 controls the determination device 1 as a whole and implements each function of the present embodiment by reading and executing, as appropriate, the programs stored in the storage unit 20. The control unit 10 may be a CPU.

The storage unit 20 is a storage area for the programs that cause the hardware to function as the determination device 1 and for various data, and may be a ROM, a RAM, a flash memory, a hard disk drive (HDD), or the like. Specifically, the storage unit 20 stores a program (determination program) that causes the control unit 10 to execute each function of the present embodiment, as well as intermediate data generated during processing.

The control unit 10 includes a calculation unit 11 and a determination unit 12. By operating these functional units, the control unit 10 determines whether a newly added evaluation matrix M′ derives from the same population as the existing evaluation matrix M, that is, whether it is legitimate or malicious evaluation data.

The calculation unit 11 calculates a cumulative probability distribution for a sample of evaluation values in the evaluation matrices M and M′, each of which stores every user's evaluation values for a plurality of items.
For example, the calculation unit 11 takes the set of observed evaluation values stored in each of the evaluation matrices M and M′ as a sample and calculates the cumulative probability distribution representing a histogram of the evaluation values.
Alternatively, the calculation unit 11 takes mutually corresponding portions of the two evaluation matrices M and M′ as samples and calculates the cumulative probability distribution representing a histogram of the presence or absence of evaluation values in each sample. Specifically, the calculation unit 11 may take a per-item submatrix of the evaluation matrices M and M′ as the sample.

The determination unit 12 determines that the populations of the two mutually different evaluation matrices M and M′ are not the same when an index based on the difference between their cumulative probability distributions and on the numbers of users m and m′ in M and M′ satisfies a condition based on a predetermined significance level.
The determination unit 12 may also determine that the populations of the two evaluation matrices M and M′ are not the same when the number of subsets (the number of items) satisfying the condition in the two matrices is equal to or greater than a predetermined number.

Next, the procedure by which the determination device 1 determines malicious data is described in detail.
The determination device 1 employs, for example, the first determination method or the second determination method described below. Alternatively, the determination device 1 may execute both methods and determine that the evaluation matrix M′ is malicious evaluation data when both methods determine that the populations are not the same.

[First determination method]
In the first determination method, the determination device 1 uses the Kolmogorov-Smirnov test to determine whether M′ is an evaluation matrix of malicious users, based on the difference between the cumulative probability distributions of the observed evaluation values contained in the evaluation matrices M and M′.

FIG. 2 is a flowchart showing the procedure of the first determination method according to the present embodiment.
In applying the Kolmogorov-Smirnov test, let X be the sample of evaluation values of the legitimate users and X′ the sample of evaluation values of the malicious users. The null hypothesis is "the populations of X and X′ are the same."

In step S1, the calculation unit 11 calculates the cumulative probability distribution P(x) of the sample X from the evaluation matrix M of the legitimate users. The calculation unit 11 also calculates the cumulative probability distribution Q(x) of the sample X′ from the evaluation matrix M′ of the malicious users.
Here, x is an observed evaluation value contained in the evaluation matrices M and M′. For example, when the elements of M and M′ are a mixture of actual evaluation values 1 to 5 for each item and 0 indicating no evaluation, the calculation unit 11 calculates the cumulative probability distribution from the histogram of the evaluation values x (= 1 to 5), excluding the 0 entries.
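
Step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `cumulative_distribution`, the list-of-lists matrix layout, and the sample matrix are assumptions made here, and every observed value is assumed to lie in `levels`.

```python
from collections import Counter

def cumulative_distribution(matrix, levels=(1, 2, 3, 4, 5), missing=0):
    """Return {x: P(x)}, the empirical CDF of the observed ratings in `matrix`.

    Entries equal to `missing` (0, meaning "no evaluation") are excluded.
    """
    observed = [v for row in matrix for v in row if v != missing]
    if not observed:
        raise ValueError("matrix contains no observed ratings")
    counts = Counter(observed)      # histogram of the rating values
    total = len(observed)
    cdf, running = {}, 0
    for x in sorted(levels):
        running += counts.get(x, 0)
        cdf[x] = running / total    # fraction of ratings <= x
    return cdf

# Hypothetical 3-user x 3-item matrix; 0 marks an unobserved rating.
M = [
    [5, 0, 3],
    [4, 4, 0],
    [0, 1, 5],
]
P = cumulative_distribution(M)
```

The same function applied to M′ would give Q(x), so both distributions are defined on the same rating levels.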

In step S2, the determination unit 12 calculates the Kolmogorov-Smirnov statistic D from the cumulative probability distributions P(x) and Q(x) by the following formula.
$D = \max_x |P(x) - Q(x)|$
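
The statistic D of step S2 amounts to one pass over the rating levels. The sketch below assumes both distributions are given as dictionaries keyed by rating level; the function name `ks_statistic` and the numeric values are illustrative only.

```python
def ks_statistic(P, Q):
    """Return D = max_x |P(x) - Q(x)| for two CDFs on the same rating levels."""
    if P.keys() != Q.keys():
        raise ValueError("both CDFs must be defined on the same rating levels")
    return max(abs(P[x] - Q[x]) for x in P)

# Hypothetical CDFs: Q puts far more mass on the lowest rating than P does.
P = {1: 0.10, 2: 0.25, 3: 0.55, 4: 0.85, 5: 1.0}
Q = {1: 0.60, 2: 0.60, 3: 0.65, 4: 0.70, 5: 1.0}
D = ks_statistic(P, Q)  # largest gap is at x = 1
```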

In step S3, the determination unit 12 calculates, from the number of legitimate users m, the number of malicious users m′, and the statistic D, the value $D\,[mm'/(m+m')]^{1/2}$ as an index for determining malicious data.
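
As a worked example of the index, take the purely hypothetical values m = 1000, m′ = 50, and D = 0.2 (none of these numbers appear in the source):

```latex
\frac{m m'}{m + m'} = \frac{1000 \cdot 50}{1000 + 50} \approx 47.62,
\qquad
D \left[ \frac{m m'}{m + m'} \right]^{1/2} \approx 0.2 \times 6.901 \approx 1.380
```

Because the scaling factor grows with both user counts, the same gap D yields a larger index when more users contribute to each matrix.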

In step S4, the determination unit 12 determines, for a significance level α, whether the value of $D\,[mm'/(m+m')]^{1/2}$ is greater than $K_\alpha$, where $K_\alpha$ is the number satisfying $\Pr[D\,[mm'/(m+m')]^{1/2} \le K_\alpha] = 1 - \alpha$. If the determination is YES, the process proceeds to step S5; if NO, the process proceeds to step S6.
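
The text does not say how $K_\alpha$ is obtained. One common choice, assumed here, is the limiting Kolmogorov distribution of the scaled two-sample statistic, which is exact only asymptotically and for continuous data, so it tends to be conservative for discrete ratings. The function names below are illustrative.

```python
import math

def kolmogorov_cdf(x, terms=100):
    """Limiting CDF: Pr[K <= x] = 1 - 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 x^2)."""
    if x <= 0:
        return 0.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
            for k in range(1, terms + 1))
    return 1.0 - 2.0 * s

def critical_value(alpha, lo=0.2, hi=5.0, iters=100):
    """Find K_alpha with Pr[K <= K_alpha] = 1 - alpha by bisection."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    target = 1.0 - alpha
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if kolmogorov_cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def rejects_null(D, m, m_prime, alpha=0.05):
    """Steps S3-S4: compare the scaled statistic with K_alpha."""
    index = D * math.sqrt(m * m_prime / (m + m_prime))
    return index > critical_value(alpha)

k_05 = critical_value(0.05)
```

Under this assumption, `critical_value(0.05)` comes out close to 1.358 and `critical_value(0.01)` close to 1.628.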

In step S5, the determination unit 12 rejects the null hypothesis and determines that the populations of X and X′ are not the same, that is, that the evaluation matrix M′ is malicious data injected by malicious users.

In step S6, the determination unit 12 determines that the populations of X and X′ are the same and that the evaluation matrix M′ is legitimate data supplied by legitimate users.

[Second determination method]
In the second determination method, the determination device 1 uses the Kolmogorov-Smirnov test to determine whether M′ is an evaluation matrix of malicious users, based on the difference in how evaluations are assigned to each item in the evaluation matrices M and M′.

FIG. 3 is a flowchart showing the procedure of the second determination method according to the present embodiment.
In applying the Kolmogorov-Smirnov test, let $X_j$ be the sample of evaluation values given to the j-th item by the legitimate users and $X_j'$ the sample of evaluation values given to the j-th item by the malicious users. Here the evaluation value is binary: 1 if the user has rated the j-th item, regardless of how high or low the rating is, and 0 if the user has not. The null hypothesis is "the populations of $X_j$ and $X_j'$ are the same."

In step S11, the control unit 10 initializes to 0 both the index j over the n items and the count k of items used for determining malicious data.

In step S12, the determination unit 12 increments the index j (j = j + 1).

In step S13, the calculation unit 11 obtains the cumulative probability distribution P(x) of the sample $X_j$ from the evaluation matrix M of the legitimate users. The calculation unit 11 also obtains the cumulative probability distribution Q(x) of the sample $X_j'$ from the evaluation matrix M′ of the malicious users.
Here, x is the binary evaluation value (0 or 1) described above. For example, when the elements of M and M′ are a mixture of actual evaluation values 1 to 5 for each item and 0 indicating no evaluation, the calculation unit 11 obtains the cumulative probability distribution from the histogram of "not rated" (x = 0) and "rated" (x = 1).
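
For a binary sample the cumulative distribution has only two points: P(0) is the fraction of users who did not rate the item, and P(1) is always 1. The sketch below illustrates this; the function name `presence_cdf`, the 0-based item index (the text counts items from 1), and the sample matrices are assumptions.

```python
def presence_cdf(matrix, j, missing=0):
    """Return (P(0), P(1)) for the 0/1 'rated?' indicator of item j (0-based)."""
    column = [row[j] for row in matrix]
    if not column:
        raise ValueError("matrix has no users")
    unrated = sum(1 for v in column if v == missing)
    return unrated / len(column), 1.0   # P(1) is always 1

# Hypothetical data: the attackers rate only the second item.
M = [[5, 0, 3], [4, 4, 0], [0, 1, 5], [2, 0, 4]]
M_prime = [[0, 5, 0], [0, 5, 0]]

P0, P1 = presence_cdf(M, 1)
Q0, Q1 = presence_cdf(M_prime, 1)
D = max(abs(P0 - Q0), abs(P1 - Q1))    # reduces to |P(0) - Q(0)|
```

Since P(1) = Q(1) = 1, the statistic for item j is simply the difference between the two matrices' fractions of users who left the item unrated.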

In step S14, the determination unit 12 calculates the Kolmogorov-Smirnov statistic D from the cumulative probability distributions P(x) and Q(x) by the following formula.
$D = \max_x |P(x) - Q(x)|$

In step S15, the determination unit 12 calculates, from the number of legitimate users m, the number of malicious users m′, and the statistic D, the value $D\,[mm'/(m+m')]^{1/2}$ as an index for determining malicious data.

In step S16, the determination unit 12 determines, for a significance level α, whether the value of $D\,[mm'/(m+m')]^{1/2}$ is greater than $K_\alpha$, where $K_\alpha$ is the number satisfying $\Pr[D\,[mm'/(m+m')]^{1/2} \le K_\alpha] = 1 - \alpha$. If the determination is YES, the process proceeds to step S17; if NO, the process proceeds to step S18.

In step S17, the determination unit 12 rejects the null hypothesis, determines that the populations of $X_j$ and $X_j'$ are not the same, and increments the item count k (k = k + 1).

In step S18, the determination unit 12 determines whether the item count k is equal to or greater than a threshold t. If the determination is YES, the process proceeds to step S19; if NO, the process proceeds to step S20.

In step S19, the determination unit 12 determines that the evaluation matrix M′ is malicious data injected by malicious users, and ends the process.

In step S20, the determination unit 12 determines whether the index j equals the number of items n, that is, whether all items have been tested. If the determination is YES, the process proceeds to step S21; if NO, the process returns to step S12.

In step S21, the determination unit 12 determines that the evaluation matrix M′ is legitimate data supplied by legitimate users.
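
Steps S11 to S21 can be sketched as a single loop with an early exit. The function name `second_method`, the list-of-lists layout with 0 as "no evaluation", and the test data are assumptions. The critical value is passed in as `k_alpha`; the 1.358 used below is the well-known asymptotic value for α = 0.05, which the source does not itself specify.

```python
import math

def second_method(M, M_prime, k_alpha, t, missing=0):
    """Return True if M_prime is judged malicious by the per-item test."""
    m, m_prime = len(M), len(M_prime)
    n = len(M[0])                                  # number of items
    scale = math.sqrt(m * m_prime / (m + m_prime))
    k = 0                                          # S11: count of rejecting items
    for j in range(n):                             # S12 / S20: visit each item
        # S13: for 0/1 data the CDF is fixed by the unrated fraction
        p0 = sum(1 for row in M if row[j] == missing) / m
        q0 = sum(1 for row in M_prime if row[j] == missing) / m_prime
        D = abs(p0 - q0)                           # S14
        if D * scale > k_alpha:                    # S15-S16
            k += 1                                 # S17
        if k >= t:                                 # S18
            return True                            # S19: malicious
    return False                                   # S21: legitimate

# Hypothetical data: 40 legitimate users, 20 attackers rating only item 1.
M = [[3, 0, 4] if i % 2 == 0 else [0, 2, 0] for i in range(40)]
attack = [[0, 5, 0] for _ in range(20)]
flagged = second_method(M, attack, k_alpha=1.358, t=2)
benign = second_method(M, M[:20], k_alpha=1.358, t=2)
```

Returning as soon as k reaches t mirrors the flowchart, which ends the process at step S19 without testing the remaining items.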

According to the present embodiment, the determination device 1 uses the Kolmogorov-Smirnov test, a nonparametric test, to determine whether two evaluation matrices derive from the same population, based on the cumulative probability distributions of samples of the evaluation values stored in the two matrices. The determination device 1 can thus determine whether an added evaluation matrix is malicious data without assuming any shape for the distributions of the evaluation values of legitimate and malicious users.
Therefore, the determination device 1 can limit the effect of data poisoning attacks even in a collaborative filtering system that handles evaluation data not following a specific distribution shape such as a normal distribution.

By taking the set of evaluation values stored in an evaluation matrix as the sample and calculating the cumulative probability distribution representing a histogram over the observed evaluation values, the determination device 1 can appropriately determine malicious data based on the similarity of the distributions of evaluation values.

By taking mutually corresponding portions of the two evaluation matrices as the samples and calculating the cumulative probability distribution representing a histogram of the presence or absence of evaluation values in each sample, the determination device 1 can appropriately determine malicious data based on the similarity of samples that focus on a specific portion of the evaluation matrix (for example, a group of items).

By taking a per-item submatrix of the evaluation matrix as the sample, the determination device 1 can apply the Kolmogorov-Smirnov test not only to the distribution of the evaluation data as a whole but also to the distribution of evaluation values for each item. Thus, even if an attacker chooses malicious evaluation values so that their distribution is identical or similar to that of the legitimate users' evaluation values, the determination device 1 can detect a data poisoning attack in which, for example, malicious users rate only specific items, and can eliminate the malicious data.
In this case, the determination device 1 may determine that the populations of the two evaluation matrices are also not the same, and that malicious users' evaluation data has been added, when the number of items for which the populations were determined not to be the same is equal to or greater than a predetermined number. This suppresses false detections caused by judging on a specific item alone, so that malicious data can be determined appropriately.

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. The effects described for the embodiment merely list the most favorable effects arising from the present invention, and the effects of the present invention are not limited to those described.

The determination method performed by the determination device 1 is implemented in software. When it is implemented in software, the programs constituting the software are installed on an information processing device (computer). These programs may be recorded on removable media such as a CD-ROM and distributed to users, or may be distributed by being downloaded to users' computers via a network. Furthermore, these programs may be provided to users' computers as a Web service via a network without being downloaded.

1 determination device
10 control unit
11 calculation unit
12 determination unit
20 storage unit

Claims

1. A determination device comprising:
a calculation unit that calculates, for an evaluation matrix storing each user's evaluation values for a plurality of items, a cumulative probability distribution for a sample of the evaluation values; and
a determination unit that determines that the populations of two mutually different evaluation matrices are not the same when an index based on the difference between the cumulative probability distributions of the two evaluation matrices and on the number of users in each of the two evaluation matrices satisfies a condition based on a predetermined significance level.

2. The determination device according to claim 1, wherein the calculation unit takes the set of evaluation values stored in the evaluation matrix as the sample and calculates the cumulative probability distribution representing a histogram of the evaluation values.

3. The determination device according to claim 1, wherein the calculation unit takes mutually corresponding portions of the two evaluation matrices as the samples and calculates the cumulative probability distribution representing a histogram of the presence or absence of evaluation values in each sample.

4. The determination device according to claim 3, wherein the calculation unit takes a per-item submatrix of the evaluation matrix as the sample.

5. The determination device according to claim 4, wherein the determination unit determines that the populations of the two evaluation matrices are not the same when the number of items satisfying the condition in the two evaluation matrices is equal to or greater than a predetermined number.

6. A determination method in which a computer executes:
a calculation step of calculating, for an evaluation matrix storing each user's evaluation values for a plurality of items, a cumulative probability distribution for a sample of the evaluation values; and
a determination step of determining that the populations of two mutually different evaluation matrices are not the same when an index based on the difference between the cumulative probability distributions of the two evaluation matrices and on the number of users in each of the two evaluation matrices satisfies a condition based on a predetermined significance level.

7. A determination program for causing a computer to execute:
a calculation step of calculating, for an evaluation matrix storing each user's evaluation values for a plurality of items, a cumulative probability distribution for a sample of the evaluation values; and
a determination step of determining that the populations of two mutually different evaluation matrices are not the same when an index based on the difference between the cumulative probability distributions of the two evaluation matrices and on the number of users in each of the two evaluation matrices satisfies a condition based on a predetermined significance level.