JP7075362B2 - Judgment device, judgment method and judgment program

Info

Publication number: JP7075362B2
Application number: JP2019009273A
Authority: JP
Inventors: 清良披田野; 晋作清本
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2019-01-23
Filing date: 2019-01-23
Publication date: 2022-05-25
Anticipated expiration: 2039-01-23
Also published as: JP2020119201A

Description

The present invention relates to a device that determines whether a data poisoning attack has occurred.

Collaborative filtering, used in recommender systems and the like, predicts evaluation values for unrated items from the evaluation data that users of the same system have given to a plurality of items. One attack on collaborative filtering is the data poisoning attack, in which an attacker poses as a legitimate user, rates items fraudulently, and mixes malicious data into the evaluation data. The aim of a data poisoning attack is to degrade prediction performance, or to raise or lower the popularity of a specific item.

One countermeasure against data poisoning attacks on collaborative filtering is the method using the t-test proposed in Non-Patent Document 1. This method assumes that the evaluation values are normally distributed and uses the t-test to detect a difference between distributions, thereby determining that additionally given evaluation data is malicious.

Non-Patent Document 1: B. Li, Y. Wang, A. Singh, and Y. Vorobeychik: Data Poisoning Attacks on Factorization-Based Collaborative Filtering, Proceedings of the 3rd Neural Information Processing Systems (NIPS 2016), pp. 1-13, 2016.

The above method assumes that the evaluation values of legitimate users and malicious users are normally distributed, but evaluation values do not necessarily follow a normal distribution. As a result, additionally given malicious data often could not be detected correctly.

An object of the present invention is to provide a determination device, a determination method, and a determination program capable of identifying malicious data regardless of the distribution of the evaluation data.

The determination device according to the present invention includes: a calculation unit that calculates, for an evaluation matrix storing each user's evaluation values for a plurality of items, a cumulative probability distribution for a sample of the evaluation values; and a determination unit that determines that the populations of two mutually different evaluation matrices are not the same when an index, based on the difference between the cumulative probability distributions of the two evaluation matrices and on the number of users in each of the two evaluation matrices, satisfies a condition based on a predetermined significance level.

The calculation unit may take the set of evaluation values stored in the evaluation matrix as the sample and calculate the cumulative probability distribution representing a histogram of the evaluation values.

The calculation unit may take mutually corresponding parts of the two evaluation matrices as the samples and calculate the cumulative probability distribution representing a histogram of the presence or absence of evaluation values in each sample.

The calculation unit may take a per-item submatrix of the evaluation matrix as the sample.

The determination unit may determine that the populations of the two evaluation matrices are not the same when the number of items satisfying the condition in the two evaluation matrices is equal to or greater than a predetermined number.

In the determination method according to the present invention, a computer executes: a calculation step of calculating, for an evaluation matrix storing each user's evaluation values for a plurality of items, a cumulative probability distribution for a sample of the evaluation values; and a determination step of determining that the populations of two mutually different evaluation matrices are not the same when an index, based on the difference between the cumulative probability distributions of the two evaluation matrices and on the number of users in each of the two evaluation matrices, satisfies a condition based on a predetermined significance level.

The determination program according to the present invention causes a computer to execute: a calculation step of calculating, for an evaluation matrix storing each user's evaluation values for a plurality of items, a cumulative probability distribution for a sample of the evaluation values; and a determination step of determining that the populations of two mutually different evaluation matrices are not the same when an index, based on the difference between the cumulative probability distributions of the two evaluation matrices and on the number of users in each of the two evaluation matrices, satisfies a condition based on a predetermined significance level.

According to the present invention, malicious data can be identified regardless of the distribution of the evaluation data.

FIG. 1 is a block diagram showing the functional configuration of the determination device according to the embodiment.
FIG. 2 is a flowchart showing the procedure of the first determination method according to the embodiment.
FIG. 3 is a flowchart showing the procedure of the second determination method according to the embodiment.

An example of an embodiment of the present invention is described below.
In the method for determining malicious data according to the present embodiment, the Kolmogorov-Smirnov test is used to determine whether an additionally given evaluation matrix is malicious data relative to the legitimate evaluation matrix used for collaborative filtering.
Unlike the t-test, which assumes a normal distribution, the Kolmogorov-Smirnov test is a non-parametric test that does not depend on the distribution. The evaluation matrix of legitimate users can therefore be distinguished from the evaluation matrix of malicious users without assuming the shape of the distribution of evaluation values.

Here, let M be the evaluation matrix of m legitimate users for n items. $M_{i,j}$ denotes the evaluation in the i-th row (user) and j-th column (item) of M. Note that M is a sparse matrix and contains elements whose evaluation is unobserved.
In collaborative filtering, the values of the unobserved elements of M are estimated by analyzing the evaluation matrix M.
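To make the shape of M concrete, the following is a minimal sketch of a sparse evaluation matrix in plain Python. The toy data and the helper names `observed_entries` and `sparsity` are illustrative assumptions; the use of 0 to mark an unobserved element follows the example given later in the description.

```python
# Hypothetical toy data: rows are users, columns are items, 0 marks "not rated".
M = [
    [5, 0, 3, 0],
    [4, 2, 0, 0],
    [0, 1, 0, 5],
]

def observed_entries(matrix):
    """Return (user, item, value) triples for the observed elements only."""
    return [
        (i, j, value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    ]

def sparsity(matrix):
    """Fraction of elements that are unobserved."""
    total = sum(len(row) for row in matrix)
    return 1 - len(observed_entries(matrix)) / total
```

Collaborative filtering would estimate the elements that `observed_entries` skips; the determination device described below only compares the observed structure of two such matrices.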

In a data poisoning attack on collaborative filtering, the attacker impersonates a legitimate user, rates items fraudulently according to the purpose of the attack, and adds malicious evaluation data to the learning system. Let M' be the evaluation matrix of m' malicious users for n items.
When an evaluation matrix M' is additionally given to the evaluation matrix M of legitimate users, the determination device 1 of the present embodiment determines whether M' comes from legitimate users or was mixed in by malicious users.

FIG. 1 is a block diagram showing the functional configuration of the determination device 1 according to the present embodiment.
The determination device 1 is an information processing device (computer) such as a server device or a personal computer. In addition to a control unit 10 and a storage unit 20, it includes input/output devices for various data, communication devices, and the like.

The control unit 10 controls the determination device 1 as a whole and realizes each function of the present embodiment by reading and executing, as appropriate, the programs stored in the storage unit 20. The control unit 10 may be a CPU.

The storage unit 20 is a storage area for the programs that make the hardware function as the determination device 1 and for various data. It may be a ROM, RAM, flash memory, hard disk drive (HDD), or the like. Specifically, the storage unit 20 stores the program (determination program) that causes the control unit 10 to execute each function of the present embodiment, as well as intermediate data produced during processing.

The control unit 10 includes a calculation unit 11 and a determination unit 12. By operating these functional units, the control unit 10 determines whether a newly added evaluation matrix M' is derived from the same population as the existing evaluation matrix M, that is, whether it is legitimate or malicious evaluation data.

The calculation unit 11 calculates a cumulative probability distribution for a sample of evaluation values in each of the evaluation matrices M and M', which store each user's evaluation values for a plurality of items.
For example, the calculation unit 11 takes the set of observed evaluation values stored in each of M and M' as the sample and calculates the cumulative probability distribution representing a histogram of the evaluation values.
Alternatively, the calculation unit 11 takes mutually corresponding parts of the two evaluation matrices M and M' as the samples and calculates the cumulative probability distribution representing a histogram of the presence or absence of evaluation values in those samples. Specifically, the calculation unit 11 may take a per-item submatrix of M and M' as the sample.
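The two kinds of sample described above can be sketched as follows. This is an illustration under stated assumptions, not the embodiment's implementation: the function name `cumulative_distribution`, the toy matrix, the value set 1 to 5, and the 0-based item index are all hypothetical.

```python
def cumulative_distribution(sample, values):
    """Return {x: P(X <= x)} for each x in `values`, built from a histogram of `sample`."""
    if not sample:
        raise ValueError("sample must contain at least one value")
    histogram = {x: 0 for x in values}
    for v in sample:
        histogram[v] += 1  # raises KeyError if v lies outside the assumed value set
    cdf, running = {}, 0
    for x in sorted(histogram):
        running += histogram[x]
        cdf[x] = running / len(sample)
    return cdf

# Hypothetical toy matrix: rows are users, columns are items, 0 means "not rated".
M = [[5, 0, 3], [4, 2, 0]]

# First kind of sample: all observed evaluation values in the matrix.
observed = [v for row in M for v in row if v != 0]
# Second kind of sample: presence (1) or absence (0) of an evaluation for item 0.
presence = [1 if row[0] != 0 else 0 for row in M]

P_values = cumulative_distribution(observed, range(1, 6))
P_presence = cumulative_distribution(presence, (0, 1))
```

Both samples go through the same histogram-to-cumulative step; only the variable being counted differs.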

The determination unit 12 determines that the populations of the two mutually different evaluation matrices M and M' are not the same when an index, based on the difference between their cumulative probability distributions and on their numbers of users m and m', satisfies a condition based on a predetermined significance level.
The determination unit 12 may also determine that the populations of M and M' are not the same when the number of subsets (the number of items) satisfying the condition in the two evaluation matrices is equal to or greater than a predetermined number.

Next, the procedure by which the determination device 1 determines malicious data is described in detail.
The determination device 1 adopts, for example, the first determination method or the second determination method described below. The determination device 1 may also execute both methods and determine that the evaluation matrix M' is malicious evaluation data when both methods determine that the populations are not the same.

[First determination method]
In the first determination method, the determination device 1 uses the Kolmogorov-Smirnov test to determine whether M' is an evaluation matrix of malicious users, based on the difference between the cumulative probability distributions of the observed evaluation values contained in M and in M'.

FIG. 2 is a flowchart showing the procedure of the first determination method according to the present embodiment.
In applying the Kolmogorov-Smirnov test, let X be the sample of evaluation values of legitimate users and X' the sample of evaluation values of malicious users. The null hypothesis is "the populations of X and X' are the same".

In step S1, the calculation unit 11 calculates the cumulative probability distribution P(x) of the sample X using the evaluation matrix M of the legitimate users. The calculation unit 11 also calculates the cumulative probability distribution Q(x) of the sample X' using the evaluation matrix M' of the malicious users.
Here, x is an observed evaluation value contained in M and M'. For example, when the elements of M and M' are a mixture of actual evaluation values 1 to 5 for each item and 0 indicating no evaluation, the calculation unit 11 calculates the cumulative probability distributions from the histogram of the evaluation values x (= 1 to 5), excluding the 0 entries.

In step S2, the determination unit 12 calculates the Kolmogorov-Smirnov statistic D from the cumulative probability distributions P(x) and Q(x) by the following equation.
$$D = \max_x \lvert P(x) - Q(x) \rvert$$
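A minimal sketch of the statistic in step S2, assuming both cumulative distributions are stored as dictionaries over the same set of evaluation values. The function name `ks_statistic` and the numbers in P and Q are made up for illustration.

```python
def ks_statistic(p, q):
    """Largest absolute gap between two cumulative distributions on a shared grid."""
    if p.keys() != q.keys():
        raise ValueError("both distributions must be defined on the same values")
    return max(abs(p[x] - q[x]) for x in p)

# Hypothetical cumulative distributions over evaluation values 1 to 5.
P = {1: 0.1, 2: 0.3, 3: 0.6, 4: 0.9, 5: 1.0}  # legitimate users
Q = {1: 0.0, 2: 0.0, 3: 0.1, 4: 0.2, 5: 1.0}  # added users, mostly top ratings

D = ks_statistic(P, Q)  # the largest gap occurs at x = 4
```

Because evaluation values are discrete, the maximum only needs to be checked at the finitely many values x, which is what the dictionary form exploits.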

In step S3, the determination unit 12 calculates $D\,[mm'/(m+m')]^{1/2}$ from the number of legitimate users m, the number of malicious users m', and the statistic D, as the index for determining malicious data.
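As a purely illustrative calculation of this index (the numbers are assumed, not taken from the embodiment), take m = 1000 legitimate users, m' = 50 added users, and D = 0.3:

$$D\left[\frac{mm'}{m+m'}\right]^{1/2} = 0.3 \times \left[\frac{1000 \times 50}{1000 + 50}\right]^{1/2} \approx 0.3 \times 6.90 \approx 2.07$$

The scaling factor grows with the user counts, so the same gap D counts as stronger evidence against the null hypothesis when both matrices contain more users.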

In step S4, the determination unit 12 determines, for a significance level α, whether the value of $D\,[mm'/(m+m')]^{1/2}$ is greater than $K_\alpha$, where $K_\alpha$ is the number satisfying $\Pr\bigl[D\,[mm'/(m+m')]^{1/2} \le K_\alpha\bigr] = 1 - \alpha$. If the determination is YES, the process proceeds to step S5; if NO, the process proceeds to step S6.
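The description does not say how $K_\alpha$ is obtained. One common choice, shown here only as an assumption, is the first-term approximation of the limiting Kolmogorov distribution, $K_\alpha \approx \sqrt{-\tfrac{1}{2}\ln(\alpha/2)}$. It is a large-sample approximation and is only approximate for discrete evaluation values. The function names `k_alpha` and `rejects_null` are hypothetical.

```python
import math

def k_alpha(alpha):
    """Approximate critical value K_alpha (assumed large-sample formula, not from the patent)."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return math.sqrt(-0.5 * math.log(alpha / 2))

def rejects_null(index, alpha=0.05):
    """Step S4: True when the index D * sqrt(m m' / (m + m')) exceeds K_alpha."""
    return index > k_alpha(alpha)
```

With this approximation, α = 0.05 gives a threshold of about 1.36 and α = 0.01 about 1.63, so a smaller significance level demands a larger index before the null hypothesis is rejected.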

In step S5, the determination unit 12 rejects the null hypothesis and determines that the populations of X and X' are not the same, that is, that the evaluation matrix M' is malicious data mixed in by malicious users.

In step S6, the determination unit 12 determines that the populations of X and X' are the same and that the evaluation matrix M' is legitimate data given by legitimate users.
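Steps S1 to S6 can be put together as the sketch below. It follows the description in using the user counts m and m' in the index. The function name `first_method`, the default α, the value set 1 to 5, and the large-sample approximation used for $K_\alpha$ are assumptions, and each matrix is assumed to contain at least one observed evaluation value.

```python
import math

def first_method(M, M_prime, alpha=0.05, values=(1, 2, 3, 4, 5)):
    """Return True when M' is judged malicious (null hypothesis rejected)."""

    def cdf(matrix):
        # S1: cumulative distribution of the observed values (0 = not rated is excluded).
        sample = [v for row in matrix for v in row if v != 0]
        return [sum(1 for v in sample if v <= x) / len(sample) for x in values]

    P, Q = cdf(M), cdf(M_prime)
    # S2: Kolmogorov-Smirnov statistic.
    D = max(abs(p - q) for p, q in zip(P, Q))
    # S3: index based on the numbers of users (rows) m and m'.
    m, m_prime = len(M), len(M_prime)
    index = D * math.sqrt(m * m_prime / (m + m_prime))
    # S4: compare with K_alpha (assumed large-sample approximation).
    k_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    # S5 / S6: reject or keep the null hypothesis.
    return index > k_alpha
```

For example, 100 users who spread their ratings evenly over 1 to 5 compared with 50 added users who give only 5s yields D = 0.8 and an index well above the threshold, so the added matrix is flagged.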

[Second determination method]
In the second determination method, the determination device 1 uses the Kolmogorov-Smirnov test to determine whether M' is an evaluation matrix of malicious users, based on the difference in how evaluations are assigned to each item in M and in M'.

FIG. 3 is a flowchart showing the procedure of the second determination method according to the present embodiment.
In applying the Kolmogorov-Smirnov test, let $X_j$ be the sample of evaluation values given by legitimate users to the j-th item and $X_j'$ the sample of evaluation values given by malicious users to the j-th item. Here the evaluation value is binary, 0 or 1: it is 1 if the user has evaluated the j-th item, regardless of whether the rating is high or low, and 0 otherwise. The null hypothesis is "the populations of $X_j$ and $X_j'$ are the same".
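The binary sample for one item can be sketched as below. The function name `presence_sample` is hypothetical, 0 is assumed to mark an unrated element, and the item index is 0-based here whereas the description counts items from 1.

```python
def presence_sample(matrix, j):
    """Binary sample for item j: 1 if the user rated it (any value), 0 otherwise."""
    return [1 if row[j] != 0 else 0 for row in matrix]

# Hypothetical toy matrix: three users, three items.
M = [
    [5, 0, 3],
    [0, 0, 1],
    [4, 2, 0],
]
X_0 = presence_sample(M, 0)  # who rated the first item
X_1 = presence_sample(M, 1)  # who rated the second item
```

Note that a rating of 1 and a rating of 5 map to the same value here; only the act of rating the item is recorded.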

In step S11, the control unit 10 initializes to 0 the index j over the n items and the number k of items used for determining malicious data.

In step S12, the determination unit 12 increments the index j (j = j + 1).

In step S13, the calculation unit 11 obtains the cumulative probability distribution P(x) of the sample $X_j$ using the evaluation matrix M of the legitimate users. The calculation unit 11 also obtains the cumulative probability distribution Q(x) of the sample $X_j'$ using the evaluation matrix M' of the malicious users.
Here, x is the binary evaluation value, 0 or 1, described above. For example, when the elements of M and M' are a mixture of actual evaluation values 1 to 5 for each item and 0 indicating no evaluation, the calculation unit 11 obtains the cumulative probability distributions from the histogram of "not evaluated" (x = 0) and "evaluated" (x = 1).

In step S14, the determination unit 12 calculates the Kolmogorov-Smirnov statistic D from the cumulative probability distributions P(x) and Q(x) by the following equation.
$$D = \max_x \lvert P(x) - Q(x) \rvert$$
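Because the sample in this method takes only the values 0 and 1, both cumulative distributions reach 1 at x = 1, so the maximum can only be attained at x = 0. Under that reading, the statistic reduces to the gap between the fractions of users who left the j-th item unrated:

$$D = \max_{x \in \{0,1\}} \lvert P(x) - Q(x) \rvert = \lvert P(0) - Q(0) \rvert, \qquad P(1) = Q(1) = 1$$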

In step S15, the determination unit 12 calculates $D\,[mm'/(m+m')]^{1/2}$ from the number of legitimate users m, the number of malicious users m', and the statistic D, as the index for determining malicious data.

In step S16, the determination unit 12 determines, for a significance level α, whether the value of $D\,[mm'/(m+m')]^{1/2}$ is greater than $K_\alpha$, where $K_\alpha$ is the number satisfying $\Pr\bigl[D\,[mm'/(m+m')]^{1/2} \le K_\alpha\bigr] = 1 - \alpha$. If the determination is YES, the process proceeds to step S17; if NO, the process proceeds to step S18.

In step S17, the determination unit 12 rejects the null hypothesis, determines that the populations of $X_j$ and $X_j'$ are not the same, and increments the number of determination items k (k = k + 1).

In step S18, the determination unit 12 determines whether the number of determination items k is equal to or greater than a threshold t. If the determination is YES, the process proceeds to step S19; if NO, the process proceeds to step S20.

In step S19, the determination unit 12 determines that the evaluation matrix M' is malicious data mixed in by malicious users, and ends the process.

In step S20, the determination unit 12 determines whether the index j is equal to the number of items n, that is, whether the test has been performed for all items. If the determination is YES, the process proceeds to step S21; if NO, the process returns to step S12.

In step S21, the determination unit 12 determines that the evaluation matrix M' is legitimate data given by legitimate users.
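The loop of steps S11 to S21 can be sketched as follows. The function name `second_method`, the default α, and the large-sample approximation for $K_\alpha$ are assumptions; the threshold t is supplied by the caller because the description does not fix its value. Items are indexed from 0 in the code, and 0 is assumed to mark an unrated element.

```python
import math

def second_method(M, M_prime, t, alpha=0.05):
    """Return True when at least t items reject the per-item null hypothesis."""
    m, m_prime = len(M), len(M_prime)
    n = len(M[0])                                       # number of items
    scale = math.sqrt(m * m_prime / (m + m_prime))
    k_alpha = math.sqrt(-0.5 * math.log(alpha / 2))     # assumed approximation
    k = 0                                               # S11: items that rejected so far
    for j in range(n):                                  # S12 / S20: visit every item
        # S13: for binary data only P(0) and Q(0) matter, since P(1) = Q(1) = 1.
        p0 = sum(1 for row in M if row[j] == 0) / m
        q0 = sum(1 for row in M_prime if row[j] == 0) / m_prime
        D = abs(p0 - q0)                                # S14
        if D * scale > k_alpha:                         # S15, S16
            k += 1                                      # S17
        if k >= t:                                      # S18
            return True                                 # S19: malicious data
    return False                                        # S21: legitimate data
```

Returning as soon as k reaches t mirrors the flowchart, where step S19 ends the process without testing the remaining items.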

According to the present embodiment, the determination device 1 uses the Kolmogorov-Smirnov test, a non-parametric test, to determine whether two evaluation matrices are derived from the same population, based on the cumulative probability distributions of samples of the evaluation values stored in the two matrices. The determination device 1 can thus determine whether an added evaluation matrix is malicious data without assuming the shape of the distribution of evaluation values of legitimate or malicious users.
The determination device 1 can therefore limit the influence of data poisoning attacks even in a collaborative filtering system that handles evaluation data not following a specific distribution shape such as the normal distribution.

By taking the set of evaluation values stored in an evaluation matrix as the sample and calculating the cumulative probability distribution representing the histogram of the observed evaluation values, the determination device 1 can appropriately identify malicious data based on the similarity of the distributions of evaluation values.

By taking mutually corresponding parts of the two evaluation matrices as the samples and calculating the cumulative probability distribution representing the histogram of the presence or absence of evaluation values in those samples, the determination device 1 can appropriately identify malicious data based on the similarity of samples that focus on a specific part of the evaluation matrix (for example, a group of items).

By taking a per-item submatrix of the evaluation matrix as the sample, the determination device 1 can apply the Kolmogorov-Smirnov test not only to the distribution of the evaluation data as a whole but also to the distribution of evaluation values for each item. As a result, even if an attacker chooses malicious evaluation values so that their distribution is identical or similar to that of legitimate users, the determination device 1 can detect a data poisoning attack in which, for example, malicious users evaluate only specific items, and can exclude the malicious data.
In this case, when the number of items for which the populations are determined not to be the same is equal to or greater than a predetermined number, the determination device 1 may determine that the populations of the two evaluation matrices are also not the same and that evaluation data from malicious users has been added. This suppresses false detections that would arise from judging on a single specific item, so that malicious data can be identified appropriately.

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. The effects described in the embodiment are merely a list of the most suitable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiment.

The determination method performed by the determination device 1 is realized by software. When it is realized by software, the programs constituting the software are installed on an information processing device (computer). These programs may be recorded on removable media such as a CD-ROM and distributed to users, or distributed by being downloaded to a user's computer via a network. The programs may also be provided to a user's computer as a Web service via a network without being downloaded.

1 Determination device
10 Control unit
11 Calculation unit
12 Determination unit
20 Storage unit

Claims

A determination device comprising:
a calculation unit that calculates, for an evaluation matrix storing each user's evaluation values for a plurality of items, a cumulative probability distribution for a sample of the evaluation values; and
a determination unit that determines that the populations of two mutually different evaluation matrices are not the same when an index, based on the difference between the cumulative probability distributions of the two evaluation matrices and on the number of users in each of the two evaluation matrices, satisfies a condition based on a predetermined significance level,
wherein the calculation unit takes, as the sample, a submatrix of the evaluation matrix corresponding to one or more specific item groups, and calculates the cumulative probability distribution from a histogram whose variable is the presence or absence of the evaluation value in the sample.

The determination device according to claim 1, wherein the calculation unit takes a per-item submatrix of the evaluation matrix as the sample.

The determination device according to claim 2, wherein the determination unit determines that the populations of the two evaluation matrices are not the same when the number of items satisfying the condition in the two evaluation matrices is equal to or greater than a predetermined number.

A determination method in which a computer executes:
a calculation step of calculating, for an evaluation matrix storing each user's evaluation values for a plurality of items, a cumulative probability distribution for a sample of the evaluation values; and
a determination step of determining that the populations of two mutually different evaluation matrices are not the same when an index, based on the difference between the cumulative probability distributions of the two evaluation matrices and on the number of users in each of the two evaluation matrices, satisfies a condition based on a predetermined significance level,
wherein, in the calculation step, a submatrix of the evaluation matrix corresponding to one or more specific item groups is taken as the sample, and the cumulative probability distribution is calculated from a histogram whose variable is the presence or absence of the evaluation value in the sample.

A determination program that causes a computer to execute:
a calculation step of calculating, for an evaluation matrix storing each user's evaluation values for a plurality of items, a cumulative probability distribution for a sample of the evaluation values; and
a determination step of determining that the populations of two mutually different evaluation matrices are not the same when an index, based on the difference between the cumulative probability distributions of the two evaluation matrices and on the number of users in each of the two evaluation matrices, satisfies a condition based on a predetermined significance level,
wherein, in the calculation step, a submatrix of the evaluation matrix corresponding to one or more specific item groups is taken as the sample, and the cumulative probability distribution is calculated from a histogram whose variable is the presence or absence of the evaluation value in the sample.