JP7096222B2

JP7096222B2 - Risk assessment device, risk assessment method and risk assessment program

Info

Publication number: JP7096222B2
Application number: JP2019178329A
Authority: JP
Inventors: 知明三本; 晋作清本
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2019-09-30
Filing date: 2019-09-30
Publication date: 2022-07-05
Anticipated expiration: 2039-09-30
Also published as: JP2021056698A

Description

The present invention relates to a method for assessing the risk of publishing document data.

Conventionally, various methods such as k-anonymization have been proposed as techniques for anonymizing data sets. However, these methods do not target general document data. For the anonymization of document data, methods have been proposed, as in Non-Patent Documents 1 and 2, that evaluate risk by calculating the amount of information from, for example, the number of occurrences of words in a document.

Non-Patent Document 1: David Sánchez and Montserrat Batet, "C-sanitized: A privacy model for document redaction and sanitization", Journal of the Association for Information Science and Technology, 148-163, 2016, Wiley Online Library.
Non-Patent Document 2: Venkatesan T. Chakaravarthy, Himanshu Gupta, Prasan Roy, and Mukesh K. Mohania, "Efficient Techniques for Document Sanitization", Proceedings of the 17th ACM Conference on Information and Knowledge Management, 843-852, 2008.

In conventional methods, risk is assessed from the viewpoint of whether sensitive information contained in the document data, such as an individual's disease name or political or religious beliefs, is concealed.
However, even when the document data itself contains no sensitive information linked to an individual, as in a school accident report, personal information and additional information related to the document data could be obtained and linked to it. It has therefore been difficult to assess the risk adequately from the document data alone.

An object of the present invention is to provide a risk assessment device, a risk assessment method, and a risk assessment program that can appropriately assess the risk of publishing document data.

The risk assessment device according to the present invention includes: an information amount calculation unit that calculates the amount of information of each word contained in document data; a search unit that performs a Web search using a combination of a predetermined number of words selected from a predetermined number of words having a high amount of information, and acquires a predetermined number of search results from the top; and an evaluation unit that evaluates, for the search results acquired by the search unit, a document risk arising from linkage with the document data, based on the degree of matching with the group of words that are included in the predetermined number of words and not included in the combination.

The search unit may perform the Web search using combinations of up to a specified maximum number of words.

The search unit may vary the pattern of the combination, perform the Web search a specified number of times, and acquire the top-ranked results of each search.

The evaluation unit may perform the evaluation based on the proportion of the search results acquired by the search unit for which the degree of matching exceeds a threshold.

The risk assessment device may include an extraction unit that extracts proper nouns satisfying a predetermined condition from the search results acquired by the search unit, and the evaluation unit may adjust the document risk upward when such a proper noun is extracted.

The risk assessment device may include an index acquisition unit that acquires an index indicating the topicality of the document data, and the evaluation unit may adjust the evaluation of the document risk according to the index.

The index acquisition unit may classify the content of the document data into one of predetermined categories by machine learning and acquire the index associated with that category.

The information amount calculation unit may calculate the amount of information of each word contained in the document data of a search result whose degree of matching exceeds the threshold, and, when the document data of the search result contains a word whose amount of information is at or above a predetermined level, the search unit may perform the Web search again using a combination that includes that word.

The evaluation unit may further evaluate the document risk for each combination and evaluate an individual risk for each word by integrating those document risks.

The evaluation unit may re-evaluate the document risk for the document data obtained when words whose individual risk is at or above a predetermined level are generalized according to a predetermined rule, and present the amount of change in the document risk due to the generalization.

In the risk assessment method according to the present invention, a computer executes: an information amount calculation step of calculating the amount of information of each word contained in document data; a search step of performing a Web search using a combination of a predetermined number of words selected from a predetermined number of words having a high amount of information, and acquiring a predetermined number of search results from the top; and an evaluation step of evaluating, for the search results acquired in the search step, a document risk arising from linkage with the document data, based on the degree of matching with the group of words that are included in the predetermined number of words and not included in the combination.

The risk assessment program according to the present invention causes a computer to function as the risk assessment device.

According to the present invention, the risk of publishing document data can be appropriately assessed.

FIG. 1 is a diagram illustrating an attack assumed in the present embodiment.
FIG. 2 is a diagram showing the functional configuration of the risk assessment device in the present embodiment.
FIG. 3 is a flowchart showing the risk assessment method in the present embodiment.

An example of an embodiment of the present invention is described below.
The risk assessment method of the present embodiment assumes that the attacker has ordinary search capability, and assesses the risk of document data against an attack using Web search.

FIG. 1 is a diagram illustrating an attack assumed in the present embodiment.
The attacker extracts keywords from the document data and performs a Web search using those keywords. The attacker then identifies information related to the document data, in particular an individual, from the search results, and attempts to link the identified information (for example, "△△-kun") with sensitive information contained in the document data (for example, "benefit of 15 million yen").

The device (computer) that carries out the risk assessment method of the present embodiment simulates such Web searches and thereby quantitatively assesses the risk that an attacker will discover information related to the document data.

FIG. 2 is a diagram showing the functional configuration of the risk assessment device 1 in the present embodiment.
The risk assessment device 1 is an information processing device (computer) such as a server or a personal computer, and includes a control unit 10 and a storage unit 20 as well as input/output devices for various data, communication devices, and the like.

The control unit 10 controls the risk assessment device 1 as a whole, and realizes each function of the present embodiment by reading and executing, as appropriate, the various programs stored in the storage unit 20. The control unit 10 may be a CPU.

The storage unit 20 is a storage area for the various programs and data that cause the hardware to function as the risk assessment device 1, and may be a ROM, a RAM, a flash memory, a hard disk drive (HDD), or the like. Specifically, the storage unit 20 stores a program (risk assessment program) that causes the control unit 10 to execute each function of the present embodiment, parameters, a document data set containing the document data to be processed by the program, and the like.

The control unit 10 includes a morphological analysis unit 11, an information amount calculation unit 12, a search unit 13, an extraction unit 14, an index acquisition unit 15, and an evaluation unit 16.
Through these functional units, the control unit 10 evaluates the risk that related information will be found by searching from the document data, and thereby encourages anonymization of the document data.

The morphological analysis unit 11 performs morphological analysis on the target document data and splits it into words. The morphological analysis unit 11 then extracts, from the resulting words, those of specific parts of speech (for example, nouns and verbs) that can pose a risk.

The information amount calculation unit 12 calculates the amount of information of each word extracted by the morphological analysis unit 11.
The amount of information I(x) of a word x can be expressed, for example, as I(x) = -log P(x), where P(x) is the occurrence probability of the word x, obtained by dividing the number of occurrences of x by the total number of words. Alternatively, the amount of information I(x) may be calculated using a document data set D with a measure such as TF-IDF.
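The formula I(x) = -log P(x) can be illustrated with a minimal Python sketch. The function name `information_content` and the sample word list are hypothetical; the natural logarithm is assumed because the description does not specify a base, and P(x) is estimated from the single word list passed in.

```python
import math
from collections import Counter


def information_content(words):
    """Return I(x) = -log P(x) for each distinct word in `words`.

    P(x) is the number of occurrences of x divided by the total number
    of words. The natural logarithm is an assumption of this sketch.
    """
    counts = Counter(words)
    total = len(words)
    return {word: -math.log(count / total) for word, count in counts.items()}


# Hypothetical words, as they might come out of morphological analysis.
words = ["school", "accident", "pool", "school", "injury", "school"]
info = information_content(words)

# Words ordered from highest to lowest amount of information.
ranked = sorted(info, key=info.get, reverse=True)
```

Frequent words such as "school" receive a low value and rare words a high value, which is why the rarest words are the natural candidates for search keywords.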

The search unit 13 performs Web searches using combinations of up to a specified maximum number (m) of words selected from the predetermined number (n) of words with the highest calculated amount of information, and acquires a predetermined number of search results from the top.
The number of combinations is Σ_m (nCm), that is, the binomial coefficient nCm summed over the keyword counts up to the maximum m, and depending on the values of n and m the number of searches needed to cover all combinations becomes enormous. For this reason, an upper limit may be placed on the number of searches, or the number of words (keywords) used in a single search may be fixed at a predetermined number instead of being varied up to the maximum number m.
The search unit 13 varies the pattern of word combinations, performs, for example, a specified number of Web searches, and acquires the top-ranked results (for example, 10 each) of each search.
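The growth in the number of keyword combinations, and the option of capping the number of searches, can be sketched as follows. The function names, the `min_size` parameter (whether single-word queries count is an assumption), and the use of seeded random sampling are all illustrative choices and are not taken from the description.

```python
import math
import random


def count_keyword_sets(n, m, min_size=1):
    """Number of keyword sets using min_size..m words out of n candidates."""
    return sum(math.comb(n, k) for k in range(min_size, m + 1))


def sample_keyword_sets(candidates, m, num_searches, seed=0):
    """Pick up to `num_searches` distinct m-word keyword sets at random.

    Sampling avoids enumerating every combination when n and m are large.
    `candidates` must be a list of distinct words.
    """
    rng = random.Random(seed)
    limit = min(num_searches, math.comb(len(candidates), m))
    chosen = set()
    while len(chosen) < limit:
        chosen.add(tuple(sorted(rng.sample(candidates, m))))
    return sorted(chosen)


# With n = 10 candidates and up to m = 3 keywords per query:
total = count_keyword_sets(10, 3)  # C(10,1) + C(10,2) + C(10,3)
queries = sample_keyword_sets(["pool", "injury", "student", "drain", "teacher"], 2, 4)
```

Fixing the keyword count at m and capping the number of queries keeps the cost at a known number of Web searches, at the price of covering only part of the combination space.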

The extraction unit 14 extracts proper nouns satisfying a predetermined condition from the search results acquired by the search unit 13.
For example, when the document data is an accident report or the like, the proper noun corresponds to the victim's name; once an attacker links it to the document data, sensitive information about the individual becomes known.

The index acquisition unit 15 acquires an index indicating the topicality of the document data and provides it to the evaluation unit 16.
Topicality is, for example, the severity of injury caused by an accident, and indicates how much related information exists, that is, how easily the document can be found by searching. This index may be assigned to the document data manually in advance, or may be assigned according to the content of the document data using an existing language processing technique.
For example, the index acquisition unit 15 classifies the content of the document data by machine learning into one of predetermined categories (for example, serious or minor injury, or fatal or non-fatal accident) and acquires the index associated with that category.

The evaluation unit 16 evaluates, over the whole set of search results acquired by the search unit 13, the risk of linkage with the document data, based on the degree of matching with the group of words that are among the predetermined number (n) of words extracted from the document data and not included in the combination used for the search.
Specifically, for example, the evaluation unit 16 may perform the evaluation based on the proportion of the search results acquired by the search unit 13 whose degree of matching exceeds a threshold, that is, search results (articles) that contain at least a predetermined number of words identical or similar to the words not used as search keywords.
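The proportion-based evaluation can be illustrated with a short Python sketch. It assumes that each search result is available as plain text, that "matching" means case-insensitive substring occurrence (the description also allows similar words, which this sketch does not handle), and that the degree of matching is the fraction of unused words found; the function names and the 0.5 threshold are hypothetical.

```python
def match_degree(result_text, unused_words):
    """Fraction of the unused words that occur in one search result."""
    if not unused_words:
        return 0.0
    text = result_text.lower()
    hits = sum(1 for word in unused_words if word.lower() in text)
    return hits / len(unused_words)


def document_risk(results, unused_words, threshold=0.5):
    """Proportion of search results whose match degree exceeds the threshold."""
    if not results:
        return 0.0
    related = [r for r in results if match_degree(r, unused_words) > threshold]
    return len(related) / len(results)


# Hypothetical search results and the words NOT used as search keywords.
results = [
    "Pool accident at a school injures student",
    "Weather forecast for tomorrow",
    "Student injury reported after pool accident",
]
unused = ["pool", "injury", "student"]
risk = document_risk(results, unused)
```

Here the first and third results contain more than half of the unused words, so two of the three results count as related documents.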

The evaluation unit 16 also adjusts the risk evaluation according to the index obtained from the index acquisition unit 15. That is, when the topicality of the document data is high, related information is more likely to be found by searching, so the risk is rated higher.
Furthermore, the evaluation unit 16 adjusts the risk upward when the extraction unit 14 extracts a proper noun such as the victim's personal name.

The evaluation unit 16 may further evaluate the risk for each combination of words used in a Web search and evaluate an individual risk for each word by integrating these risks. For example, a word contained in the search keywords that yielded at least a predetermined number of search results with a word matching degree above the threshold is judged to carry a risk when it appears in the document. Furthermore, a word that is likewise judged to be high risk in different combinations is evaluated as carrying a higher risk.
The resulting word-by-word evaluation is presented to the user to encourage anonymization of words with high individual risk. Alternatively, words whose individual risk is at or above a predetermined level may be anonymized by automatic generalization, or generalization candidates may be presented.
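One possible way to integrate per-combination risks into per-word individual risks is sketched below. The description does not fix the integration rule; counting, for each word, the number of high-risk combinations it took part in is an assumption of this sketch, as are the function name and the 0.5 cutoff.

```python
from collections import defaultdict


def individual_risks(combo_risks, high_risk=0.5):
    """Count, for each word, the high-risk keyword combinations it appears in.

    `combo_risks` maps a tuple of search keywords to the document risk
    obtained with that combination. Words that only appear in low-risk
    combinations get a score of 0.
    """
    scores = defaultdict(int)
    for combo, risk in combo_risks.items():
        for word in combo:
            scores[word] += int(risk >= high_risk)
    return dict(scores)


# Hypothetical document risks per keyword combination.
combo_risks = {
    ("pool", "drain"): 0.8,
    ("pool", "injury"): 0.6,
    ("injury", "student"): 0.1,
}
scores = individual_risks(combo_risks)
```

In this toy data "pool" is high risk in two different combinations and therefore ranks above "drain" and "injury", matching the idea that a word found risky across different combinations carries a higher risk.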

Furthermore, the evaluation unit 16 may re-evaluate the risk of the document data with the high-individual-risk words generalized, and present to the user the amount of change (reduction) in the risk of the document data due to the generalization.
The generalization may target all words whose individual risk is at or above a predetermined level; alternatively, the evaluation unit 16 may give priority to the words with the highest individual risk and present them to the user in order, together with the corresponding amount of change in the risk of the document data.
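The re-evaluation after generalization can be sketched as a before-and-after comparison. Everything here is illustrative: `generalization_effect`, the replacement table, and the toy `evaluate` function are hypothetical stand-ins, and in the device the evaluation would be the search-based document risk rather than a simple word count.

```python
def generalization_effect(words, generalize_map, evaluate):
    """Compare document risk before and after generalizing selected words.

    `generalize_map` maps a high-risk word to its generalized replacement,
    and `evaluate` is any callable returning a risk value for a word list.
    """
    before = evaluate(words)
    generalized = [generalize_map.get(word, word) for word in words]
    after = evaluate(generalized)
    return {"before": before, "after": after, "reduction": before - after}


# Toy stand-in for the risk evaluation: share of words on a risky list.
risky = {"Example Elementary"}


def toy_evaluate(words):
    return sum(word in risky for word in words) / len(words)


words = ["Example Elementary", "pool", "accident", "injury"]
effect = generalization_effect(
    words, {"Example Elementary": "a school"}, toy_evaluate
)
```

Running the comparison once per candidate word, in descending order of individual risk, gives the ordered list of risk reductions that could be shown to the user.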

FIG. 3 is a flowchart showing the risk assessment method in the present embodiment.
Here, it is assumed that the following are input as parameters: the number n of words extracted from the document data as search keyword candidates, the number m of words used as search keywords, the number i of Web searches executed, the number j of search results acquired per search, and the index ε indicating the topicality (sensitivity) of the document data. As described above, the index ε may instead be calculated by semantic analysis of the document data.

In step S1, the morphological analysis unit 11 performs morphological analysis on the target document data and extracts words of specific parts of speech, such as nouns and verbs, as words that an attacker is likely to use as search keywords.

In step S2, the information amount calculation unit 12 calculates the amount of information of each word extracted in step S1 using a measure based on frequency of occurrence.

In step S3, the search unit 13 extracts the n words with the highest amount of information calculated in step S2, and executes the Web search i times, each time with m words selected at random from these n words. For each Web search, the search unit 13 acquires the top j search results, obtaining i × j search results in total.

In step S4, the evaluation unit 16 selects, from the i × j search results obtained in step S3, the related documents in which the proportion of words identical or similar to the n - m words not used as search keywords is at or above a predetermined level. The evaluation unit 16 then calculates a risk evaluation value according to the proportion of the selected related documents among all the search results.

In step S5, the extraction unit 14 determines whether the related documents selected in step S4 contain a proper noun satisfying a specific condition, such as the victim's name. If the determination is YES, the process proceeds to step S6; if it is NO, the process proceeds to step S7.

In step S6, the evaluation unit 16 adjusts the evaluation value calculated in step S4 so that the risk is rated higher. The evaluation unit 16 may adjust the amount or rate of the increase according to the proportion of related documents found in step S5 to contain such a proper noun.

In step S7, the evaluation unit 16 adjusts the evaluation value based on the index ε indicating the topicality of the document data, rating the risk higher for document data with higher topicality.
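Steps S5 to S7 adjust the evaluation value from step S4 without prescribing a formula. The sketch below shows one possible shape of that adjustment, under loudly stated assumptions: the increase in S6 is proportional to the share of related documents containing a proper noun, ε acts as a multiplicative factor in S7, and the result is capped at 1. The function name and the `noun_boost` parameter are hypothetical.

```python
def adjusted_risk(base_risk, proper_noun_ratio, epsilon, noun_boost=0.5):
    """Adjust the step S4 evaluation value as in steps S5 to S7.

    base_risk:         evaluation value from step S4, in [0, 1]
    proper_noun_ratio: share of related documents containing a proper noun
    epsilon:           topicality index, used here as a multiplier (assumption)
    noun_boost:        maximum relative increase applied in S6 (assumption)
    """
    risk = base_risk
    if proper_noun_ratio > 0:  # S5 is YES, so apply S6
        risk *= 1.0 + noun_boost * proper_noun_ratio
    risk *= epsilon  # S7: higher topicality gives higher risk
    return min(risk, 1.0)  # keep the value in [0, 1]


no_names = adjusted_risk(0.4, 0.0, 1.0)
with_names = adjusted_risk(0.4, 0.5, 1.0)
topical = adjusted_risk(0.4, 0.5, 1.5)
```

Any monotonic adjustment would fit the flowchart equally well; the multiplicative form is used only because it keeps a zero base risk at zero.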

According to the present embodiment, the risk assessment device 1 performs a Web search using, as search keywords, a combination of a specified number of words selected from a predetermined number of words with a high amount of information contained in the document data, and acquires a predetermined number of search results from the top. The risk assessment device 1 evaluates the risk of linkage with the document data based on the degree to which the search results match the group of words that are included in the predetermined number of words and not included in the search keywords.
By simulating an actual attack in this way before the document data is published, the risk assessment device 1 can quantitatively and appropriately evaluate the risk that an attacker will obtain personal and additional information related to the document data.

By performing Web searches using combinations of up to a specified maximum number of words, the risk assessment device 1 can simulate several different numbers of search keywords an attacker might choose, and can appropriately evaluate the risk of the document data.

The risk assessment device 1 varies the pattern of search keyword combinations, performs the Web search a specified number of times, and acquires the top-ranked results of each search.
By simulating multiple search patterns in this way, the risk assessment device 1 can acquire search results from various viewpoints and can more appropriately evaluate the risk that related information will be obtained.

The risk assessment device 1 evaluates the risk based on the proportion of search results in which the degree of matching of words other than the search keywords exceeds a threshold.
The risk assessment device 1 can thereby efficiently identify related information that can be linked to the document data and appropriately evaluate the risk.

The risk assessment device 1 adjusts the risk upward when a proper noun satisfying a predetermined condition is extracted from the search results.
The risk assessment device 1 can thereby determine the possibility that an attacker will link the document data to a proper noun such as a personal name or a school name, and appropriately evaluate the risk.

The risk assessment device 1 adjusts the risk evaluation according to an index indicating the topicality of the document data.
For example, when the document data is an accident report, the number of articles varies with the severity of the accident; for a serious, highly topical accident, articles about the accident can easily be found even from words with a low amount of information. By taking this into account, the risk assessment device 1 can appropriately evaluate the risk of related information being linked, in line with reality.

In addition, by classifying the content of the document data into one of predetermined categories by machine learning and acquiring the index associated with that category, the risk assessment device 1 can assign an appropriate index even when none has been determined in advance, and appropriately evaluate the risk.

The risk assessment device 1 evaluates the risk for each combination of words and evaluates the individual risk of each word by integrating the evaluation results.
The risk assessment device 1 can thereby reduce the risk of the document data, either by presenting the words with high individual risk contained in the document data to encourage anonymization before publication, or by generalizing them automatically.

Furthermore, the risk assessment device 1 re-evaluates the risk of the document data with the words whose individual risk is at or above a predetermined level generalized, and presents the amount of change in the risk of the document data due to the generalization.
The risk assessment device 1 can thereby show how much each generalization lowers the risk of the document data, allowing the user to anonymize the document data at an appropriate level.

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. Moreover, the effects described in the embodiment are merely a list of the most favorable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiment.

In the embodiment described above, the Web search is performed using words contained in the document data under evaluation, but in practice an attacker may also search again using words contained in the search results.
Accordingly, the risk assessment device 1 may likewise calculate the amount of information of words in the related documents retrieved for the risk assessment and, when these contain words whose amount of information is at or above a predetermined level, perform the Web search again with search keywords that include those words.
This allows the attacker's search behavior to be simulated more faithfully, so a more appropriate evaluation of the risk can be expected.

The risk assessment method performed by the risk assessment device 1 is realized by software. When it is realized by software, the programs constituting the software are installed on an information processing device (computer). These programs may be recorded on removable media such as a CD-ROM and distributed to users, or may be distributed by being downloaded to a user's computer via a network. Furthermore, these programs may be provided to a user's computer as a Web service via a network without being downloaded.

1 Risk assessment device
10 Control unit
11 Morphological analysis unit
12 Information amount calculation unit
13 Search unit
14 Extraction unit
15 Index acquisition unit
16 Evaluation unit
20 Storage unit

Claims

A risk assessment device comprising:
an information amount calculation unit that calculates the amount of information of each word contained in document data;
a search unit that performs a Web search using a combination of a plurality of words taken from a predetermined number of words ranked highest by amount of information, and acquires a predetermined number of search results from the top; and
an evaluation unit that evaluates, for the search results acquired by the search unit, a document risk arising from linkage with the document data, based on the degree of matching with the group of words that are included in the predetermined number of words and not included in the combination.

The risk assessment device according to claim 1, wherein the search unit performs the Web search using combinations of up to a designated maximum number of words.

The risk assessment device according to claim 1 or 2, wherein the search unit varies the pattern of the combination, performs the Web search a specified number of times, and acquires the top-ranked results of each search.

The risk assessment device according to any one of claims 1 to 3, wherein the evaluation unit performs the evaluation based on the proportion of the search results acquired by the search unit for which the degree of matching exceeds a threshold.

The risk assessment device according to any one of claims 1 to 4, further comprising an extraction unit that extracts proper nouns satisfying a predetermined condition from the search results acquired by the search unit,
wherein the evaluation unit adjusts the document risk upward when such a proper noun is extracted.

The risk assessment device according to any one of claims 1 to 5, further comprising an index acquisition unit that acquires an index indicating the topicality of the document data,
wherein the evaluation unit adjusts the evaluation of the document risk according to the index.

The risk assessment device according to claim 6, wherein the index acquisition unit classifies the content of the document data into one of predetermined categories by machine learning, and acquires the index associated with that category.

The risk assessment device according to any one of claims 1 to 7, wherein the information amount calculation unit calculates the amount of information of each word contained in the document data of a search result whose degree of matching exceeds a threshold, and
when the document data of the search result contains a word whose amount of information is at or above a predetermined level, the search unit performs the Web search again using a combination that includes that word.

The risk assessment device according to any one of claims 1 to 8, wherein the evaluation unit further evaluates the document risk for each of the combinations and evaluates an individual risk for each word by integrating those document risks.

The risk assessment device according to claim 9, wherein the evaluation unit re-evaluates the document risk for the document data obtained when words whose individual risk is at or above a predetermined level are generalized according to a predetermined rule, and presents the amount of change in the document risk due to the generalization.

A risk assessment method in which a computer executes:
an information amount calculation step of calculating the amount of information of each word contained in document data;
a search step of performing a Web search using a combination of a plurality of words taken from a predetermined number of words ranked highest by amount of information, and acquiring a predetermined number of search results from the top; and
an evaluation step of evaluating, for the search results acquired in the search step, a document risk arising from linkage with the document data, based on the degree of matching with the group of words that are included in the predetermined number of words and not included in the combination.

A risk assessment program for causing a computer to function as the risk assessment device according to any one of claims 1 to 10.