JP6441930B2

JP6441930B2 - Data analysis apparatus, data analysis apparatus control method, and data analysis apparatus control program

Info

Publication number: JP6441930B2
Application number: JP2016537661A
Authority: JP
Inventors: 守本　正宏; 正宏守本; 秀樹武田; 和巳蓮子
Original assignee: Ubic Inc
Current assignee: Ubic Inc
Priority date: 2014-07-30
Filing date: 2014-07-30
Publication date: 2018-12-19
Anticipated expiration: 2034-07-30
Also published as: TW201610727A; JPWO2016016974A1; WO2016016974A1

Description

The present invention relates to a data analysis apparatus and the like capable of extracting data related to a predetermined case from newly acquired data.

Corporate legal risk is growing: companies are pursued for violations of antimonopoly (antitrust) law on suspicion of taking part in price cartels, and confidential information is leaked with the help of insiders. Against this background, systems that detect such misconduct are in demand. For example, Patent Document 1 discloses a document classification system that analyzes digitized document information collected for submission as evidence in litigation and classifies it so that it can be readily used in the litigation.

Meanwhile, techniques for recording behavior related to computer use (for example, which files were accessed) have also been proposed. For example, Patent Document 2 discloses a display method for detecting information file leakage that tracks the behavior of users of an information network and thereby helps prevent information leaks.

Patent Document 1: JP 2013-182338 A
Patent Document 2: JP 2007-304943 A

The conventional system disclosed in Patent Document 1 cannot, for example, detect signs of such misconduct in the e-mail that circulates on a network every day. That system analyzes the relevant documents to be submitted in litigation filed after the misconduct has occurred, and therefore assumes that all documents to be analyzed exist in advance.

One conceivable way to catch misconduct in progress is to record user behavior, as in the display method disclosed in Patent Document 2, and to alert an administrator when a predetermined problematic behavior (a behavior defined as "misconduct") is found. This approach has two problems: (a) by the time the predetermined behavior is found, the misconduct has often already occurred; and (b) the more the alert conditions are relaxed in order to detect misconduct in advance, the more frequently alerts fire, until monitoring ceases to be effective.

Furthermore, the prior art disclosed in Patent Documents 1 and 2 is specialized for particular kinds of misconduct and is not general-purpose, so it cannot be applied to cases other than those particular kinds of misconduct.

The present invention has been made in view of the above problems, and its object is to provide a data analysis apparatus and the like capable of extracting data related to a predetermined case by analyzing current data based on the result of analyzing past data.

To solve the above problems, a data analysis apparatus according to one aspect of the present invention is a data analysis apparatus capable of extracting data related to a predetermined case from newly acquired data, and includes: a threshold identification unit that, when unjudged data (data for which it has not been judged whether it is related to the predetermined case) is newly acquired, identifies a threshold serving as the basis for that judgment on the unjudged data from scores each calculated, as an index of the strength of the relationship with the predetermined case, for judged data (data for which a user has judged whether it is related to the predetermined case); and a data setting unit that sets the unjudged data as data to be reported to the user according to the result of comparing the threshold identified by the threshold identification unit with a score calculated for the unjudged data.

In the data analysis apparatus according to one aspect of the present invention, the threshold identification unit may identify, as the threshold, a score among the scores calculated for the judged data at which a target value set for the precision can be exceeded.

The data analysis apparatus according to one aspect of the present invention may further include an excess determination unit that compares the score calculated for the unjudged data with the threshold identified by the threshold identification unit and thereby determines whether the score exceeds the threshold, and the data setting unit may set the unjudged data as data to be reported to the user when the excess determination unit determines that the score exceeds the threshold.

The data analysis apparatus according to one aspect of the present invention may further include an element evaluation unit that evaluates each data element included in the judged data based on a predetermined criterion, and a score calculation unit that calculates the scores based on the results of the evaluation by the element evaluation unit.

In the data analysis apparatus according to one aspect of the present invention, the element evaluation unit may evaluate a data element using, as one of the predetermined criteria, the amount of transmitted information (mutual information) representing the dependency between the data element and the result of the user's judgment on the judged data that includes the data element.

The data analysis apparatus according to one aspect of the present invention may further include a result acquisition unit that acquires from the user, via a predetermined input unit, the result of the user's judgment as to whether the data set by the data setting unit is related to the predetermined case, and the element evaluation unit may evaluate each data element included in the data set by the data setting unit based on the result acquired by the result acquisition unit.

The data analysis apparatus according to one aspect of the present invention may further include a storing unit that stores, in a predetermined storage unit, each data element evaluated by the element evaluation unit in association with the result of evaluating that data element.

In the data analysis apparatus according to one aspect of the present invention, the unjudged data may each include unique data elements, each capable of identifying one of a plurality of persons or organizations, and the data setting unit may extract the unique data elements from the unjudged data and visualize the strength of the connections among the plurality of persons or organizations by estimating the correspondence between a first unique data element and a second unique data element different from the first.

The data analysis apparatus according to one aspect of the present invention may further include a judged data acquisition unit that acquires the judged data by acquiring from the user, via a predetermined input unit, the result of the user's judgment as to whether data extracted from a predetermined data group is related to the predetermined case.

The data analysis apparatus according to one aspect of the present invention may further include a relation assignment unit that assigns, to the data set by the data setting unit, relationship information indicating that the data is related to the predetermined case.

In the data analysis apparatus according to one aspect of the present invention, the data may be a document digitized so that it can be processed by a computer, and the data elements may be keywords included in the document.

In the data analysis apparatus according to one aspect of the present invention, the data may be audio digitized so that it can be processed by a computer, and the data elements may be partial audio segments included in the audio.

To solve the above problems, a control method for a data analysis apparatus according to one aspect of the present invention is a control method for a data analysis apparatus capable of extracting data related to a predetermined case from newly acquired data, and includes: a threshold identification step of, when unjudged data (data for which it has not been judged whether it is related to the predetermined case) is newly acquired, identifying a threshold serving as the basis for that judgment on the unjudged data from scores each calculated, as an index of the strength of the relationship with the predetermined case, for judged data (data for which a user has judged whether it is related to the predetermined case); and a data setting step of setting the unjudged data as data to be reported to the user according to the result of comparing the threshold identified in the threshold identification step with a score calculated for the unjudged data.

To solve the above problems, a control program for a data analysis apparatus according to one aspect of the present invention is a control program for a data analysis apparatus capable of extracting data related to a predetermined case from newly acquired data, and causes a computer to implement: a threshold identification function of, when unjudged data (data for which it has not been judged whether it is related to the predetermined case) is newly acquired, identifying a threshold serving as the basis for that judgment on the unjudged data from scores each calculated, as an index of the strength of the relationship with the predetermined case, for judged data (data for which a user has judged whether it is related to the predetermined case); and a data setting function of setting the unjudged data as data to be reported to the user according to the result of comparing the threshold identified by the threshold identification function with a score calculated for the unjudged data.

According to one aspect of the present invention, when unjudged data (data for which it has not been judged whether it is related to a predetermined case) is newly acquired, the data analysis apparatus, the control method for the data analysis apparatus, and the control program for the data analysis apparatus identify a threshold serving as the basis for that judgment on the unjudged data from scores each calculated, as an index of the strength of the relationship with the predetermined case, for judged data (data for which a user has judged whether it is related to the predetermined case), and set the unjudged data as data to be reported to the user according to the result of comparing that threshold with a score calculated for the unjudged data.

With this configuration, the data analysis apparatus and the like can extract data related to a predetermined case by analyzing current data based on the result of analyzing past data.

FIG. 1 is a block diagram showing the main configuration of a document analysis system according to an embodiment of the present invention.
FIG. 2 is a schematic diagram showing an example of the document analysis system.
FIG. 3 is a table showing the minimum score corresponding to the number of documents to which a review result has been assigned; (a) shows the case where the target precision is 100%, and (b) shows the case where the target precision is 90%.
FIG. 4 is a flowchart showing an example of the processing executed by the document analysis system.

An embodiment of the present invention will be described with reference to FIGS. 1 to 4.

[Outline of Document Analysis System 100]
The document analysis system (data analysis apparatus) 100 is an information processing system that analyzes digital documents and can thereby extract, from newly acquired documents, documents related to a predetermined case. The document analysis system 100 only needs to include a computer capable of executing the processing described below, and can be implemented using, for example, a server, a personal computer, a mainframe, a workstation, or other electronic equipment.

FIG. 2 is a schematic diagram showing an example of the document analysis system 100. As shown in FIG. 2, a reviewer (user) judges whether a document is related to a predetermined case and inputs the result of that judgment (review result 5a) into the document analysis system 100.

Here, a "document" is data digitized so that it can be processed by a computer, and broadly includes, for example, e-mail, technical documents, presentation materials, spreadsheets, financial statements, meeting materials, contracts, organization charts, and business plans. FIG. 2 shows an example in which the document analysis system 100 takes in the e-mail that circulates on a network every day and analyzes the documents contained in that e-mail.

The "predetermined case" includes incidents that arise when a general user who uses the documents within an organization commits misconduct and/or an act preparatory to it. It broadly covers cases that the organization wants to prevent from occurring, for example, an incident in which confidential information is leaked outside the organization, an incident of collusion with other companies, an incident of falsified financial statements, an incident in which a business partner is billed for fictitious charges and the payment is embezzled, and other incidents undesirable for a company. However, the predetermined case is not limited to these examples and may broadly include any case that can generate related data (for example, documents, audio, or video).

Based on the review result 5a, the document analysis system 100 evaluates each keyword (data element) included in the documents (judged data) according to a predetermined criterion (for example, the amount of transmitted information). Based on the results of that evaluation, the document analysis system 100 then calculates, for each document, a score indicating the strength of its relationship with the predetermined case, and identifies as the relevance threshold the minimum score at which a target value (target precision) set for the precision can be exceeded. Here, the precision is the proportion of documents judged to be related to the predetermined case within a document group containing a predetermined number of documents.

In other words, the document analysis system 100 sets the relevance threshold based on the review results 5a given by the reviewer (the results of human judgment on past data), and can return to the reviewer a list result 5b (information that lists documents for presentation to the reviewer) containing only the documents whose scores exceed the relevance threshold, as documents likely to be related to the predetermined case. Put differently, the document analysis system 100 can extract data related to a predetermined case by analyzing current data based on the result of analyzing past data. This allows the document analysis system 100 to detect, for example, signs that misconduct is about to occur.

[Configuration of Document Analysis System 100]
FIG. 1 is a block diagram showing the main configuration of the document analysis system 100. As shown in FIG. 1, the document analysis system 100 includes a control unit 10 (a data extraction unit 11, a result acquisition unit 12, an element evaluation unit 13, a score calculation unit 14, a score identification unit 15, an excess determination unit 16, a data setting unit 17, a relation assignment unit 18, and a storing unit 19), a receiving unit 20, an input unit 40, a display unit 50, and a storage unit 30.

The control unit 10 performs overall control of the various functions of the document analysis system 100. The control unit 10 includes the data extraction unit 11, the result acquisition unit 12, the element evaluation unit 13, the score calculation unit 14, the score identification unit 15, the excess determination unit 16, the data setting unit 17, the relation assignment unit 18, and the storing unit 19.

The data extraction unit (judged data acquisition unit) 11 extracts, from a predetermined document group (data group), a predetermined number of documents 1a for which a reviewer is to judge whether they are related to a predetermined case. The document group may be data circulating on a network or data stored in advance in the storage unit 30.

The data extraction unit 11 can present the extracted documents 1a to the reviewer by outputting them to the display unit 50. The reviewer can then assign to each document 1a a review result 5a indicating, for example, that the document 1a is "related to the predetermined case" or "not related to the predetermined case". The data extraction unit 11 also outputs the documents 1a to the result acquisition unit 12 and the element evaluation unit 13.

When a document 1a is input from the data extraction unit 11, the result acquisition unit (judged data acquisition unit) 12 acquires, via the input unit 40, the result of the reviewer's judgment as to whether the document 1a is related to the predetermined case (review result 5a), and outputs the review result 5a to the element evaluation unit 13 and the score identification unit 15.

The element evaluation unit 13 evaluates, based on a predetermined criterion, each keyword (data element) included in the documents 1a for which the reviewer has judged whether they are related to the predetermined case. For example, the element evaluation unit 13 can evaluate a keyword by calculating its weight using, as one of the predetermined criteria, the amount of transmitted information representing the dependency between the keyword and the reviewer's judgment (review result 5a) on the documents 1a that include the keyword. Because keywords are weighted according to how strongly they depend on the reviewer's judgments, the document analysis system 100 can extract data related to the predetermined case more accurately.
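The keyword weighting described above can be illustrated with a minimal sketch. It assumes that the "amount of transmitted information" is the mutual information between keyword presence and the reviewer's label, and that each document has already been reduced to a set of keywords; the function name `mutual_information` and the sample data are hypothetical and do not appear in the specification.

```python
import math
from collections import Counter

def mutual_information(docs, labels, keyword):
    """Mutual information (in bits) between 'keyword is present' and the review label."""
    n = len(docs)
    # Joint counts of (keyword present?, judged related?) over the reviewed documents.
    joint = Counter((keyword in doc, label) for doc, label in zip(docs, labels))
    mi = 0.0
    for (present, label), count in joint.items():
        p_xy = count / n
        p_x = sum(c for (p, _), c in joint.items() if p == present) / n
        p_y = sum(c for (_, l), c in joint.items() if l == label) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

# Hypothetical reviewed documents, each reduced to its set of keywords.
docs = [{"price", "adjust"}, {"price", "meeting"}, {"lunch"}, {"meeting"}]
labels = [True, True, False, False]  # True = reviewer judged "related"

# One weight per keyword that occurs anywhere in the reviewed documents.
weights = {kw: mutual_information(docs, labels, kw) for kw in set().union(*docs)}
```

In this toy data "price" appears exactly in the related documents and receives the highest weight, while "meeting" is independent of the labels and receives zero. Mutual information alone does not distinguish keywords that indicate relevance from keywords that indicate irrelevance, so a practical weighting would need an additional sign or filtering rule, which this sketch leaves out.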

Alternatively, the element evaluation unit 13 may evaluate a keyword by assigning it a predetermined weight. In this case, the element evaluation unit 13 can, for example, assign a weight of "1" to the keyword.

A "keyword" here is a character string that carries meaning (a morpheme). For example, the sentence "classify documents" contains the keywords "document" and "classify". The element evaluation unit 13 outputs keyword information 5c, which is a pair consisting of a keyword and its weight, to the score calculation unit 14 and the storing unit 19.

Based on the results of the evaluation by the element evaluation unit 13 (keyword information 5c), the score calculation unit 14 calculates, for each document 1a, a score 5d indicating the strength of its relationship with the predetermined case, and outputs the score 5d to the score identification unit 15. In addition, when a document 1b (data for which it has not yet been judged whether it is related to the predetermined case) is newly acquired from the receiving unit 20, the score calculation unit 14 calculates a score 5e for the document 1b and outputs the score 5e to the excess determination unit 16.

The score calculation unit 14 can calculate the score of a document by summing the weights of the keywords that appear in it. For example, suppose a document contains the sentence "adjust the price", and the element evaluation unit 13 has evaluated the keywords "price" and "adjust" and set their weights to "1.2" and "2.2", respectively. The score calculation unit 14 can then calculate the score of the document as "3.4" (1.2 + 2.2).

Specifically, the score calculation unit 14 generates a keyword vector indicating whether each predetermined keyword is included in the document. Each element of the keyword vector takes the value "0" or "1" and thereby indicates whether the predetermined keyword associated with that element is included in the document. For example, when the document contains the keyword "price", the score calculation unit 14 changes the element of the keyword vector corresponding to "price" from "0" to "1". The score calculation unit 14 then calculates the score S of the document by taking the inner product of the keyword vector (a column vector) and the weight vector (a column vector whose elements are the weights of the respective keywords), as in the following formula.

Here, s denotes the keyword vector and W denotes the weight vector. T denotes the transpose of a matrix or vector (interchanging rows and columns).
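The formula itself is not reproduced in this text. From the description (an inner product of two column vectors, written with a transpose), it presumably has the form S = s^T W. The following minimal sketch computes that inner product; the vocabulary, the weight for "meeting", and the function names are assumptions made for illustration only.

```python
def keyword_vector(document_keywords, vocabulary):
    """Binary vector s: 1 if the vocabulary keyword occurs in the document, else 0."""
    return [1 if kw in document_keywords else 0 for kw in vocabulary]

def score(s, w):
    """Inner product of the keyword vector s and the weight vector w."""
    return sum(si * wi for si, wi in zip(s, w))

# Hypothetical vocabulary and weights; 1.2 and 2.2 are the weights from the
# "adjust the price" example, 0.5 is an invented weight for an unused keyword.
vocabulary = ["price", "adjust", "meeting"]
W = [1.2, 2.2, 0.5]

s = keyword_vector({"price", "adjust"}, vocabulary)  # [1, 1, 0]
S = score(s, W)  # approximately 3.4, matching the worked example
```

Because the keyword vector is binary, the inner product reduces to the sum of the weights of the keywords that are present, which is the same calculation as the 1.2 + 2.2 example.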

Alternatively, the score calculation unit 14 may calculate the score S according to the following formula.

Here, m_j denotes the frequency of occurrence of the j-th keyword, and w_i denotes the weight of the i-th keyword. The score calculation unit 14 may calculate the score 5d and/or the score 5e based on the result of evaluating a first keyword included in the document 1a and/or the document 1b (the weight of the first keyword) and the result of evaluating a second keyword included in the same document (the weight of the second keyword). The score calculation unit 14 may also calculate the score 5d and/or the score 5e for each sentence included in the document 1a and/or the document 1b (both are described in detail later).

The score identification unit (threshold identification unit) 15 identifies, as the relevance threshold 6, the minimum score at which a target value (target precision) can be exceeded. The target value is set for the precision, which indicates the proportion of documents 1a judged to be related to the predetermined case within a document group containing a predetermined number of documents. Specifically, when the scores 5d are input from the score calculation unit 14, the score identification unit 15 sorts them in descending order. Next, the score identification unit 15 scans the review results 5a assigned to the documents 1a in order, starting from the document 1a with the highest score 5d (rank 1), and successively calculates the precision, that is, the ratio of the number of documents assigned the review result 5a "related to the predetermined case" to the number of documents scanned so far.

For example, suppose review results 5a have been assigned to 100 documents 1a. If, after scanning the documents ranked 1st through 20th by score, 18 documents have been assigned the review result 5a "related to the predetermined case", the score identification unit 15 calculates the precision as 0.9 (18/20). Likewise, if, after scanning the documents ranked 1st through 40th, 35 documents have been assigned the review result 5a "related to the predetermined case", the score identification unit 15 calculates the precision as 0.875 (35/40).
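The running precision calculation can be sketched as follows. This is an illustrative sketch only: the function name `running_precision` and the sample data are hypothetical, and each reviewed document is assumed to be represented as a (score, judged related) pair.

```python
def running_precision(scored_reviews):
    """Return (score, precision so far) per document, scanning from the highest score down."""
    ranked = sorted(scored_reviews, key=lambda pair: pair[0], reverse=True)
    result = []
    related = 0
    for rank, (score, is_related) in enumerate(ranked, start=1):
        related += is_related              # True counts as 1, False as 0
        result.append((score, related / rank))
    return result

# Hypothetical data: 20 reviewed documents with scores 20.0 down to 1.0;
# the documents at ranks 5 and 12 were judged unrelated, the other 18 related.
reviews = [(float(21 - rank), rank not in (5, 12)) for rank in range(1, 21)]

precisions = running_precision(reviews)
final_score, final_precision = precisions[-1]  # 18 of 20 related, as in the 18/20 example
```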

The score identification unit 15 calculates the precision for every document 1a and identifies the minimum score at which the target precision can be exceeded. Specifically, the score identification unit 15 scans the precision values calculated for the documents 1a in order, starting from the document 1a with the lowest score 5d (rank 100). When a precision value exceeds the target precision, it outputs the score corresponding to that precision value to the excess determination unit 16 and the storing unit 19 as the minimum score at which the target precision can be maintained (relevance threshold 6).
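The two scans described above (top-down to compute precision, then bottom-up to find the threshold) can be sketched as follows. The function name and data are hypothetical. As an assumption, the comparison uses "greater than or equal to", because a target precision of 100% could never be strictly exceeded; the specification itself speaks only of "exceeding" the target.

```python
def relevance_threshold(scored_reviews, target_precision):
    """Lowest score whose running precision reaches the target, or None if none does."""
    ranked = sorted(scored_reviews, key=lambda pair: pair[0], reverse=True)

    # First scan (highest score first): precision among the documents scanned so far.
    precisions = []
    related = 0
    for rank, (score, is_related) in enumerate(ranked, start=1):
        related += is_related
        precisions.append((score, related / rank))

    # Second scan (lowest score first): stop at the first precision meeting the target.
    # Assumption: ">=" is used so that a 100% target is attainable.
    for score, precision in reversed(precisions):
        if precision >= target_precision:
            return score
    return None

# Hypothetical reviewed documents as (score, judged related?) pairs.
reviews = [(9.0, True), (8.0, True), (7.0, False), (6.0, True),
           (5.0, True), (4.0, False), (3.0, False)]

threshold_80 = relevance_threshold(reviews, 0.8)   # precision at score 5.0 is 4/5
threshold_100 = relevance_threshold(reviews, 1.0)  # precision at score 8.0 is 2/2
```

A stricter target yields a higher threshold, so fewer documents are reported but a larger share of them is expected to be related. Tied scores are not treated specially in this sketch.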

The excess determination unit 16 determines whether the score 5e exceeds the relevance threshold 6, where the score 5e is calculated, based on the results of the evaluation by the element evaluation unit 13 (keyword information 5c), for a document 1b for which it has not yet been judged whether it is related to the predetermined case. The excess determination unit 16 outputs the result of that determination (determination result 5f) to the data setting unit 17.

When the excess determination unit 16 determines that the threshold is exceeded, the data setting unit 17 sets the document 1b as a document to be reported to the reviewer, for example by flagging each document 1b whose score exceeds the conformance threshold 6. The data setting unit 17 outputs setting information 5g, which identifies the documents so set, to the relationship assigning unit 18.
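A minimal sketch of the excess determination and flagging step follows. The function name `flag_for_review`, the document IDs, the scores, and the threshold value are all made up for illustration; only the strict "exceeds" comparison comes from the text.

```python
def flag_for_review(unreviewed_scores, threshold):
    """Return the IDs of documents whose score exceeds the threshold.

    unreviewed_scores: dict mapping a document ID to its score.
    """
    return [doc_id for doc_id, score in unreviewed_scores.items() if score > threshold]


# Hypothetical scores for three unreviewed documents.
scores = {"mail-001": 0.31, "mail-002": 0.08, "mail-003": 0.11}
print(flag_for_review(scores, 0.110))  # ['mail-001']
```

A document whose score merely equals the threshold is not flagged here, which follows the wording "exceeds".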

The relationship assigning unit 18 assigns, to each document 1b set by the data setting unit 17, relationship information indicating that the document 1b is related to the predetermined case (a review result produced by the document analysis system 100). By outputting the list result 5b to the display unit 50, the relationship assigning unit (display processing unit) 18 can display the documents 1b set by the data setting unit 17 (documents that the document analysis system 100 has judged to be related to the predetermined case) as a list.

When keyword information 5c is input from the element evaluation unit 13, the storage unit 19 stores each keyword contained in the keyword information 5c in the storage unit 30 in association with the result of its evaluation (its weight). This allows the document analysis system 100 to extract data related to the predetermined case by analyzing current data on the basis of the results of analyzing past data (the weights obtained by evaluating the keywords). When the conformance threshold 6 is input from the score specifying unit 15, the storage unit 19 likewise stores it in the storage unit 30.

The input unit (predetermined input unit) 40 accepts input (the review result 5a) from the reviewer. FIG. 1 shows a configuration in which the document analysis system 100 includes the input unit 40 (for example, a keyboard, a mouse, or the like connected as the input unit 40), but the input unit 40 may instead be an external input device (for example, a client terminal) communicably connected to the document analysis system 100.

The receiving unit 20 receives the documents 1a and/or the documents 1b from a network via a communication network conforming to a predetermined communication method. The receiving unit 20 need only provide the essential function of communicating with external devices (for example, terminals used by general users); the communication line, communication method, and communication medium are not limited. The receiving unit 20 can be implemented by a device such as an Ethernet (registered trademark) adapter, and can use communication methods and media such as IEEE 802.11 wireless communication or Bluetooth (registered trademark).

The display unit 50 is a device that displays an interface screen that the reviewer can operate. FIG. 1 shows a configuration in which the document analysis system 100 includes the display unit 50 (for example, a liquid crystal display or the like connected as the display unit 50), but the display unit 50 may instead be an external display device (for example, a client terminal) communicably connected to the document analysis system 100.

The storage unit (predetermined storage unit) 30 is a storage device composed of any recording medium, such as a hard disk, an SSD (solid state drive), a semiconductor memory, or a DVD. It stores the documents 1a, the keyword information 5c, the conformance threshold 6, and/or a control program capable of controlling the document analysis system 100. Although FIG. 1 shows a configuration in which the storage unit 30 is built into the document analysis system 100, the storage unit 30 may instead be an external storage device communicably connected to the document analysis system 100.

[Performance verification of the document analysis system 100]
FIG. 3 is a table showing the minimum score corresponding to the number of documents to which a review result 5a has been given. Part (a) shows the case where the target precision is 100%, and part (b) shows the case where the target precision is 90%.

As illustrated in FIG. 3(a), when 100 documents have been judged by the reviewer as to whether they are related to the predetermined case (see the row of the table whose "number of samples" is "100"), the lowest rank at which the target precision of 100% can be achieved is 11th, and the score corresponding to that rank (the minimum score at which a precision of 100% can be achieved) is 0.110. The document analysis system 100 sets this minimum score as the conformance threshold and regards any document 1b whose score exceeds the threshold as a document for which a precision of 100% can be maintained (that is, a document related to the predetermined case).

To verify the validity of this conformance threshold, documents whose scores exceed the threshold were extracted from 7,994 documents to which reviewers had given a review result 5a. Note that these documents were prepared specially to verify the performance of the document analysis system 100. The documents that the system is actually intended to analyze are the documents 1b, which have not yet been judged as to whether they are related to the predetermined case.

As a result, 766 documents exceeded the conformance threshold, and 605 of them had been given the review result 5a "related to the predetermined case". In other words, it was demonstrated quantitatively that, with review results 5a given to only 100 documents, the document analysis system 100 can extract documents related to the predetermined case from about 8,000 documents with a precision of 79% (605/766 ≈ 0.790).

As FIG. 3(a) shows, the precision of the document analysis system 100 rises and approaches the target precision as the number of documents judged by the reviewer increases (see the "precision" column under "all samples" in the figure). As FIG. 3(b) shows, the same trend holds when the target precision is lowered to 90%.

As described above, provided that reviewer judgments (review results 5a) are given to a part of an arbitrary set of documents whose total number cannot be fixed in advance, the document analysis system 100 can classify most of the remaining documents with high precision. That is, the document analysis system 100 can extract data related to the predetermined case by analyzing current data on the basis of the results of analyzing past data. As a result, the document analysis system 100 can detect, without manual effort, signs of growing legal risk, such as confidential information being leaked to outsiders or bid rigging being proposed to another company.

[Processing executed by the document analysis system 100]
FIG. 4 is a flowchart showing an example of the processing executed by the document analysis system 100. In the following description, a parenthesized "... step" denotes a step included in the control method of the data analysis apparatus.

First, the data extraction unit 11 extracts, from a predetermined document group, a predetermined number of documents 1a to be judged by the reviewer as to whether they are related to the predetermined case (step 1; "step" is hereinafter abbreviated as "S"). Next, the result acquisition unit 12 acquires, via the input unit 40, the reviewer's judgment of whether each document 1a is related to the predetermined case (review result 5a) (S2). Next, the element evaluation unit 13 evaluates, on the basis of a predetermined criterion, each keyword contained in the documents judged by the reviewer (S3). The score calculation unit 14 then calculates, for each document 1a, a score 5d indicating the strength of its relationship to the predetermined case, on the basis of the evaluation result from the element evaluation unit 13 (keyword information 5c) (S4). The score specifying unit 15 identifies, as the conformance threshold 6, the minimum score at which the target value (target precision) can be exceeded (S5, threshold specifying step). The target value is set for the precision, which is the proportion of documents 1a judged to be related to the predetermined case within a document group containing a predetermined number of documents.

Next, the score calculation unit 14 calculates, for each document 1b, a score 5e indicating the strength of its relationship to the predetermined case, on the basis of the evaluation result from the element evaluation unit 13 (keyword information 5c) (S6). The excess determination unit 16 determines whether the score 5e calculated for a document 1b, which has not yet been judged as to whether it is related to the predetermined case, exceeds the conformance threshold 6 (S7). If the threshold is exceeded (YES in S7), the data setting unit 17 sets the document 1b as a document to be reported to the reviewer (S8, data setting step). Finally, the relationship assigning unit 18 assigns, to each document 1b set by the data setting unit 17, relationship information indicating that the document 1b is related to the predetermined case (a review result produced by the document analysis system 100) (S9).

The above control method may optionally include, in addition to the processing described with reference to FIG. 4, processing executed by any of the units included in the control unit 10.

[Score calculation based on co-occurrence]
As described above, the score calculation unit 14 can calculate a score on the basis of the evaluation result for a first keyword contained in a document and the evaluation result for a second keyword contained in the same document. That is, when the first keyword appears in a document, the score calculation unit 14 can calculate the score of the document while taking into account how frequently the second keyword appears in that document (the correlation between the first and second keywords, also called co-occurrence).

In this case, the score calculation unit 14 can calculate the score S according to the following equation (instead of [Equation 1] above), using a correlation matrix (co-occurrence matrix) C that represents the correlation (co-occurrence) between the first keyword and the second keyword.

The correlation matrix C is optimized in advance using a training data set containing a predetermined number of predetermined documents. For example, for documents in which the keyword "price" appears, the numbers of occurrences of the other keywords relative to that keyword, normalized to values between 0 and 1 (that is, maximum likelihood estimates), are stored in the elements of the correlation matrix C. Each column of the correlation matrix C therefore sums to 1.
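One way to build a column-normalized co-occurrence matrix of this kind is sketched below. The function name `cooccurrence_matrix`, the toy documents, and the choice to count co-occurrence at document level while excluding a keyword's co-occurrence with itself are all assumptions made for illustration; the source does not specify how the matrix is optimized.

```python
def cooccurrence_matrix(documents, keywords):
    """Column-normalized co-occurrence matrix C as a list of rows.

    C[j][i] estimates how often keyword j appears in documents that contain
    keyword i, normalized so that each non-empty column sums to 1.
    """
    n = len(keywords)
    counts = [[0] * n for _ in range(n)]
    for doc in documents:
        present = [kw in doc for kw in keywords]
        for i in range(n):
            if not present[i]:
                continue
            for j in range(n):
                if j != i and present[j]:  # assumption: self-pairs are excluded
                    counts[j][i] += 1

    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        column_total = sum(counts[j][i] for j in range(n))
        if column_total == 0:
            continue  # keyword i never co-occurs with another keyword
        for j in range(n):
            matrix[j][i] = counts[j][i] / column_total
    return matrix


# Hypothetical training documents, each reduced to its set of keywords.
keywords = ["price", "adjust", "meeting"]
documents = [
    {"price", "adjust"},
    {"price", "adjust", "meeting"},
    {"price", "meeting"},
    {"meeting"},
]
C = cooccurrence_matrix(documents, keywords)
for row in C:
    print(row)
```

In this toy data the "price" column holds 0.5 for "adjust" and 0.5 for "meeting", since each of them appears in two of the three documents that contain "price".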

As described above, because the document analysis system 100 can calculate scores that take the correlation between keywords into account, it can extract data related to the predetermined case with higher precision.

[Score calculation for each sentence]
As described above, the score calculation unit 14 can calculate a score for each sentence contained in a document. In this case, the score calculation unit 14 generates, for each sentence in the document, a keyword vector indicating whether each predetermined keyword is contained in that sentence. The score calculation unit 14 then calculates a score for each document according to the following equation.

Here, s_s is the keyword vector corresponding to the s-th sentence. Note that the score calculation according to [Equation 4] takes co-occurrence into account (it uses the correlation matrix C).
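The per-sentence keyword vectors can be illustrated as follows. This sketch covers only the vector generation, not [Equation 4] itself. The function name `sentence_keyword_vectors`, the naive punctuation-based sentence splitter, and the case-insensitive substring matching are assumptions; a real system would likely use a proper tokenizer.

```python
import re


def sentence_keyword_vectors(text, keywords):
    """Return one 0/1 keyword vector per sentence of the text."""
    # Naive sentence split on ., !, ? and the Japanese full stop (assumption).
    sentences = [s.strip() for s in re.split(r"[.!?。]", text) if s.strip()]
    vectors = []
    for sentence in sentences:
        lowered = sentence.lower()
        # Element i is 1 if keyword i occurs in this sentence, else 0.
        vectors.append([1 if kw in lowered else 0 for kw in keywords])
    return vectors


text = "We should adjust the price. Let us discuss it at the meeting. See you soon."
print(sentence_keyword_vectors(text, ["price", "adjust", "meeting"]))
# [[1, 1, 0], [0, 0, 1], [0, 0, 0]]
```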

TFnorm can be calculated as shown in [Equation 5] below.

Here, in [Equation 5], TF_i denotes the term frequency of the i-th keyword, s_ji denotes the j-th element of the i-th keyword vector, and c_ji denotes the element in row j, column i of the correlation matrix C.

Combining [Equation 4] and [Equation 5], the score calculation unit 14 calculates the score for each document by evaluating [Equation 6] below.

Here, in [Equation 6], w_i is the i-th element of the weight vector w.

As described above, because the document analysis system 100 can calculate scores that correctly reflect the meaning of each sentence, it can extract data related to the predetermined case with higher precision.

[Phase analysis]
The document analysis system 100 can estimate the phase to which the predetermined case belongs and calculate scores according to that phase. Here, a "phase" is an index that indicates a stage in the progress of the predetermined case (that is, it classifies the case according to how far it has progressed).

For example, suppose that the predetermined case is a fraud case of "bid rigging with other companies", and that the document analysis system 100 aims to detect signs of the fraud case by judging whether e-mails circulating daily on the network are related to it. In that situation the phases may include a "preparation phase" in which information about competition with other companies is gathered, a "relationship-building phase" in which relationships with customers and competitors are built, and a "competition phase" in which a price is presented to a customer, feedback is obtained, and the feedback is discussed with competitors.

Time-series information and generation process information are also stored in the storage unit 30. The "time-series information" indicates the temporal order of the phases. It may be, for example, a time-evolution model indicating that a case moves from the "preparation phase" through the "relationship-building phase" to the "competition phase". The "generation process information" models the process by which each keyword is generated in a given phase. It may be, for example, a multinomial distribution model defined for each phase.

The result acquisition unit 12 acquires, as the review result 5a, both the reviewer's judgment of whether a document 1a is related to the predetermined case and the reviewer's judgment of which phase of the predetermined case the document 1a belongs to. The element evaluation unit 13 evaluates the keywords contained in the documents 1a separately for each phase (that is, it determines a weight for each keyword in each phase).

When calculating a score 5e for a document 1b, the score calculation unit 14 estimates which phase the document 1b is in on the basis of the generation process information. Specifically, it calculates the likelihood of each phase on the basis of the generation process information and takes the phase that maximizes the likelihood as the phase of the document 1b. The score calculation unit 14 then calculates the score of each document 1b using the weights corresponding to the estimated phase. In doing so, the score calculation unit 14 may also use a correlation matrix C corresponding to that phase.
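The maximum-likelihood phase estimate can be sketched as below, assuming the generation process information is a per-phase multinomial distribution over keywords. The function name `estimate_phase`, the phase names, and every probability value are hypothetical; the multinomial coefficient is dropped because it is identical for all phases and does not affect which phase maximizes the likelihood.

```python
import math


def estimate_phase(keyword_counts, phase_models):
    """Return the phase whose multinomial model gives the highest log-likelihood.

    keyword_counts: dict keyword -> number of occurrences in the document.
    phase_models: dict phase name -> dict keyword -> probability (> 0).
    """
    def log_likelihood(model):
        # Sum of count * log(probability) over the keywords in the document.
        return sum(n * math.log(model[kw]) for kw, n in keyword_counts.items())

    return max(phase_models, key=lambda phase: log_likelihood(phase_models[phase]))


# Hypothetical per-phase keyword probabilities (each row sums to 1).
phase_models = {
    "preparation": {"competitor": 0.6, "dinner": 0.3, "price": 0.1},
    "relationship": {"competitor": 0.3, "dinner": 0.6, "price": 0.1},
    "competition": {"competitor": 0.2, "dinner": 0.1, "price": 0.7},
}
print(estimate_phase({"price": 3, "competitor": 1}, phase_models))  # competition
```

Working in log space avoids numerical underflow when a document contains many keyword occurrences.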

The relationship assigning unit 18 can display the documents 1b set by the data setting unit 17 as a list and can also display the estimated phase. In doing so, the relationship assigning unit 18 can predict, on the basis of the time-series information, the likelihood and timing of the estimated phase developing into the next phase, and can display the prediction alongside.

As described above, because the document analysis system 100 can calculate scores accurately according to the phase, it can extract data related to the predetermined case with higher precision.

[Recalculation of weights]
After the documents 1b set by the data setting unit 17 (documents that the document analysis system 100 has judged to be related to the predetermined case) have been displayed as a list by the relationship assigning unit 18, the result acquisition unit 12 can accept feedback on those judgments from the reviewer. That is, the reviewer can enter, as feedback, whether each judgment made by the document analysis system 100 is appropriate. The element evaluation unit 13 can then re-evaluate each keyword on the basis of the feedback.

In other words, the element evaluation unit 13 can recalculate the weights on the basis of newly obtained feedback on the judgments of the document analysis system 100. The document analysis system 100 thereby acquires weights suited to the documents under analysis and can calculate scores accurately from those weights, so it can extract data related to the predetermined case with higher precision.

[Display of correlations between persons and organizations]
The data setting unit 17 can extract the proper nouns that appear in a document 1b (unique data elements such as the names of persons, companies, and places) and estimate the correspondence between a given proper noun (first unique data element) and another proper noun (second unique data element). In this way it can visualize the strength of the connections among a plurality of persons or organizations.

For example, suppose that an e-mail sent from person A to person B is analyzed as a document 1b and is found to contain the sentence "I will contact C myself". The data setting unit 17 extracts "person A", "person B", and "person C", and can display a chart in which arrows run from the node representing "person A" to the node representing "person B" and to the node representing "person C". The data setting unit 17 may draw the chart so that the thickness of each arrow varies with the strength of the correlation between the persons or organizations.
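The chart data in this example could be accumulated as weighted directed edges, as in the sketch below. The function name `build_relation_edges` and the message tuples are hypothetical, and the proper nouns are assumed to have been extracted already (the extraction step itself is not shown). Using the edge count as the arrow thickness is one possible measure of correlation strength, not one stated in the source.

```python
from collections import Counter


def build_relation_edges(messages):
    """Count directed edges between people.

    messages: iterable of (sender, recipient, mentioned_names) tuples, where
    mentioned_names are the proper nouns found in the message body.
    """
    edges = Counter()
    for sender, recipient, mentioned in messages:
        edges[(sender, recipient)] += 1
        for name in mentioned:
            if name != sender:
                edges[(sender, name)] += 1
    return edges


# Hypothetical messages; the first corresponds to the example in the text.
messages = [
    ("A", "B", ["C"]),  # "I will contact C myself"
    ("A", "B", []),
    ("A", "C", ["B"]),
]
for (src, dst), weight in sorted(build_relation_edges(messages).items()):
    print(f"{src} -> {dst}: weight {weight}")
```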

As described above, because the document analysis system 100 can display the correlations between persons and organizations in an easily understood form on the basis of the results of analyzing documents, the parties responsible for a fraud case can be identified without omission.

[Configuration in which a server device provides some or all of the functions]
The description so far has covered a configuration (stand-alone configuration) in which the control program that provides the data analysis functions (the control program of the data analysis apparatus) is executed mainly on the document analysis system 100 (data analysis apparatus). Alternatively, part or all of the control program may be executed on a server device, with the results of the processing returned to the document analysis system 100 (user terminal) (cloud configuration). That is, the data analysis apparatus of the present invention can function as a server device communicably connected to a user terminal via a network. The server device thereby provides the same effects as those obtained when the document analysis system 100 itself provides the functions.

[Example of implementation by software]
The control blocks of the document analysis system 100 (in particular, the control unit 10) may be implemented by a logic circuit (hardware) formed in an integrated circuit (IC chip) or the like, or by software using a CPU (Central Processing Unit). In the latter case, the document analysis system 100 includes a CPU that executes the instructions of a control program, which is the software implementing each function; a ROM (Read Only Memory) or storage device (referred to as a "recording medium") on which the control program and various data are recorded so as to be readable by a computer (or the CPU); a RAM (Random Access Memory) into which the control program is loaded; and so on. The object of the present invention is achieved when the computer (or the CPU) reads the control program from the recording medium and executes it. A "non-transitory tangible medium" such as a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit can be used as the recording medium. The control program may also be supplied to the computer via any transmission medium capable of transmitting it (such as a communication network or a broadcast wave). The present invention can also be realized in the form of a data signal embedded in a carrier wave, in which the control program is embodied by electronic transmission.

The control program can be implemented using, for example, a scripting language such as Python, ActionScript, or JavaScript (registered trademark), an object-oriented programming language such as Objective-C or Java (registered trademark), or a markup language such as HTML5. An analysis system that includes an information processing apparatus (for example, the document analysis system 100) having the units that implement some of the functions realized by the control program, together with a server device having the units that implement the remaining functions, also falls within the scope of the present invention.

[Examples of application to data other than documents]
The document analysis system 100, which analyzes documents, has been described as one embodiment of the data analysis apparatus of the present invention, but the data analysis apparatus can also analyze data other than documents.

For example, the data analysis apparatus of the present invention can also be implemented as a voice analysis system that analyzes speech. In this case, the voice analysis system may either (1) recognize the speech, convert the conversation it contains into text (document data), and process the document data in the same way as the document analysis system 100, or (2) process the audio data directly.

In case (1), the voice analysis system converts the audio data into document data using any speech recognition algorithm (for example, a recognition method based on a hidden Markov model) and performs on the document data the same processing that the document analysis system 100 performs. The voice analysis system thereby provides the same effects as the document analysis system 100.

In case (2), the voice analysis system can classify audio data according to whether it is related to the predetermined case by extracting partial speech segments contained in the audio data. For example, when audio data containing the utterance "adjust the price" is obtained, the voice analysis system extracts the partial speech segments "price" and "adjust" from the audio data and, on the basis of the results of evaluating those segments, can assign relevance information to unclassified audio data. In this case, the voice analysis system can classify the audio data using a classification algorithm for time-series data (for example, a hidden Markov model, a Kalman filter, or a neural network). The voice analysis system thereby provides the same effects as the document analysis system 100.

Alternatively, the data analysis apparatus of the present invention can be implemented as a video analysis system that analyzes video (moving images). In this case, the video analysis system can extract frame images contained in the video data and identify the persons appearing in them using any face recognition technique. The video analysis system can also extract the motions of those persons from a partial video (a video containing some of the frame images contained in the video) using any motion recognition technique (for example, one that applies pattern matching). The video analysis system can then classify the video data on the basis of the persons and/or motions. The video analysis system thereby provides the same effects as the document analysis system 100.

In other words, the data analysis apparatus of the present invention can analyze digital data in which information unfolds in time series (documents, voice, video, and the like). By analyzing current data based on the results of analyzing past data (documents, voice, video, and the like), the data analysis apparatus can extract data related to a predetermined case; for example, it can detect signs that misconduct is about to occur.

[Additional Notes]
The present invention is not limited to the embodiments described above. Various modifications can be made within the scope of the claims, and embodiments obtained by appropriately combining technical means disclosed in different embodiments are also included in the technical scope of the present invention. Furthermore, new technical features can be formed by combining the technical means disclosed in the respective embodiments.

For example, the present invention can also be expressed as follows: a data analysis apparatus comprising an element evaluation unit that evaluates, based on a predetermined criterion, each data element included in data that the user has judged to be related or unrelated to a predetermined case; a score calculation unit that calculates, for each piece of data and based on the results of the evaluation by the element evaluation unit, a score indicating the strength of its relationship with the predetermined case; a score specifying unit that specifies, as a relevance threshold, the minimum score at which the precision can exceed a target value set for it, the precision being the proportion of data judged to be related to the predetermined case within a data group containing a predetermined number of data; an excess determination unit that determines whether a score, calculated based on the results of the evaluation by the element evaluation unit for data not yet judged to be related or unrelated to the predetermined case, exceeds the relevance threshold; and a data setting unit that, when the excess determination unit determines that the score exceeds the threshold, sets that data as data to be reported to the user.
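As an illustration only, the element evaluation unit and the score calculation unit described above might be sketched as follows in Python. The function names `evaluate_elements` and `score_document`, the representation of each document as a set of keywords, and the smoothed log-odds weight are all assumptions made for this sketch; the text above only requires that data elements be evaluated "based on a predetermined criterion".

```python
import math
from collections import Counter


def evaluate_elements(reviewed):
    """Assign a weight to every keyword seen in user-reviewed documents.

    reviewed: list of (keywords, is_relevant) pairs, where keywords is an
    iterable of strings and is_relevant is the user's judgment.
    The weight is a smoothed log-odds ratio (an assumed criterion).
    """
    n_rel = sum(1 for _, rel in reviewed if rel)
    n_irr = len(reviewed) - n_rel
    rel_counts, irr_counts = Counter(), Counter()
    for keywords, rel in reviewed:
        # Count each keyword at most once per document.
        (rel_counts if rel else irr_counts).update(set(keywords))

    weights = {}
    for kw in set(rel_counts) | set(irr_counts):
        # Add-one smoothing keeps both ratios strictly between 0 and 1.
        p_rel = (rel_counts[kw] + 1) / (n_rel + 2)
        p_irr = (irr_counts[kw] + 1) / (n_irr + 2)
        weights[kw] = math.log(p_rel / p_irr)
    return weights


def score_document(keywords, weights):
    """Score a document as the sum of the weights of its known keywords."""
    return sum(weights.get(kw, 0.0) for kw in set(keywords))
```

In this sketch, keywords that appear mostly in documents judged relevant receive positive weights, keywords that appear mostly in documents judged irrelevant receive negative weights, and keywords never seen during review contribute nothing to the score.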

Alternatively, the present invention can be expressed as follows: a data analysis apparatus comprising a score specifying unit that specifies, among the scores calculated as an index of the strength of the relationship with a predetermined case for each piece of already-determined data (data that the user has judged to be related or unrelated to the predetermined case), the minimum score at which the precision can exceed a target value set for it; an excess determination unit that, when undetermined data (data not yet judged to be related or unrelated to the predetermined case) is newly acquired, determines whether the score calculated for the undetermined data exceeds the minimum score specified by the score specifying unit; and a data setting unit that, when the excess determination unit determines that the score exceeds the minimum score, sets the undetermined data as data to be reported to the user.
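One possible reading of the score specifying unit and the excess determination unit is sketched below in Python. The names `find_relevance_threshold` and `should_report` are hypothetical, and the sketch assumes that the "data group" at a candidate threshold consists of all already-determined data whose score is at least that threshold; the text does not fix this detail.

```python
def find_relevance_threshold(scored, target_precision):
    """Return the lowest score whose precision exceeds the target.

    scored: list of (score, is_relevant) pairs for already-determined data.
    Precision at a candidate score s is taken to be the share of relevant
    items among all items scoring at least s (an assumption of this sketch).
    Returns None when no candidate score exceeds the target.
    """
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    threshold = None
    relevant = 0
    for i, (s, rel) in enumerate(ranked, start=1):
        relevant += bool(rel)
        # Evaluate precision only once every item tied at this score is counted.
        if i < len(ranked) and ranked[i][0] == s:
            continue
        if relevant / i > target_precision:
            threshold = s  # later matches are lower scores, so keep the last
    return threshold


def should_report(score, threshold):
    """Excess determination: report only scores strictly above the threshold."""
    return threshold is not None and score > threshold
```

Because precision is not monotonic in the threshold, the sketch scans every candidate score rather than stopping at the first failure. Whether the comparison in `should_report` should be strict is a design choice; the strict form follows the wording "exceeds" used above.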

Alternatively, the present invention can be expressed as follows: a data analysis apparatus comprising an information specifying unit that, when undetermined data (data not yet judged to be related or unrelated to a predetermined case) is newly acquired, specifies basic information serving as the basis for that judgment on the undetermined data from already-determined data that the user has judged to be related or unrelated to the predetermined case; and a data setting unit that sets the undetermined data as data to be reported to the user based on the basic information specified by the information specifying unit.

In the above data analysis apparatus, the score calculation unit may calculate, based on the results of the evaluation by the element evaluation unit, a score indicating the strength of the relationship with the predetermined case for each sentence included in each document that the user has judged to be related or unrelated to the predetermined case, and the excess determination unit may determine whether the score, calculated based on the results of the evaluation by the element evaluation unit for each sentence included in each document not yet judged, exceeds the relevance threshold.
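The sentence-level variant could be sketched as follows in Python. The names `sentence_scores` and `sentences_to_report`, the punctuation-based sentence splitter, and the word tokenizer are assumptions made for illustration; in particular, the tokenizer assumes whitespace-delimited text and would need a morphological analyzer for Japanese documents.

```python
import re


def sentence_scores(text, weights):
    """Split text into sentences and score each one by keyword weights.

    weights: dict mapping a lowercase keyword to its evaluated weight.
    Returns a list of (sentence, score) pairs in document order.
    """
    results = []
    # Naive splitter: a run of non-terminator characters plus one terminator.
    for sentence in re.findall(r"[^.!?。]+[.!?。]?", text):
        sentence = sentence.strip()
        if not sentence:
            continue
        tokens = set(re.findall(r"\w+", sentence.lower()))
        results.append((sentence, sum(weights.get(t, 0.0) for t in tokens)))
    return results


def sentences_to_report(text, weights, threshold):
    """Return the sentences whose score exceeds the relevance threshold."""
    return [s for s, sc in sentence_scores(text, weights) if sc > threshold]
```

Scoring per sentence rather than per document lets a reviewer see which passage caused a document to be flagged, at the cost of losing context that spans sentence boundaries.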

The present invention is widely applicable to personal computers, server devices, mainframes, workstations, and other electronic devices.

1a: document (already-determined data), 1b: document (undetermined data), 5a: review result (result of the user's determination), 5d: score, 5e: score, 6: relevance threshold, 11: data extraction unit (already-determined data acquisition unit), 12: result acquisition unit (already-determined data acquisition unit), 13: element evaluation unit, 14: score calculation unit, 15: score specifying unit (threshold specifying unit), 16: excess determination unit, 17: data setting unit, 18: relation adding unit, 19: storing unit, 30: storage unit (predetermined storage unit), 40: input unit (predetermined input unit), 100: document analysis system (data analysis apparatus)

Claims

A data analysis apparatus comprising:
an already-determined data acquisition unit that acquires a plurality of already-determined data, each of which the user has determined to be related or unrelated to a predetermined case;
an element evaluation unit that evaluates, for each of the plurality of already-determined data, the data elements included in the already-determined data according to the dependency between each data element and the result of the user's determination on the already-determined data;
a score calculation unit that calculates, based on the evaluated data elements, a score indicating the strength of the relationship between each of the plurality of already-determined data and the predetermined case;
an undetermined data acquisition unit that newly acquires undetermined data, for which the user has not determined whether it is related to the predetermined case;
a threshold specifying unit that specifies a threshold for the calculated score as an index for determining whether the undetermined data is related to the predetermined case,
wherein the threshold specifying unit specifies as the threshold, among the scores calculated for the plurality of already-determined data, the minimum score at which the precision can exceed a target value set for it, the precision being the proportion of already-determined data determined to be related to the predetermined case within a data group containing a predetermined number of data, and
the score calculation unit calculates a score indicating the strength of the relationship between the acquired undetermined data and the predetermined case, using, among the evaluation results of the evaluated data elements, the evaluation results of the data elements corresponding to the data elements included in the acquired undetermined data;
an excess determination unit that determines whether the score calculated for the acquired undetermined data exceeds the threshold; and
a data setting unit that sets undetermined data whose score is determined to exceed the threshold as data to be reported to the user.

The data analysis apparatus according to claim 1, wherein the element evaluation unit evaluates the data elements by calculating, for each of the plurality of already-determined data, a weight of each data element included in the already-determined data using a transmitted information amount that represents the dependency between the data element and the result of the user's determination on the already-determined data.

The data analysis apparatus, further comprising a result acquisition unit that acquires from the user, via a predetermined input unit, the result of the user's determination as to whether the data set by the data setting unit is related to the predetermined case,
wherein the element evaluation unit evaluates each data element included in the data set by the data setting unit based on the result acquired by the result acquisition unit.

4. The data analysis apparatus according to any one of the above claims, further comprising a storing unit that associates each data element evaluated by the element evaluation unit with the result of evaluating that data element and stores them in a predetermined storage unit.

The data analysis apparatus according to claim 1, wherein the undetermined data includes unique data elements each capable of identifying one of a plurality of persons or organizations, and
the data setting unit extracts each of the unique data elements from the undetermined data and visualizes the strength of connection between the plurality of persons or organizations by estimating a correspondence between a first unique data element and a second unique data element different from the first unique data element.

The data analysis apparatus according to claim 1, wherein the already-determined data acquisition unit acquires the plurality of already-determined data via a predetermined input unit.

The data analysis apparatus according to any one of the above claims, further comprising a relation adding unit that adds, to the data set by the data setting unit, relation information indicating that the data is related to the predetermined case.

The data analysis apparatus according to claim 1, wherein the data is a document digitized so that it can be processed by a computer, and
the data element is a keyword included in the document.

The data analysis apparatus according to claim 1, wherein the data is voice digitized so that it can be processed by a computer, and
the data element is a partial voice included in the voice.

A method for controlling a data analysis apparatus, comprising:
an already-determined data acquisition step in which an already-determined data acquisition unit acquires a plurality of already-determined data, each of which the user has determined to be related or unrelated to a predetermined case;
an element evaluation step in which an element evaluation unit evaluates, for each of the plurality of already-determined data, the data elements included in the already-determined data according to the dependency between each data element and the result of the user's determination on the already-determined data;
a score calculation step in which a score calculation unit calculates, based on the evaluated data elements, a score indicating the strength of the relationship between each of the plurality of already-determined data and the predetermined case;
an undetermined data acquisition step in which an undetermined data acquisition unit newly acquires undetermined data, for which the user has not determined whether it is related to the predetermined case;
a threshold specifying step in which a threshold specifying unit specifies a threshold for the calculated score as an index for determining whether the undetermined data is related to the predetermined case,
wherein, in the threshold specifying step, the threshold specifying unit specifies as the threshold, among the scores calculated for the plurality of already-determined data, the minimum score at which the precision can exceed a target value set for it, the precision being the proportion of already-determined data determined to be related to the predetermined case within a data group containing a predetermined number of data, and
in the score calculation step, the score calculation unit calculates a score indicating the strength of the relationship between the acquired undetermined data and the predetermined case, using, among the evaluation results of the evaluated data elements, the evaluation results of the data elements corresponding to the data elements included in the acquired undetermined data;
an excess determination step in which an excess determination unit determines whether the score calculated for the acquired undetermined data exceeds the threshold; and
a data setting step in which a data setting unit sets undetermined data whose score is determined to exceed the threshold as data to be reported to the user.

A control program for a data analysis apparatus, the control program causing a computer to realize:
an already-determined data acquisition function that acquires a plurality of already-determined data, each of which the user has determined to be related or unrelated to a predetermined case;
an element evaluation function that evaluates, for each of the plurality of already-determined data, the data elements included in the already-determined data according to the dependency between each data element and the result of the user's determination on the already-determined data;
a score calculation function that calculates, based on the evaluated data elements, a score indicating the strength of the relationship between each of the plurality of already-determined data and the predetermined case;
an undetermined data acquisition function that newly acquires undetermined data, for which the user has not determined whether it is related to the predetermined case;
a threshold specifying function that specifies a threshold for the calculated score as an index for determining whether the undetermined data is related to the predetermined case,
wherein the threshold specifying function specifies as the threshold, among the scores calculated for the plurality of already-determined data, the minimum score at which the precision can exceed a target value set for it, the precision being the proportion of already-determined data determined to be related to the predetermined case within a data group containing a predetermined number of data, and
the score calculation function calculates a score indicating the strength of the relationship between the acquired undetermined data and the predetermined case, using, among the evaluation results of the evaluated data elements, the evaluation results of the data elements corresponding to the data elements included in the acquired undetermined data;
an excess determination function that determines whether the score calculated for the acquired undetermined data exceeds the threshold; and
a data setting function that sets undetermined data whose score is determined to exceed the threshold as data to be reported to the user.