JP5946949B1 - DATA ANALYSIS SYSTEM, ITS CONTROL METHOD, PROGRAM, AND RECORDING MEDIUM

Info

Publication number: JP5946949B1
Application number: JP2015238978A
Authority: JP
Inventors: 秀樹武田; 和巳蓮子
Original assignee: Ubic Inc
Current assignee: Ubic Inc
Priority date: 2015-12-07
Filing date: 2015-12-07
Publication date: 2016-07-06
Anticipated expiration: 2035-12-07
Also published as: JP2017107302A

Abstract

Problem: To provide a data analysis system that can accurately find data related to a predetermined case within an enormous amount of data. Solution: The data analysis system extracts a plurality of components from learning data, each of the components constituting at least part of the learning data; selects, from among the extracted components, the components to be used for evaluating a plurality of evaluation data, based on the manner in which the extracted components are distributed in the learning data; and evaluates the evaluation data based on the selected components. Selected drawing: FIG. 4

Description

The present invention relates to a data analysis system and the like for analyzing data, and is applicable, for example, to a system equipped with artificial intelligence that analyzes big data.

As the rapid development of computers has advanced the computerization of society, enormous amounts of information (big data) have become broadly and closely tied to the activities of companies and individuals. For this reason, particular importance has recently been placed on accurately sorting out desired information from big data.

As one approach to extracting desired information from big data, a system is known that applies reviewer-performed data analysis to a portion of the data sampled from a data set and uses the results of that analysis to analyze the remaining data automatically (for example, JP 2013-182338 A).

JP 2013-182338 A

The data analysis system described above can find data related to a predetermined case within an enormous amount of data. However, data whose relevance to the predetermined case is not actually high may be evaluated as highly relevant, or the reverse may occur. Accordingly, an object of the present invention is to provide a system that can accurately find data related to a predetermined case within an enormous amount of data.

The above object is achieved by a data analysis system for analyzing data, comprising: a memory that at least temporarily stores a plurality of evaluation data to be analyzed; and a controller that evaluates the plurality of evaluation data based on learning data, wherein the controller extracts a plurality of components from the learning data, each of the plurality of components constituting at least part of the learning data; selects, from among the extracted components, the components to be used for evaluating the plurality of evaluation data, based on the manner in which the extracted components are distributed in the learning data; and evaluates the evaluation data based on the selected components. A control method for the data analysis system, a program therefor, and a recording medium are also provided.

The above disclosure provides a data analysis system and the like that can accurately find data related to a predetermined case within an enormous amount of data.

FIG. 1 is a block diagram showing an example of the hardware configuration of the data analysis system.
FIG. 2 is a diagram explaining the arrangement of components in the learning data.
FIG. 3 is a characteristic diagram showing the distribution of the evaluation value of each of a plurality of components against the appearance position of each component in the learning data.
FIG. 4 is a flowchart of the controller of the server device 2, explaining the evaluation of evaluation data according to the first embodiment.
FIG. 5 is an operation flowchart of the controller of the server device 2, explaining the evaluation of evaluation data according to the second embodiment.
FIG. 6 is a control table for the processing that integrates component groups.

Next, embodiments of the data analysis system will be described with reference to the accompanying drawings.

[Data analysis system configuration]
FIG. 1 is a block diagram showing an example of the hardware configuration of a data analysis system according to the present embodiment (hereinafter sometimes simply abbreviated as the "system"). The system may be realized, for example, as a computer or computer system (a system in which a plurality of computers operate in an integrated manner to perform data analysis) that includes an arbitrary recording medium (for example, a memory or a hard disk) capable of storing data (including digital data and analog data) and a controller (for example, a CPU: Central Processing Unit) capable of executing a control program stored in the recording medium, and that analyzes data stored at least temporarily in the recording medium.

In the present embodiment, "learning data" (training data) may be, for example, data that has been presented to the user as reference data and with which classification information has been associated (classified reference data; a combination of reference data and classification information). Learning data may also be called "teacher data" or "training data". "Evaluation data" may be data with which no such classification information has been associated (unclassified data that has not been presented to the user as reference data and that the user has not classified; it may also be called "unknown data"). Here, the "classification information" may be an identification label used to classify the reference data in any desired way. For example, it may be information that sorts the reference data into any number of groups (for example, two), such as a "Related" label indicating that the reference data is related to a predetermined case and a "Non-Related" label indicating that it is not. The predetermined case broadly covers anything whose relevance to data the system evaluates, and its scope is not limited.

As illustrated in FIG. 1, the system may include, for example, a server device (server computer) 2 capable of executing the main processing of the data analysis; one or more client devices (client computers) 3 capable of executing processing related to the data analysis; a storage system 5 including a database 4 that records data and the evaluation results for that data; and a management computer 6 that provides management functions for the data analysis to the client device 3 and the server device 2. As hardware resources, each device may include, for example, a memory, a controller, a bus, an input/output interface (for example, a keyboard and a display), and a communication interface (which connects the devices so that they can communicate over a predetermined network), although the resources are not limited to these examples. The server device 2 includes a (non-transitory) storage medium on which the programs and data needed for the data analysis are recorded, for example, a hard disk, flash memory, DVD, CD, or BD.

The client device 3 presents part of the data to the user as reference data. The user can then enter, via the client device 3, an evaluation or classification of the reference data (that is, supply classification information). Based on the combination of reference data and classification information (the learning data), the server device 2 learns patterns from the data and, based on the learned patterns, evaluates the relevance between the evaluation data and the predetermined case. Here, "pattern" refers broadly to, for example, abstract rules, meanings, concepts, styles, distributions, and samples contained in the data, and is not limited to a so-called "specific pattern".

The management computer 6 executes predetermined management processing for the client device 3, the server device 2, and the storage system 5. The storage system 5 may consist of, for example, a disk array system, and may include the database 4 that records data together with the results of evaluating and classifying that data. The server device 2 and the storage system 5 are communicably connected by a DAS (Direct Attached Storage) arrangement or by a SAN (Storage Area Network).

Note that the hardware configuration shown in FIG. 1 is merely an example, and the system can also be realized with other hardware configurations. For example, part or all of the processing executed in the server device 2 may instead be executed in the client device 3, part or all of that processing may be executed in the server device 2, or the storage system 5 may be built into the server device 2. In addition, the user can enter an evaluation or classification of sample data (supply classification information) not only via the client device 3 but also via an input device connected directly to the server device 2. Those skilled in the art will understand that many different hardware configurations can realize the system, and the system is not limited to any one particular configuration (for example, the configuration illustrated in FIG. 1).

[Data evaluation function]
The system can have a data evaluation function. This function evaluates a large amount of evaluation data (big data) based on a small amount of manually classified data (learning data). With this function, the system can perform the evaluation by, for example, deriving an index that indicates how strongly each item of evaluation data is related to the predetermined case. The index may be a numerical value that allows the evaluation data to be ranked (for example, a score), a character string (for example, "high", "medium", "low"), a symbol (for example, "◎", "○", "△", "×"), or a combination of these. The data evaluation function is realized by the controller of the server device 2.

When the system derives a score as the index for the evaluation, it can calculate the score by any method. For example, the score may be calculated with any of the techniques used in machine learning or natural language processing (for example, the k-nearest neighbor method, a method using a support vector machine, a method using a neural network, a method that assumes a statistical model for the data such as one using a Gaussian process, or a combination of these), or with techniques used in statistics (for example, based on how often a component appears in the data).

A "component" may be partial data that constitutes at least part of the data. For example, it may be a morpheme, keyword, sentence, paragraph, and/or metadata (for example, e-mail header information) constituting a document; a partial sound, volume (gain) information, and/or timbre information constituting audio; a partial image, partial pixels, and/or luminance information constituting an image; or a frame image, motion information, and/or three-dimensional information constituting video.

When the system calculates the score based on how often components appear in the data, the following calculation method is one possibility. First, the system extracts from the learning data the components that make it up, and evaluates those components. In doing so, the system evaluates, for example, the degree to which each of the components constituting at least part of the learning data contributes to the combination of data and classification information (in other words, how often the component appears in accordance with the classification information). As a more specific example, the system evaluates each component using an amount of transmitted information (for example, an amount of information calculated with a predetermined formula from the appearance probability of the component and the appearance probability of the classification information), and calculates an evaluation value, which serves as the evaluation information for the component, according to Formula 1 below.

Here, wgt_{i,0} denotes the initial value of the evaluation value of the i-th component before evaluation, and wgt_{i,L} denotes the evaluation value of the i-th component after the L-th evaluation. γ_L is the evaluation parameter for the L-th evaluation, and θ is the threshold used in the evaluation. In this way, the system can, for example, treat a component as more strongly representing the characteristics of given classification information the larger its calculated amount of transmitted information is.
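As a rough illustration of evaluating a component by an amount of transmitted information, the sketch below computes the mutual information between the presence of a component and the classification label. It is not Formula 1; it is a generic sketch that assumes documents are represented as sets of components, and the function name `transmitted_information` and the toy data are hypothetical.

```python
import math
from collections import Counter

def transmitted_information(docs, labels, component):
    """Mutual information (in bits) between presence of `component` and the label.

    Assumption for illustration: each document is a set of components and
    each label is a string such as "Related" or "Non-Related".
    """
    n = len(docs)
    # Joint counts of (component present?, label).
    joint = Counter((component in doc, label) for doc, label in zip(docs, labels))
    count_x = Counter()  # marginal counts for presence/absence
    count_y = Counter()  # marginal counts for each label
    for (x, y), c in joint.items():
        count_x[x] += c
        count_y[y] += c
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return mi

# Hypothetical learning data: "a" appears only in Related documents,
# while "c" appears equally often under both labels.
docs = [{"a", "b"}, {"a", "c"}, {"c", "d"}, {"d", "e"}]
labels = ["Related", "Related", "Non-Related", "Non-Related"]
info_a = transmitted_information(docs, labels, "a")
info_c = transmitted_information(docs, labels, "c")
```

Under these assumptions, a component that tracks the label closely ("a") receives a larger value than one that appears regardless of the label ("c"), which matches the idea that a larger amount of transmitted information marks a more characteristic component.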

Next, the system associates each component with its evaluation value and stores both in an arbitrary memory (for example, the storage system 5). The system then extracts components from the evaluation data and checks whether each extracted component is stored in the memory. If it is, the system reads the evaluation value associated with that component from the memory and evaluates the evaluation data based on that value. As a more specific example, the system can calculate the score by computing an expression over the following quantities, using the evaluation values associated with the components that constitute at least part of the evaluation data.
m_i: appearance frequency of the i-th component
wgt_i: evaluation value of the i-th component
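The lookup-and-score step can be sketched as follows. The scoring expression itself is not reproduced in this text, so the sketch assumes the simplest combination of the two quantities, a sum of appearance frequency times evaluation value; the function name `score_document` and the stored values are hypothetical.

```python
from collections import Counter

def score_document(tokens, weights):
    """Sum of (appearance frequency x evaluation value) over stored components.

    Assumption for illustration: `weights` maps each stored component to its
    evaluation value, and components without a stored value are ignored.
    """
    freq = Counter(tokens)  # m_i: how often each component appears
    return sum(m * weights[c] for c, m in freq.items() if c in weights)

# Hypothetical stored evaluation values and one item of evaluation data.
weights = {"a": 0.5, "b": 2.0}
tokens = ["a", "x", "b", "a"]
score = score_document(tokens, weights)  # "x" has no stored value and is skipped
```

A plain dictionary stands in for the memory that holds the component-to-value associations; a real system would read these from the storage system.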

The server device 2 may be configured to continue (repeat) extracting and evaluating components until the recall reaches a predetermined target value. Recall is an index of comprehensiveness: the proportion of the data that should be found that is contained in a given portion of the data. For example, a recall of 80% at 30% of all data means that 80% of the data that should be found as related to the predetermined case is contained in the top 30% of the data ranked by the index (score). When a person reviews the data exhaustively without a data analysis system (linear review), the amount of relevant data found is proportional to the amount reviewed. The further the system departs from this proportional relationship, the better its data analysis performance.
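The recall figure described above can be computed as in the following sketch, which reproduces the "80% recall at 30% of the data" example on toy data. The function name `recall_at_percent` and the data are hypothetical.

```python
def recall_at_percent(scores, relevant, percent):
    """Share of relevant items found within the top `percent` of items by score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = len(scores) * percent // 100  # number of items in the top slice
    found = sum(1 for i in ranked[:k] if relevant[i])
    total = sum(relevant)
    return found / total if total else 0.0

# Hypothetical example: 20 items already in descending score order,
# 5 of which are relevant; 4 of those 5 fall in the top 30% (6 items).
scores = list(range(20, 0, -1))
relevant = [True, True, False, True, False, True] + [False] * 13 + [True]
recall = recall_at_percent(scores, relevant, 30)
```

With a random ordering (the linear review case), the expected recall at 30% of the data would be about 30%, so the gap between the two numbers is what indicates the benefit of ranking.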

The implementation of the data evaluation function described above is merely one example. As long as the function "evaluates evaluation data based on learning data", its concrete form is not limited to any one particular configuration (for example, the score calculation method described above).

[Component optimization]
As described above, the evaluation of evaluation data uses, for example, the evaluation values of components extracted from the learning data. In that case, even components with low evaluation values can cause an item of evaluation data to be rated highly if many of them are contained in it, regardless of how relevant that item truly is to the predetermined case.

In the present embodiment, therefore, the system optimizes the components by, for example, selecting (or determining, extracting, and so on) from among the components extracted from the learning data those to be used for evaluating the evaluation data, based on the manner in which the extracted components are distributed in the learning data, and then evaluates the evaluation data based on the selected components. This allows the system to accurately judge, determine, or classify the relevance between the evaluation data and the predetermined case. The components that were not selected may all be excluded from the evaluation of the evaluation data, or some of them may be used for the evaluation while the rest are not. The server device 2 may, for example, evaluate the evaluation data using the evaluation values of the selected components as they are, re-evaluate the selected components and then evaluate the evaluation data, or adjust the evaluation values of the selected components (for example, by increasing them) before evaluating the evaluation data.

As described above, the server device 2 uses the manner in which the extracted components are distributed in the learning data in order to select components. For example, based on that distribution, it can select, from among the components extracted from the learning data, a plurality of components that are present in the learning data in a predetermined positional relationship. Preferably, it uses the distribution of the evaluation value of each component against the appearance position of each component in the learning data. This is described in detail below.

FIG. 2 shows an example of learning data. Each letter such as a, b, and c corresponds to a component, and each "・" is a word, such as a particle or an adverb, that was not extracted as a component. FIG. 3 shows the distribution of the evaluation value of each component against its appearance position in the learning data. The vertical axis is the evaluation value of the component, and the horizontal axis is the appearance position of the component in the learning data. Each bar represents the evaluation value of one component. Smoothing the evaluation values of the components, for example with a Gaussian filter, yields the characteristic indicated by reference numeral 100.
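The smoothing step can be sketched as below. This is a minimal pure-Python Gaussian filter, assuming the evaluation values are given as a list ordered by appearance position; the function name `gaussian_smooth` and the parameters `sigma` and `radius` are hypothetical, since the source names only the filter type.

```python
import math

def gaussian_smooth(values, sigma=1.0, radius=3):
    """Smooth a sequence of evaluation values with a truncated Gaussian kernel.

    Near the ends of the sequence, the kernel is renormalized over the
    positions that actually exist, so a constant input stays constant.
    """
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    smoothed = []
    for i in range(len(values)):
        num = 0.0
        den = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(values):
                num += w * values[j]
                den += w
        smoothed.append(num / den)
    return smoothed

# Hypothetical evaluation values by appearance position (the bars of a chart).
bars = [0.1, 0.2, 0.9, 0.3, 0.1, 0.0, 0.4, 0.8, 0.5]
curve = gaussian_smooth(bars, sigma=1.0)
```

A larger `sigma` spreads the influence of each high-valued component over more of its neighbors, which produces fewer and broader peaks in the smoothed characteristic.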

The characteristic 100 makes it possible to visualize the relative merit of the components contained in the learning data (for example, how high or low their evaluation values are). A component located at a peak (102A to 102I) is an element that strongly characterizes the combination of data and classification information (for example, an element highly relevant to the predetermined case). Other components that have a predetermined positional relationship with that component (here called the "specific component"), for example components located near it, may also become highly relevant to the predetermined case under the influence of the specific component at the peak (in other words, because they take on a meaning or significance related to the specific component).

The server device 2 therefore selects components centered on the peaks of the evaluation values in, for example, the distribution of component evaluation values against component appearance positions in the learning data. For example, the server device 2 selects the component corresponding to a peak together with the components that appear immediately before and after it as a "component group". A component group is, for example, a set of components that appear adjacent to one another in the learning data, gathered into a single group. In FIG. 3, the regions enclosed in brackets [ ] indicate component groups. For example, if a, b, and c appear in the evaluation data in the order "a・・b・・c" and the evaluation value peaks at b, the component group may be defined as "a, b, c". Words that carry no meaning and lie between the components (the "・" mentioned above) need not be taken into account in the component group.
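Peak-centered selection of component groups can be sketched as follows. The sketch assumes a simple local-maximum rule for peaks and a fixed number of neighbors on each side; the source does not specify either, and the names `find_peaks`, `component_groups`, and `window` are hypothetical.

```python
def find_peaks(curve):
    """Indices whose value is strictly higher than both neighbors."""
    return [i for i in range(1, len(curve) - 1)
            if curve[i - 1] < curve[i] > curve[i + 1]]

def component_groups(components, curve, window=1):
    """Group each peak component with `window` neighbors on each side.

    Assumption for illustration: `components` lists the extracted components
    in order of appearance, with non-component words already removed, and
    `curve` holds the (smoothed) evaluation value at each of those positions.
    """
    groups = []
    for p in find_peaks(curve):
        lo = max(0, p - window)
        hi = min(len(components), p + window + 1)
        groups.append(components[lo:hi])
    return groups

# Hypothetical components and smoothed evaluation values; peaks at "b" and "f".
components = ["a", "b", "c", "d", "e", "f", "g"]
curve = [0.2, 0.9, 0.3, 0.1, 0.4, 0.8, 0.5]
groups = component_groups(components, curve)
```

Here the group around the peak at "b" is "a, b, c", mirroring the example in the text; one group is produced per peak.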

As FIG. 3 shows, there can be more than one peak, so there can be as many component groups as there are peaks. To evaluate the evaluation data, the server device 2 may use all of the component groups, or only some of them chosen, for example, according to the magnitude of the peak evaluation values.

The server device 2, for example, selects the components contained in the component groups from among the components contained in the learning data, and evaluates the evaluation data based on the selected components. In doing so, the server device 2 may, for example, raise the evaluation value of an item of evaluation data more when the difference in appearance position (the distance) between the components of a component group is small in that item than when it is large. It may also raise the evaluation value more when a plurality of components appear in the evaluation data in a way that forms a group than when they do not.
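One way to reward components of a group that appear close together is sketched below. The source does not give a formula for this adjustment, so the bonus definition here is purely an assumption for illustration, and the function name `proximity_bonus` is hypothetical.

```python
def proximity_bonus(tokens, group):
    """Bonus that grows as the group's components sit closer together.

    Assumption for illustration: returns 0.0 when any component of the group
    is missing from the evaluation data, 1.0 when the components are adjacent,
    and a smaller positive value as they spread apart. Only the first
    occurrence of each component is considered, for simplicity.
    """
    positions = []
    for comp in group:
        if comp not in tokens:
            return 0.0
        positions.append(tokens.index(comp))
    span = max(positions) - min(positions)
    return (len(group) - 1) / span if span else 1.0

group = ["a", "b", "c"]
tight = proximity_bonus(["a", "b", "c"], group)                      # adjacent
loose = proximity_bonus(["a", ".", ".", "b", ".", ".", "c"], group)  # spread out
absent = proximity_bonus(["a", "b"], group)                          # "c" missing
```

Such a bonus could be multiplied into or added to the base score of an item of evaluation data, so that tightly clustered group members raise its evaluation more than scattered ones.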

[Evaluation of evaluation data by server device 2]
The operation by which the server device 2 evaluates evaluation data will now be described. FIG. 4 is a flowchart of the controller of the server device 2. The controller acquires one or more items of data, as reference data, from the evaluation data recorded in the storage system 5 (step S300: reference data acquisition). Each step may also be described as a module or a means.

Next, the user actually reviews the reference data and decides its classification, and the controller acquires, from an arbitrary input device, the classification information that the user entered for the reference data (S302: classification information acquisition). The controller forms the learning data by combining the reference data with the classification information, and extracts components from the learning data (S304: component extraction).

The controller then evaluates the components (S306: component evaluation), associates each component with its evaluation value, and stores both in the storage system 5 (S308: component storage module). The processing of S300 to S308 corresponds to the "learning phase" (the phase in which the artificial intelligence learns patterns). The learning data may also be prepared in advance instead of being created from reference data. For example, when searching for publicly known documents with which to invalidate a patent, the learning data is the combination of the text of the patent's claims and the "Related" label.

For the components extracted from the learning data, the controller creates the distribution of component evaluation values against component appearance positions (FIG. 3) (S310: component distribution creation) and, as described above, determines the peaks of the component evaluation values from the distribution (S312: distribution processing). Based on the determined peaks, the controller then selects component groups (S314: component group selection) and records the components belonging to the selected component groups, together with their evaluation values, in the storage system 5.

Next, the controller acquires evaluation data from the storage system 5 (S316: evaluation data acquisition). The controller also reads the components and their evaluation values from the storage system 5 and extracts those components from the evaluation data (S318: component extraction). The controller evaluates the evaluation data based on the evaluation values associated with the components (S320: evaluation data evaluation) and creates ordering information (a ranking) for the plurality of evaluation data. The higher an item of evaluation data ranks, the more relevant it is to the predetermined case. The processing from S310 onward constitutes the evaluation phase, as opposed to the learning phase. Note that each process in the flowchart described above is an example and does not represent a limiting form.

According to the embodiment described above, the evaluation data can be evaluated using components that are selected, from among the components extracted from the learning data, for their higher relevance to the predetermined case. Data related to the predetermined case can therefore be found accurately from within an enormous amount of data.

<Second Embodiment>
Next, a second embodiment of the data analysis system will be described. The feature of this embodiment is that the evaluation results of the components contained in the learning data are used to divide the learning data into a plurality of segments, and each of those segments is then used as new learning data for evaluating the evaluation data. For example, the learning data can be divided into a plurality of segments by dividing its components into predetermined patterns based on how the components extracted from the learning data are distributed within it. More specifically, a plurality of segments can be set in the learning data by integrating the component groups selected from the learning data according to their relevance to the predetermined case.

The operation of the data analysis system according to the second embodiment will be described with reference to the operation flowchart (FIG. 5) of the controller of the server device 2. The processing up to the point where the controller selects component groups (S300 to S314) is the same as in the first embodiment. In S400, the controller merges mutually related component groups to create integrated groups (component group integration). The integration of component groups is described in detail below.

Component groups are treated as mutually related when, for example, they are adjacent with no non-component words (the "·" described above) between them, when they are separated by only a small number of such words, or when the last component of one group and the first component of the next group are the same term. In such cases the meanings of the component groups are expected to be related to each other, so the component groups are merged into an integrated group. The server device 2 stores the course of this integration in the control table of FIG. 6 and records it in a predetermined area of the memory.
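The merging rule can be sketched as below. The representation of component groups as inclusive `(start, end)` index spans over the token sequence, the function name `merge_groups`, and the `max_gap` parameter standing in for "a small number" of non-component words are all assumptions for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) positions in the learning data


def merge_groups(groups: List[Span], tokens: List[str], max_gap: int = 1) -> List[Span]:
    """Merge neighbouring component groups into integrated groups.

    Two consecutive groups are merged when at most `max_gap` tokens lie
    between them, or when the last component of the first group and the
    first component of the second group are the same term.
    `groups` must be sorted by position and must not overlap.
    """
    if not groups:
        return []
    merged = [groups[0]]
    for start, end in groups[1:]:
        prev_start, prev_end = merged[-1]
        gap = start - prev_end - 1  # number of tokens between the two groups
        same_term = tokens[prev_end] == tokens[start]
        if gap <= max_gap or same_term:
            merged[-1] = (prev_start, end)
        else:
            merged.append((start, end))
    return merged


tokens = ["price", "cut", "·", "rival", "quote", "·", "·", "·", "meeting", "plan"]
groups = [(0, 1), (3, 4), (8, 9)]
integrated = merge_groups(groups, tokens, max_gap=1)
```

In this sample, the first two groups are separated by a single non-component word and are merged, while the third group is three words away and stays separate.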

In FIG. 6, each of the component groups with group numbers (1) to (5) corresponds on its own to an "integrated group", the component groups with group numbers (6) and (7) are merged into integrated group #6, and so on as shown in FIG. 6. In FIG. 6, the component group evaluation value is the maximum, taken as a representative value, of the evaluation values of the components belonging to that component group, and the integrated group evaluation value is the maximum, taken as a representative value, of the evaluation values of the component groups belonging to that integrated group.
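The two representative values defined for FIG. 6 can be written down directly. The function names and the sample numbers below are hypothetical; only the use of the maximum as the representative value comes from the description.

```python
from typing import List


def group_value(component_values: List[float]) -> float:
    """Component group evaluation value: maximum over the group's components."""
    return max(component_values)


def integrated_group_value(member_groups: List[List[float]]) -> float:
    """Integrated group evaluation value: maximum over the member groups' values."""
    return max(group_value(values) for values in member_groups)


# Hypothetical evaluation values for two component groups merged into one integrated group.
group_6 = [0.25, 0.75]
group_7 = [0.5, 0.5]
value = integrated_group_value([group_6, group_7])
```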

Even after the component groups have been integrated, the number of integrated groups (#1 to #11) may still be large, so the controller further integrates the integrated groups (S402: integrated group integration). From the distribution of the maximum values of the integrated groups, the controller finds the peaks of those maximum values (the maximum values marked with "*" in FIG. 6) and, for each peak, sets a segment that combines the integrated groups around it (segment setting). FIG. 6 shows that three segments 1, 2, and 3 are set in the learning data. Accordingly, as shown in FIG. 2, the controller can divide the learning data into three parts: I (segment 1), II (segment 2), and III (segment 3).
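One way to picture the segment setting of S402 is shown below. The description only states that a segment is set per peak; the rule used here (cut at the lowest value between two consecutive peaks, with the valley assigned to the later segment) and the name `assign_segments` are assumptions, and the sample values are invented rather than taken from FIG. 6.

```python
from typing import List, Tuple


def assign_segments(values: List[float]) -> Tuple[List[int], List[List[int]]]:
    """Split a non-empty sequence of integrated group values into one segment per peak.

    values[i] is the evaluation value (maximum) of the i-th integrated group.
    Returns the peak indices and, per segment, the indices of its integrated groups.
    """
    n = len(values)
    peaks = [
        i for i in range(n)
        if (i == 0 or values[i] >= values[i - 1])
        and (i == n - 1 or values[i] > values[i + 1])
    ]
    # Consecutive peaks found this way are never adjacent, so each valley range is non-empty.
    cuts = [min(range(a + 1, b), key=lambda i: values[i]) for a, b in zip(peaks, peaks[1:])]
    segments, start = [], 0
    for cut in cuts:
        segments.append(list(range(start, cut)))
        start = cut
    segments.append(list(range(start, n)))
    return peaks, segments


# Eleven hypothetical integrated group values.
values = [0.2, 0.8, 0.3, 0.1, 0.6, 0.4, 0.2, 0.9, 0.5, 0.3, 0.1]
peaks, segments = assign_segments(values)
```

For these sample values three peaks are found, so the eleven integrated groups fall into three segments.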

When the controller proceeds to the evaluation of the evaluation data (S404 (S316 to S320)), it refers to the control table (FIG. 6) and evaluates the evaluation data based on the three segments. Because the number of learning data items increases, the recall described above can be improved. When evaluating the evaluation data, the controller may use the components of each of the plurality of learning data items together with their evaluation values, or it may newly extract components for each learning data item and compute their evaluation values for use.

[Data format processed by the data analysis system]
In the present embodiment, "data" may be any data expressed in a format that can be processed by a computer. The data may be, for example, unstructured data whose structural definition is at least partly incomplete. It broadly includes, without being limited to these examples: document data containing, at least in part, text written in a natural language (for example, e-mail (including attachments and header information), technical documents (broadly, documents explaining technical matters, such as academic papers, patent publications, product specifications, and design drawings), presentation materials, spreadsheets, financial reports, meeting materials, reports, sales materials, contracts, organization charts, business plans, company analysis information, electronic medical records, web pages, blogs, and comments posted on social network services); audio data (for example, recordings of conversations or music); image data (for example, data composed of a plurality of pixels or of vector information); and video data (for example, data composed of a plurality of frame images).

For example, when analyzing document data, the system can extract, as components, the morphemes contained in document data serving as learning data, evaluate each of those components, and then evaluate the relevance between document data serving as evaluation data and the predetermined case based on the components extracted from that document data. When analyzing audio data, the system may analyze the audio data itself, or it may convert the audio data into document data by speech recognition and analyze the converted document data. In the former case, the system can analyze the audio data by, for example, dividing it into partial audio of a predetermined length to form components and identifying each piece of partial audio using any audio analysis technique (for example, a hidden Markov model or a Kalman filter). In the latter case, the system recognizes the speech using any speech recognition algorithm (for example, a recognition method using a hidden Markov model) and analyzes the recognized data (document data) by the same procedure as described above.
When analyzing image data, the system can analyze the image data by, for example, dividing it into partial images of a predetermined size to form components and identifying each partial image using any image recognition technique (for example, pattern matching, a support vector machine, or a neural network). Likewise, when analyzing video data, the system can analyze the video data by, for example, dividing each of the frame images contained in the video data into partial images of a predetermined size to form components and identifying each partial image using any image recognition technique (for example, pattern matching, a support vector machine, or a neural network).
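The division of image data into partial images of a predetermined size can be sketched as follows. The function name `split_into_tiles`, the nested-list image representation, and the choice to keep smaller tiles at the right and bottom edges are assumptions for illustration; the recognition step applied to each partial image is outside this sketch.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixel values


def split_into_tiles(image: Image, tile_h: int, tile_w: int) -> List[Tuple[Tuple[int, int], Image]]:
    """Divide an image into partial images (tiles) of at most tile_h x tile_w pixels.

    Each result pairs the tile's (top, left) position with its pixel rows, so the
    position of each component within the original data is preserved.
    """
    if not image:
        return []
    tiles = []
    for top in range(0, len(image), tile_h):
        for left in range(0, len(image[0]), tile_w):
            tile = [row[left:left + tile_w] for row in image[top:top + tile_h]]
            tiles.append(((top, left), tile))
    return tiles


# A 4 x 4 image whose pixel values are simply 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tiles = split_into_tiles(image, 2, 2)
```

The same function could be applied to each frame image of video data, with each tile then passed to whatever recognition technique is chosen.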

[Example of implementation using software and hardware]
The control blocks of the system may be implemented by logic circuits (hardware) formed in an integrated circuit (IC chip) or the like, or by software using a CPU. In the latter case, the system includes a CPU that executes a program (the control program of the data analysis system), which is software implementing each function; a ROM (Read Only Memory) or storage device (referred to as a "recording medium") on which the program and various data are recorded so as to be readable by a computer (or CPU); a RAM (Random Access Memory) into which the program is loaded; and the like. The object of the present invention is achieved when the computer (or CPU) reads the program from the recording medium and executes it. As the recording medium, a "non-transitory tangible medium" such as a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit can be used. The program may also be supplied to the computer via any transmission medium capable of transmitting it (such as a communication network or broadcast waves). The present invention can also be realized in the form of a data signal embedded in a carrier wave, in which the program is embodied by electronic transmission. The program can be implemented in any programming language. Any recording medium on which the program is recorded also falls within the scope of the present invention.

[Application example]
The system can be implemented as an artificial intelligence system that analyzes big data (any system capable of evaluating the relevance between data and a predetermined case), for example a discovery support system, a forensic system, an e-mail monitoring system, a medical application system (for example, a pharmacovigilance support system, a clinical trial efficiency system, a medical risk hedging system, a fall prediction (fall prevention) system, a prognosis prediction system, or a diagnosis support system), an Internet application system (for example, a smart mail system, an information aggregation (curation) system, a user monitoring system, or a social media management system), an information leakage detection system, a project evaluation system, a marketing support system, an intellectual property evaluation system, a fraudulent transaction monitoring system, a call center escalation system, or a credit investigation system. Depending on the field in which the data analysis system of the present invention is applied, and in consideration of circumstances specific to that field, the data may be preprocessed (for example, by extracting important portions from the data and analyzing only those portions), or the manner in which the analysis results are displayed may be changed. Those skilled in the art will understand that many such variations are possible, and all of them fall within the scope of the present invention.

The present invention is not limited to the embodiments described above. Various modifications are possible within the scope of the claims, and embodiments obtained by appropriately combining technical means disclosed in different embodiments are also included in the technical scope of the present invention. Furthermore, new technical features can be formed by combining the technical means disclosed in the respective embodiments.

1: data analysis system, 2: server device, 3: client device, 4: database, 5: storage system, 6: management computer

Claims

A data analysis system for analyzing data, comprising:
a memory that at least temporarily stores a plurality of evaluation data to be analyzed; and
a controller that evaluates the plurality of evaluation data based on learning data,
wherein the controller:
extracts a plurality of components from the learning data, each of the plurality of components constituting at least a part of the learning data;
obtains a distribution mode of the plurality of components in the learning data from the relationship between evaluation information of each of the extracted components and the position at which each of the components appears in the learning data;
selects, from among the plurality of components and based on the distribution mode, the components to be used for evaluating the plurality of evaluation data; and
evaluates the evaluation data based on the selected components.

The data analysis system according to claim 1, wherein the controller:
uses, as the learning data, a combination of reference data presented to a user and classification information set for the reference data by the user;
generates evaluation information for each of the plurality of components based on the degree to which each of the components contributes to the combination; and
evaluates the evaluation data by generating, based on the generated evaluation information, an index for ranking the evaluation data.

The data analysis system according to claim 1 or 2, wherein the controller selects, based on the distribution mode and from among the plurality of components extracted from the learning data, a plurality of components that are present in the learning data in a predetermined positional relationship, as the components to be used for evaluating the plurality of evaluation data.

The data analysis system according to claim 3, wherein the controller:
obtains, based on the distribution mode, at least one peak of the evaluation information of the plurality of components extracted from the learning data; and
selects, based on the peak, the components to be used for evaluating the plurality of evaluation data from among the plurality of components.

The data analysis system according to claim 4, wherein the controller:
determines a component group including the component corresponding to the position of the peak and other components located in the vicinity of that component in the learning data; and
selects the plurality of components belonging to the component group as the components to be used for evaluating the plurality of evaluation data.

The data analysis system according to any one of claims 1 to 5, wherein the controller:
divides the learning data into a plurality of segments based on the distribution mode;
uses the plurality of divided segments as a plurality of new learning data; and
evaluates the plurality of evaluation data based on the plurality of new learning data.

The data analysis system according to claim 6, wherein the controller:
classifies the plurality of selected components into a plurality of groups based on the distribution mode; and
determines the plurality of segments in the course of integrating the plurality of groups.

A method of controlling a data analysis system that evaluates a plurality of evaluation data based on learning data, wherein the data analysis system executes:
a step of extracting a plurality of components from the learning data, each of the plurality of components constituting at least a part of the learning data;
a step of obtaining a distribution mode of the plurality of components in the learning data from the relationship between evaluation information of each of the extracted components and the position at which each of the components appears in the learning data;
a step of selecting, from among the plurality of components and based on the distribution mode, the components to be used for evaluating the plurality of evaluation data; and
a step of evaluating the evaluation data based on the selected components.

A program for causing a computer to perform data analysis that evaluates a plurality of evaluation data based on learning data, the program causing the computer to execute:
a step of extracting a plurality of components from the learning data, each of the plurality of components constituting at least a part of the learning data;
a step of obtaining a distribution mode of the plurality of components in the learning data from the relationship between evaluation information of each of the extracted components and the position at which each of the components appears in the learning data;
a step of selecting, from among the plurality of components and based on the distribution mode, the components to be used for evaluating the plurality of evaluation data; and
a step of evaluating the evaluation data based on the selected components.

A computer-readable recording medium on which the program according to claim 9 is recorded.