JP2015203961A

JP2015203961A - Document extraction system

Info

Publication number: JP2015203961A
Application number: JP2014082782A
Authority: JP
Inventors: 佳男高枝; Yoshio Takaeda; 哲也金田; Tetsuya Kaneda; 弘海矢野; Hiromi Yano; 康生大原; Yasuo Ohara
Original assignee: TOOR Inc; Cybernet Systems Co Ltd
Current assignee: TOOR Inc; Cybernet Systems Co Ltd
Priority date: 2014-04-14
Filing date: 2014-04-14
Publication date: 2015-11-16

Abstract

[Problem] An object of the present invention is to make it possible to extract documents according to how much description related to the concept of a conditional sentence they contain. [Solution] The document extraction method according to the present invention comprises, in order: a partial score calculation procedure (S113) which, when a conditional sentence is acquired, divides each document into a plurality of predetermined segments and quantifies the similarity to the conditional sentence for each segment to obtain that segment's score; and an extraction procedure (S115) which uses the scores of the segments constituting a document to calculate a document score representing the similarity between that document and the conditional sentence, and uses the document scores to select, from among a plurality of documents, documents conceptually close to the conditional sentence. [Selected drawing] FIG. 2

Description

The present invention relates to a document extraction system that extracts, from among a plurality of documents, documents containing many passages similar to a search condition.

Systems have been proposed that search a large number of documents for documents whose content is close to a conditional sentence (see, for example, Patent Document 1). Representative approaches are keyword-based methods and methods that use feature vectors based on word occurrence frequency. In both, the similarity to the conditional sentence is generally quantified as a score by some algorithm, and documents are extracted in descending order of score.

Patent Document 1: JP 2013-30089 A

However, with either approach, when the whole document is scored, a document in which only one part describes content close to the conditional sentence may fail to be retrieved, even if that part is extremely close to the conditional sentence. This is because documents vary in size, from broad, large documents such as books to small documents with short, direct descriptions such as dictionary entries. For example, if the score is computed naively, a document such as a book can receive a high score simply because it contains a large amount of text, even if its content differs greatly from the conditional sentence. One way to prevent this is to divide the score by the amount of text. This prevents the extraction of long documents whose content differs greatly from the conditional sentence. However, it introduces a new problem: long documents that contain, in one part, content very close to the conditional sentence are no longer extracted.

To solve the above problems and appropriately extract documents having descriptions close to the conditional sentence, the inventors found it effective to divide each document into formal parts (hereinafter called document segments), calculate a score for each segment, apply a fixed weighting to each segment to obtain a score value for the whole document, and extract relevant documents on that basis.

Specifically, the document extraction method according to the present invention is
a document extraction method for extracting, from among a plurality of documents, documents close to the concept of a conditional sentence, the method comprising, in order:
a partial score calculation procedure in which a partial score calculation unit, upon acquiring the conditional sentence, divides each document into a plurality of predetermined segments and quantifies the similarity to the conditional sentence for each segment to obtain that segment's score; and
an extraction procedure in which an extraction unit uses the scores of the segments constituting a document, weighted according to a fixed algorithm, to calculate a document score representing the similarity between that document and the conditional sentence, and uses the document scores to select, from among the plurality of documents, documents conceptually close to the conditional sentence.

The document extraction method according to the present invention may further include, after the extraction procedure, a mapping procedure in which a mapping unit combines the feature vectors of the segments of each document extracted in the extraction procedure, using the weights of those segments, to generate a document feature vector for that document; calculates the similarity between the documents in the extracted document group using the document feature vectors; and places the extracted documents on a map according to the similarity between them.

In the document extraction method according to the present invention, the extraction procedure may calculate, for each document, the highest of the scores contained in that document, and extract from the plurality of documents those documents whose highest value falls within a predetermined top range.

In the document extraction method according to the present invention, the extraction procedure may calculate, for each document, a comparison value from those scores contained in the document that fall within a predetermined top range of scores, and extract from the plurality of documents those documents whose comparison value falls within a predetermined top range.

Specifically, the document extraction system according to the present invention is
a document extraction system for extracting, from among a plurality of documents, documents close to the concept of a conditional sentence, the system comprising:
a partial score calculation unit that, upon acquiring the conditional sentence, divides each document into a plurality of predetermined segments and quantifies the similarity to the conditional sentence for each segment to obtain that segment's score; and
an extraction unit that uses the scores of the segments constituting a document to calculate a document score representing the similarity between that document and the conditional sentence, and uses the document scores to select, from among the plurality of documents, documents conceptually close to the conditional sentence.

The document extraction system according to the present invention may further include a mapping unit that combines the feature vectors of the segments of each document extracted by the extraction unit, using the weights of those segments, to generate a document feature vector for that document; calculates the similarity between the documents in the extracted document group using the document feature vectors; and places the extracted documents on a map according to the similarity between them.

In the document extraction system according to the present invention, the extraction unit may calculate, for each document, the highest of the scores contained in that document, and extract from the plurality of documents those documents whose highest value falls within a predetermined top range.

In the document extraction system according to the present invention, the extraction unit may calculate, for each document, a comparison value from those scores contained in the document that fall within a predetermined top range of scores, and extract from the plurality of documents those documents whose comparison value falls within a predetermined top range.

The above inventions can be combined wherever possible.

According to the present invention, documents can be extracted according to how much description related to the concept of the conditional sentence they contain.

FIG. 1 shows a configuration example of the partial information extraction system according to Embodiment 1.
FIG. 2 shows a sequence of the partial information extraction system according to Embodiment 1.
FIG. 3 shows an example of a method for calculating the score P_k using a vector space model.
FIG. 4 shows an example of a method for calculating the document score X_i.
FIG. 5 shows an example of a method for calculating the score P_k using keyword occurrence frequency.
FIG. 6 shows a configuration example of the partial information extraction system according to Embodiment 2.
FIG. 7 shows a sequence of the partial information extraction system according to Embodiment 2.
FIG. 8 shows an example of a map.

Embodiments of the present invention are described in detail below with reference to the drawings. The present invention is not limited to the embodiments shown below. These embodiments are merely examples, and the present invention can be implemented in forms with various modifications and improvements based on the knowledge of those skilled in the art. In this specification and the drawings, components with the same reference numerals denote the same components.

(Embodiment 1)
FIG. 1 shows a configuration example of the document extraction system according to this embodiment. The document extraction system includes a server 10, storage 20, and a user terminal 30. The storage 20 is any storage medium accessible from the server 10. The server 10 and the user terminal 30 are computers equipped with computing resources such as a CPU (Central Processing Unit) and a storage medium, and a program is installed on the storage medium. Any number of servers 10, storages 20, and user terminals 30 may be used; this embodiment describes the case of one server 10, two storages 20, and one user terminal 30.

The storage 20 holds documents. A document includes any data transmitted and received via a communication network, for example text, numerical data, log data, and customer information. Examples of text include patents, papers, books, reports, and web pages. Examples of numerical data include sensor data, measurement data, and POS (Point of Sales) data. Examples of log data include online access data and status data of various devices. This embodiment describes, as an example, the case where the documents are text.

FIG. 2 shows a sequence of the partial information extraction system according to this embodiment. Before step S101, in which the server 10 acquires a conditional sentence from the user terminal 30, the server 10 generates a feature vector for each segment. For example, the server 10 acquires a document from the storage 20, divides it into a plurality of predetermined segments, and for each segment generates a feature vector according to the vector space model on the basis of an index. The feature vector of each segment is preferably stored in the storage 20 and used in subsequent similarity calculations. In that case, the feature vectors are preferably stored in a secondary storage 20, separate from the original information group. The original information group is not used at all in the calculation stage; it is used only when the original information is displayed in the final stage.

The user terminal 30 transmits a conditional sentence via the communication network (S101). Upon receiving the conditional sentence from the user terminal 30, the server 10 acquires the feature vector of each segment from the storage 20 (S102), calculates the score of each acquired segment (S113), uses the segment scores to calculate a document score representing the similarity between each document and the conditional sentence, extracts documents conceptually close to the conditional sentence using the document scores (S115), and transmits the extraction result to the user terminal 30 (S116). The user terminal 30 displays the extraction result received from the server 10 (S117).

The server 10 includes a communication function unit (not shown) that exchanges information with the user terminal 30 and the storage 20 via the communication network, and a configuration for extracting documents. The configuration for extracting documents includes, for example, a conditional sentence acquisition unit 11, a partial score calculation unit 15, and an extraction unit 17. The server 10 may be realized by causing a computer to function as the conditional sentence acquisition unit 11, the partial score calculation unit 15, and the extraction unit 17. In this case, the CPU in the server 10 realizes each component by executing a computer program stored in a storage unit (not shown).

When extracting documents, the server 10 executes the document extraction method according to this embodiment, which comprises a partial score calculation procedure (S113) and an extraction procedure (S115), in that order.

In the partial score calculation procedure (S113), upon acquiring the conditional sentence, the partial score calculation unit 15 quantifies the similarity to the conditional sentence for each segment making up a part of a document, and takes the result as that segment's score. When a document contains text, a segment is, for example, a paragraph or a sentence. Paragraphs are identified, for example, by detecting line breaks. Sentences are identified by detecting the period "。" or "．", the question mark "？", and the exclamation mark "！". This embodiment describes, as an example, the case where segments are paragraphs.
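The segmentation rules described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and accepting the ASCII terminators ".", "?" and "!" alongside the full-width ones is an added assumption.

```python
import re

# Sentence terminators named in the text ("。", "．", "？", "！");
# the ASCII variants are an assumption added for illustration.
_SENTENCE = re.compile(r"[^。．.？?！!]+[。．.？?！!]?")


def split_paragraphs(text):
    """Split text into paragraph segments at line breaks, dropping empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_sentences(text):
    """Split text into sentence segments at the terminator characters."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]
```

A splitter this simple also breaks at the period in a decimal number such as "3.5"; a practical system would need a more careful rule.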

In the partial score calculation procedure (S113), the partial score calculation unit 15 calculates, for each segment, a feature vector P_i representing the concept of the segment based on the vector space model. For example, the vector determination unit 12 vectorizes the conditional sentence d_k and the segment d_i based on the vector space model to calculate the conditional sentence vector and the feature vector. The feature vector P_i may be calculated in advance, before the conditional sentence is acquired. Keeping the segment feature vectors P_i independent of the conditional sentence in this way reduces the load on the server 10 and allows the extraction result to be provided to the user terminal 30 promptly.

When information d_i can be expressed as a matrix over elements t_j, the information d_i can be described by the vector space model d_i = (t_1, t_2, t_3, ...). The conditional sentence can therefore be described by a conditional sentence vector whose elements are the words it contains. Likewise, a segment can be described by a feature vector whose elements are the words it contains.

Let n_ij be the frequency with which element t_j appears in segment d_i. Segment d_i can then be represented by the concept vector d_i = (n_i1, n_i2, n_i3, ...). For example, if the occurrence counts of the words t_1, t_2, t_3 are 0, 1, 0 in segment d_1; 2, 1, 0 in segment d_2; and 1, 2, 3 in segment d_3, the segment matrix M is expressed as follows.
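The example counts above can be assembled into a matrix as sketched below. The matrix figure itself is not reproduced in this text, so the layout (one row per segment, one column per word) is an assumption, as are the function name and the toy token lists.

```python
from collections import Counter


def term_frequency_matrix(segments, vocabulary):
    """Return one row per segment; entry j counts occurrences of vocabulary[j]."""
    rows = []
    for tokens in segments:
        counts = Counter(tokens)  # missing words count as 0
        rows.append([counts[term] for term in vocabulary])
    return rows


# Toy token lists chosen to reproduce the counts given in the text.
vocabulary = ["t1", "t2", "t3"]
segments = [
    ["t2"],                                # d_1: 0, 1, 0
    ["t1", "t1", "t2"],                    # d_2: 2, 1, 0
    ["t1", "t2", "t2", "t3", "t3", "t3"],  # d_3: 1, 2, 3
]
M = term_frequency_matrix(segments, vocabulary)
# M == [[0, 1, 0], [2, 1, 0], [1, 2, 3]]
```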

In the partial score calculation procedure (S113), the partial score calculation unit 15 quantifies, for each segment, the similarity between the feature vector and the conditional sentence vector representing the concept of the conditional sentence, and takes the result as that segment's score. For example, the closeness in content between segment d_i and conditional sentence d_k can be quantified by an operation on the feature vector d_i and the conditional sentence vector d_k. The operation may be the distance between the vectors, or any other operation such as an inner product or outer product.
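One possible scoring operation is sketched below. Cosine similarity (a normalized inner product) is chosen here only as an assumption: the text allows any vector operation, and cosine has the convenient property that identical directions score 1. The function name is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Normalized inner product of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Score of segment d_2 = (2, 1, 0) against a conditional sentence vector (1, 1, 0).
score = cosine_similarity([2, 1, 0], [1, 1, 0])
```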

Words that are used commonly in every segment do not affect how close documents are in content. In calculating the vectors, it is therefore preferable to differentiate the contribution of words characteristic of each document from that of other words. For example, weighting is performed using the tfidf (Term Frequency Inverse Document Frequency) method. This improves the accuracy of the closeness between segment contents. The tfidf weight of a word used equally in every document is small, and the tfidf weight of a word whose frequency of use differs greatly between documents is large.
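The weighting can be illustrated with the common tf × log(N / df) form of tfidf. The text names the method but not the exact formula, so this variant, the function name, and the treatment of segments as the "documents" of the idf term are all assumptions.

```python
import math


def tfidf_weights(matrix):
    """Weight a segment-by-term count matrix by tf * log(N / df)."""
    if not matrix:
        return []
    n_segments = len(matrix)
    n_terms = len(matrix[0])
    # df[j]: number of segments in which term j occurs at least once.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_segments / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]


# The second term occurs in every segment, so its weight becomes 0 everywhere.
weighted = tfidf_weights([[0, 1, 0], [2, 1, 0], [1, 2, 3]])
```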

In the extraction procedure (S115), the extraction unit 17 uses the scores P_i of the segments constituting a document m to calculate a document score X_m representing the similarity between that document and the conditional sentence. Any method may be used to calculate the document score X_m, and it is preferably configurable from the user terminal 30. For example, when the similarity between the conditional sentence and each paragraph of a document is calculated based on the vector space model, the distribution of the scores P_i in document m is, as shown in FIG. 3, a distribution function in which a perfect match is 1. The document score X_m may therefore be the score P_i at which the distribution is densest, the average of the scores P_i, the highest of the scores P_i, or a comparison value computed from the scores P_i within a predetermined top range. The comparison value is, for example, the sum of the scores of the three highest-scoring paragraphs. The number of top paragraphs used (three here) is preferably configurable from the user terminal 30.

The score value P_m of document m is preferably calculated after weighting based on a fixed algorithm. For example, let S_i be the score value of segment i of document m with respect to the conditional sentence. The score value P_m of document m is then expressed as
P_m = Σ_i A_i * S_i   (1)
where A_i is the weighting coefficient of segment i.

There are various ways to choose A_i in equation (1). Four cases are described here as concrete examples.
(1) Use the peak value
Set A_i = 1 only for the segment with the highest score value, and A_i = 0 for all other segments.
(2) Select the top three segments
Sum the scores of the three segments with the highest score values to obtain the document score.
(3) Select segments at or above a fixed score value
Sum the scores of the qualifying segments to obtain the document score.
(4) Halve the weighting successively, starting from the segment with the highest score value
A_{ik_1} = 1, A_{ik_2} = 1/2, A_{ik_3} = 1/4, ..., where k_1, k_2, k_3, ... index the segments in descending order of score value.
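The four weighting schemes can be sketched as choices of the coefficients A_i in equation (1). This is an illustrative sketch under assumptions: the function names, the scheme labels, and the default threshold of 0.5 are hypothetical and do not come from the text.

```python
def segment_weights(scores, scheme, top_n=3, threshold=0.5):
    """Return the coefficients A_i of equation (1) for the given segment scores."""
    # Segment indices in descending order of score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    a = [0.0] * len(scores)
    if scheme == "peak":          # (1) only the best segment counts
        for i in order[:1]:
            a[i] = 1.0
    elif scheme == "top_n":       # (2) the top_n best segments count equally
        for i in order[:top_n]:
            a[i] = 1.0
    elif scheme == "threshold":   # (3) segments at or above a fixed score
        for i, s in enumerate(scores):
            if s >= threshold:
                a[i] = 1.0
    elif scheme == "halving":     # (4) weights 1, 1/2, 1/4, ... by rank
        for rank, i in enumerate(order):
            a[i] = 0.5 ** rank
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return a


def weighted_document_score(scores, a):
    """Equation (1): P_m = sum over i of A_i * S_i."""
    return sum(ai * si for ai, si in zip(a, scores))


scores = [0.2, 0.8, 0.4, 0.6]
peak = weighted_document_score(scores, segment_weights(scores, "peak"))        # 0.8
halved = weighted_document_score(scores, segment_weights(scores, "halving"))   # about 1.225
```

Schemes (1) to (3) use weights of 0 or 1 only, so they reduce to summing a selected subset of segment scores.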

To avoid the influence of the amount of text in the segments, the total amount of text in the extracted segments may be normalized. That is, the total score may be divided by the total amount of text in the qualifying segments before comparison.
With the above method, documents containing content close to the conditional sentence can be extracted regardless of document length.
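The normalization step can be sketched as follows, assuming the top-N selection scheme and measuring the amount of text as a character count. Both choices, and the function name, are assumptions made for illustration.

```python
def normalized_score(segments, top_n=3):
    """Divide the summed score of the selected segments by their summed length.

    segments: list of (score, length) pairs, where length is the amount of
    text in the segment (for example, a character count).
    """
    selected = sorted(segments, key=lambda s: s[0], reverse=True)[:top_n]
    total_score = sum(score for score, _ in selected)
    total_length = sum(length for _, length in selected)
    return total_score / total_length if total_length else 0.0


# Only the two best segments are selected, so the long low-scoring
# segment (5000 characters) does not dilute the result.
value = normalized_score([(0.9, 100), (0.1, 5000), (0.8, 100)], top_n=2)
```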

FIG. 4 shows an example of the distribution D(X) of the document score X_m. Because the document score P_m is a distribution function in which a perfect match is 1, the distribution D(X) is also a distribution function over the range 0 to 1. The extraction unit 17 uses the document scores X_m to select, from among the plurality of documents, documents conceptually close to the conditional sentence. The server 10 transmits the selected documents to the user terminal 30. At this time, only the selected segments may be transmitted to the user terminal 30. In this way, the user terminal 30 can be provided with documents that contain many parts conceptually close to the conditional sentence.

Any method may be used to select documents, and it is preferably configurable from the user terminal 30. For example, a predetermined number or proportion of documents is extracted. To extract the top 10% of documents by similarity, the documents m are extracted for which the integral of D(X) over X > X_0, shown as the hatched area, equals 0.1. The number or proportion to extract is preferably configurable from the user terminal 30.
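Selecting the top proportion of documents amounts to ranking by document score and keeping the leading fraction. The sketch below assumes document scores are held in a dictionary keyed by a document identifier; the function name and the rounding rule (round up, keep at least one document) are assumptions.

```python
import math


def select_top_fraction(doc_scores, fraction=0.1):
    """Return the ids of the highest-scoring documents, keeping the given fraction."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not doc_scores:
        return []
    k = max(1, math.ceil(len(doc_scores) * fraction))
    ranked = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Twenty documents with scores 0.00, 0.05, ..., 0.95: the top 10% is two documents.
example = {f"doc{i}": i / 20 for i in range(20)}
top = select_top_fraction(example, fraction=0.1)
# top == ["doc19", "doc18"]
```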

In the partial score calculation procedure (S113), closeness of content may instead be judged, for example, by the presence or absence of words contained in the conditional sentence. When the conditional sentence contains a single word, each segment is judged in binary fashion according to whether or not it contains that word. As another example, consider an evaluation condition consisting of the two words "rare earth" (希土類) and "magnet" (磁石). For descriptions relating to rare earth magnets, as shown in FIG. 5, the score P_i of a paragraph containing both words is set to 1 and all other scores P_i are set to 0. The sum of the scores of all paragraphs in document m is taken as the document score X_i.
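The keyword-based variant can be sketched as below. The function names and the case-insensitive substring matching are assumptions; the text specifies only that a paragraph containing all the words scores 1, any other paragraph scores 0, and the document score is the sum.

```python
def keyword_segment_score(segment, keywords):
    """Return 1 if the segment contains every keyword, otherwise 0."""
    text = segment.lower()
    return 1 if all(k.lower() in text for k in keywords) else 0


def keyword_document_score(paragraphs, keywords):
    """Sum the binary paragraph scores to obtain the document score."""
    return sum(keyword_segment_score(p, keywords) for p in paragraphs)


paragraphs = [
    "Rare earth magnets are strong.",
    "A magnet attracts iron.",
    "Rare earth elements are mined; each magnet needs them.",
]
doc_score = keyword_document_score(paragraphs, ["rare earth", "magnet"])
# doc_score == 2
```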

Although this embodiment has described an example in which the documents are text, documents in the present invention are not limited to this. When a document contains numerical data or log data, a segment is, for example, a time or period, a region or place, or an affiliation. When a document contains customer data, a segment is, for example, a time or period, a region or place, an affiliation, or an age. The unit of time is arbitrary; it may be, for example, seconds or years.

When a document contains numerical data or log data, vectorization based on the vector space model is performed as follows.
When the document is user access log data for an online service, let n_ij be the number of accesses by user t_j between times d_i and d_i + T (time interval T). Time d_i can then be expressed as the vector d_i = (n_i1, n_i2, n_i3, ...).
When the document is sensor data, let n_ij be the output value of sensor t_j between times d_i and d_i + T (time interval T). Time d_i can then be expressed as the vector d_i = (n_i1, n_i2, n_i3, ...).
When the document is image data, the image d_i is frequency-transformed, and n_ij is the value of frequency component t_j after the transform. The image d_i can then be expressed as the vector d_i = (n_i1, n_i2, n_i3, ...).
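For the access log case, the vectorization can be sketched as counting accesses per user within consecutive time windows of length T. The event format (timestamp, user) and all names below are assumptions made for illustration.

```python
def window_vectors(events, users, start, interval, n_windows):
    """Build one vector per time window; entry j counts accesses by users[j].

    events: iterable of (timestamp, user) pairs.
    Window i covers the half-open range [start + i*interval, start + (i+1)*interval).
    """
    index = {user: j for j, user in enumerate(users)}
    vectors = [[0] * len(users) for _ in range(n_windows)]
    for timestamp, user in events:
        i = int((timestamp - start) // interval)
        if 0 <= i < n_windows and user in index:
            vectors[i][index[user]] += 1
    return vectors


events = [(0, "u1"), (5, "u2"), (12, "u1"), (15, "u1"), (29, "u3")]
vectors = window_vectors(events, ["u1", "u2", "u3"], start=0, interval=10, n_windows=3)
# vectors == [[1, 1, 0], [2, 0, 0], [0, 0, 1]]
```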

When a document contains numerical data or log data, the tfidf weighting behaves as follows.
When the document is user access log data for an online service, the tfidf weight of a user who accesses evenly throughout is small, and the tfidf weight of a user whose access is highly uneven is large.
When the document is sensor data, the tfidf weight of a sensor whose output value changes little is small, and the tfidf weight of a sensor whose output value changes greatly is large.
When the document is image data, the tfidf weight of a frequency whose component value varies little between images is small, and the tfidf weight of a frequency whose component value varies greatly between images is large.

(Embodiment 2)
FIG. 6 shows a configuration example of the partial information extraction system according to this embodiment. In addition to the configuration of Embodiment 1, the system further includes a mapping unit 14.

FIG. 7 shows the sequence of the partial information extraction system according to this embodiment. The partial information extraction method according to this embodiment further includes a mapping procedure (S126) after the extraction procedure (S115) described in Embodiment 1. The server 10 transmits the map created in the mapping procedure to the user terminal 30 (S127). The user terminal 30 displays the map received from the server 10 (S128).

In the mapping procedure (S126), the mapping unit 14 places, for the documents extracted by the extraction unit 17, the points represented by the feature vectors P_i and the conditional sentence vector P_k on a map according to their conceptual closeness.

Here, the feature vector P_m of document m is obtained by combining the feature vectors P_i of its segments. The weighting coefficient A_i of each segment is taken into account in the combination. For example, the feature vector P_m of document m is expressed as
P_m = Σ_i A_i * P_i   (2)
The document feature vectors P_m are then used to calculate the similarity between the documents in the document group extracted in the extraction procedure.
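
Equation (2) is a weighted sum of segment vectors, which can be sketched as below. The function name `document_vector` and the numeric values are illustrative assumptions; how the coefficients A_i are chosen is not covered by this sketch.

```python
def document_vector(segment_vectors, weights):
    """Equation (2): P_m = sum over i of A_i * P_i for the segments of one document."""
    if len(segment_vectors) != len(weights):
        raise ValueError("one weight per segment is required")
    dims = len(segment_vectors[0])
    p_m = [0.0] * dims
    for a_i, p_i in zip(weights, segment_vectors):
        for j in range(dims):
            p_m[j] += a_i * p_i[j]
    return p_m

segments = [[1.0, 0.0, 2.0],   # P_1 (made-up values)
            [0.0, 4.0, 2.0]]   # P_2
weights = [0.5, 0.25]          # hypothetical A_1, A_2
p_m = document_vector(segments, weights)
print(p_m)  # [0.5, 1.0, 1.5]
```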

When the content of a segment is close to the conditional sentence, the kinds of words used are similar, so the points indicated by their vectors lie close to each other. The closeness between the feature vectors and the conditional sentence vector is therefore calculated, and mapping is performed based on this closeness, that is, on the closeness of content between items of information (the "semantic distance"). The calculation may use the distance between vectors, or any other operation such as an inner product or outer product. Based on the closeness of content among the obtained items of information d_i, a map such as that shown in FIG. 8 can be created using a mapping algorithm.
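
Two of the closeness measures mentioned above, the distance between vectors and a normalized inner product (cosine similarity), can be sketched as follows. The vectors are made up, and the choice of measure is an assumption. The source does not name a mapping algorithm; a layout method such as multidimensional scaling could take a distance matrix like this one as input.

```python
import math

def euclidean(u, v):
    """Distance between two vectors; smaller means closer in content."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Inner product normalized by vector lengths; larger means closer."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def distance_matrix(points):
    """Pairwise distances between all points, as input for a layout algorithm."""
    return [[euclidean(p, q) for q in points] for p in points]

query = [1.0, 1.0, 0.0]   # conditional sentence vector P_k (made-up values)
seg_a = [2.0, 2.0, 0.0]   # segment using the same words as the query
seg_b = [0.0, 0.0, 3.0]   # segment using unrelated words
d = distance_matrix([query, seg_a, seg_b])
```

Here `seg_a` lies closer to the query than `seg_b` under both measures, so it would be placed nearer to the query on the map.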

The system according to this embodiment extracts segments using concept search and can map the distribution of segment contents using the vectors calculated by the concept search. Segments can therefore be displayed classified by which word of the conditional sentence each is close to.

The present invention is applicable to the information and communication industry.

10: server
11: conditional sentence acquisition unit
15: partial score calculation unit
17: extraction unit
14: mapping unit
20: storage
30: user terminal

Claims

A document extraction method for extracting, from a plurality of documents, a document close in concept to a conditional sentence, the method comprising, in order:
a partial score calculation procedure in which, when the conditional sentence is acquired, a partial score calculation unit quantifies, for each segment constituting a part of each document, the similarity to the conditional sentence and sets the result as the score of that segment; and
an extraction procedure in which an extraction unit calculates, using the scores of the segments constituting one document and after weighting based on a certain algorithm, a document score representing the similarity between that document and the conditional sentence, and selects, using the document scores, a document close in concept to the conditional sentence from the plurality of documents.

The document extraction method according to claim 1, further comprising, after the extraction procedure, a mapping procedure in which a mapping unit combines the feature vectors of the segments, using the weight of each segment of a document extracted in the extraction procedure, to generate a document feature vector of that document, calculates the similarity between the documents in the document group extracted in the extraction procedure using the document feature vectors, and arranges the documents extracted in the extraction procedure on a map according to the similarity between the documents.

The document extraction method according to claim 1 or 2, wherein, in the extraction procedure, the highest of the scores included in one document is obtained for each document, and documents within a predetermined range ranked highest by that highest score are extracted from the plurality of documents.

The document extraction method according to claim 1 or 2, wherein, in the extraction procedure, a comparison value is calculated for each document using those scores, among the scores included in one document, that fall within a predetermined score range, and documents within a predetermined range ranked highest by the comparison value are extracted from the plurality of documents.

A document extraction system that extracts, from a plurality of documents, a document close in concept to a conditional sentence, the system comprising:
a partial score calculation unit that, when the conditional sentence is acquired, quantifies, for each segment constituting a part of each document, the similarity to the conditional sentence and sets the result as the score of that segment; and
an extraction unit that calculates, using the scores of the segments constituting one document, a document score representing the similarity between that document and the conditional sentence, and selects, using the document scores, a document close in concept to the conditional sentence from the plurality of documents.

The document extraction system according to claim 5, further comprising a mapping unit that combines the feature vectors of the segments, using the weight of each segment of a document extracted by the extraction unit, to generate a document feature vector of that document, calculates the similarity between the documents in the extracted document group using the document feature vectors, and arranges the extracted documents on a map according to the similarity between the documents.

The document extraction system according to claim 5 or 6, wherein the extraction unit obtains, for each document, the highest of the scores included in one document, and extracts from the plurality of documents the documents within a predetermined range ranked highest by that highest score.

The document extraction system according to claim 5 or 6, wherein the extraction unit calculates a comparison value for each document using those scores, among the scores included in one document, that fall within a predetermined range of high scores, and extracts from the plurality of documents the documents within a predetermined range ranked highest by the comparison value.