WO2015159702A1 - Partial-information extraction system - Google Patents

Info

Publication number
WO2015159702A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
segment
partial
condition
feature vector
Prior art date
Application number
PCT/JP2015/060087
Other languages
French (fr)
Japanese (ja)
Inventor
佳男 高枝
哲也 金田
弘海 矢野
康生 大原
Original Assignee
株式会社toor
サイバネットシステム株式会社
Priority date
Filing date
Publication date
Application filed by 株式会社toor, サイバネットシステム株式会社
Publication of WO2015159702A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present invention relates to a partial information extraction system that further divides a plurality of pieces of information into partial information and extracts partial information close to target information.
  • Patent Document 1 calculates the appearance frequency of a keyword included in a document to be searched for each paragraph, and extracts a paragraph having a high appearance frequency.
  • an important word used repeatedly in the conditional sentence carries the same weight as a word that appears only once. That is, describing the conditional sentence in more detail does not change the search accuracy, or even lowers it. In addition, because the index is created only from the conditional sentence, the number of words is limited and the accuracy of calculating the similarity between extracted partial documents drops. As a result, a person ends up having to read all of the extraction results, which takes effort and time.
  • the object of the present invention is to realize a partial search with high accuracy in a short time.
  • the inventors found that, rather than a keyword-based search method, it is effective to generate feature vectors for the condition and for the unit documents of the document group to be searched, based on the appearance frequency of words, and to compare the two. That is, they found that when the condition is made more detailed, many words related to the keywords, including general-purpose words, come to be used; as a result, the keyword variation caused by synonyms and the like is mitigated and search accuracy improves.
  • an index that is the basis for calculating the appearance frequency of words is extracted from the condition
  • an index is extracted from the entire search target document.
  • a feature vector of a condition and a partial document (hereinafter referred to as a document segment) is also generated based on the index, and the similarity between the two is calculated.
  • the feature vector of the document segment does not change even if the conditional sentence is changed. Therefore, the feature vector of the document segment need only be calculated once, and it is not necessary to redo generation of the feature vector. Therefore, it is possible to extract similar document segments at high speed for various conditional sentences.
  • the similarity between the document segments included in the search result based on the condition can be calculated, and the search result can be clustered by content.
  • the partial information extraction method extracts partial information close to the concept of a condition from a plurality of pieces of information, and includes the following procedures in order:
  • a vector generation procedure in which a feature vector generation unit generates an index from a group of information to be searched, divides the information into a plurality of predetermined segments, and generates, for each segment, a feature vector based on a vector space model using the index;
  • a vector determination procedure in which a vector determination unit generates a feature vector of the condition as a condition vector and calculates the similarity between the condition vector and the feature vector of each segment; and
  • a partial extraction procedure in which a partial extraction unit uses the similarity between the condition vector and the feature vector of each segment to extract, according to a predetermined criterion, the segments close to the condition.
  • the partial information extraction method may further include, after the partial extraction procedure, a clustering procedure in which a clustering unit calculates the similarity between segments using the feature vectors of the segments extracted in the partial extraction procedure and, based on that similarity, classifies the extracted segments into a plurality of information clusters.
  • the partial information extraction method may further include, after the partial extraction procedure, a mapping procedure in which a mapping unit arranges the segments extracted in the partial extraction procedure on a map according to the similarity between the segments.
  • the partial information extraction system extracts partial information close to the concept of a condition from a plurality of documents, and includes: a feature vector generation unit that generates an index from the information group to be searched, divides the information into a plurality of predetermined segments, and generates, for each segment, a feature vector based on a vector space model using the index; a vector determination unit that generates a feature vector of the condition as a condition vector and calculates the similarity between the condition vector and the feature vector of each segment; and a partial extraction unit that uses the similarity between the condition vector and the feature vector of each segment to extract, according to a predetermined criterion, the segments close to the condition.
  • the partial information extraction system may further include a clustering unit that calculates the similarity between segments using the feature vectors of the segments extracted by the partial extraction unit and, based on that similarity, classifies the extracted segments into a plurality of information clusters.
  • the partial information extraction system may further include a mapping unit that arranges the segments extracted by the partial extraction unit on a map according to the similarity between the segments.
  • a partial search with high accuracy can be realized in a short time.
  • FIG. 1 shows a configuration example of the partial information extraction system according to Embodiment 1.
  • FIG. 2 shows a sequence of the partial information extraction system according to Embodiment 1.
  • FIG. 3 shows a configuration example of the partial information extraction system according to Embodiment 2.
  • FIG. 4 shows a sequence of the partial information extraction system according to Embodiment 2.
  • FIG. 5 shows an example of a map.
  • FIG. 1 shows a configuration example of a partial information extraction system according to this embodiment.
  • the partial information extraction system according to this embodiment includes a server 10, a storage 20, and a user terminal 30.
  • the storage 20 is an arbitrary storage medium accessible from the server 10.
  • the server 10 and the user terminal 30 are computers having computer resources such as a CPU (Central Processing Unit) and a storage medium, and a program is installed in the storage medium. Any number of servers 10, storages 20, and user terminals 30 may be employed. In the present embodiment, a case where there is one server 10, two storages 20, and one user terminal 30 will be described.
  • the storage 20 holds an information group.
  • the information group includes arbitrary data transmitted and received via the communication network, for example text, numerical data, log data, and customer information. Examples of text include patents, papers, books, reports, and web pages. Examples of numerical data include sensor data, measurement data, and POS (Point Of Sales) data. Examples of log data include online access data and status data of various devices. In the present embodiment, a case where the information is a document will be described as an example.
  • FIG. 2 shows a sequence of the partial information extraction system according to this embodiment.
  • the server 10 acquires a document from the storage 20, divides the acquired document into a plurality of predetermined segments, and generates a feature vector based on the vector space model based on the index for each segment (S101). It is preferable that the feature vector of each segment is stored in the secondary storage 20 separately from the original information group and used for the subsequent calculation of similarity.
  • the original information group is not used at all in the calculation stage, and is used only when displaying the original information in the final stage.
  • User terminal 30 transmits a condition via a communication network (S102).
  • the server 10 acquires the feature vector of each segment from the storage 20 (S103), extracts the segments whose feature vectors are close to the feature vector of the condition (S104), and transmits the extraction result to the user terminal 30 (S105).
  • the user terminal 30 displays the extraction result received from the server 10 (S106).
  • the server 10 includes a communication function unit (not shown) that transmits / receives information to / from the user terminal 30 and the storage 20 via a communication network, and a configuration for extracting a segment.
  • the configuration for extracting a segment includes, for example, a feature vector generation unit 11, a vector determination unit 12, and a partial extraction unit 13.
  • the server 10 may be realized by causing a computer to function as the feature vector generation unit 11, the vector determination unit 12, and the partial extraction unit 13. In this case, each configuration is realized by the CPU in the server 10 executing a computer program stored in a storage unit (not shown).
  • the server 10 executes the partial information extraction method according to the present embodiment when extracting the segment.
  • the partial information extraction method according to this embodiment includes a vector generation procedure (S101), a vector determination procedure (S103), and a partial extraction procedure (S104) in this order.
  • in the vector generation procedure (S101), the feature vector generation unit 11 generates a feature vector based on the vector space model for each segment.
  • the elements constituting the feature vector, that is, the index, are not defined by the conditional sentence but are generated from the search target information group. Since the index of the feature vector does not depend on the conditional sentence, the feature vector does not deteriorate depending on how the conditional sentence is written. Further, even if the conditional sentence changes, the feature vector of the same segment can always be reused, so the processing load on the server 10 is small.
  • the segment is, for example, a paragraph or a sentence.
  • paragraph units are identified by detecting line breaks.
  • sentence units are identified by detecting a period “.”, a question mark “?”, or an exclamation mark “!”.
  • the index is a vector element, for example, a word list. In the present embodiment, as an example, a case where a segment is a paragraph and an index is a word list will be described.
  • the vector determination unit 12 determines, for each segment $d_i$, how close its content is to the condition $d_k$. For example, the vector determination unit 12 vectorizes the condition $d_k$ based on the vector space model and then determines the proximity between the condition vector and the feature vector.
  • the information $d_i$ can be expressed as a matrix with respect to the elements $t_j$.
  • the condition can be described by a condition vector whose elements are words included in the condition.
  • a segment can also be described by a segment vector whose elements are words included in the segment.
  • segment $d_i$ can be represented by the concept vector $d_i = (n_{i1}, n_{i2}, n_{i3}, \ldots)$.
  • for example, suppose the numbers of occurrences of the words $t_1$, $t_2$, $t_3$ are 0, 1, 0 in segment $d_1$; 2, 1, 0 in segment $d_2$; and 1, 2, 3 in segment $d_3$.
  • the matrix $M$ of the segments, with one row per segment and one column per word, is then expressed as $M = \begin{pmatrix} 0 & 1 & 0 \\ 2 & 1 & 0 \\ 1 & 2 & 3 \end{pmatrix}$.
  • the closeness of content between the segment $d_i$ and the condition $d_k$ can be quantified by a calculation on the feature vector $d_i$ and the condition vector $d_k$.
  • the calculation used for the quantification may be a distance between the vectors, or any other calculation such as an inner product or outer product.
  • words that are used commonly across all segments do not affect how close the contents of documents are. Therefore, in calculating the vector, it is preferable to give words that are characteristic of each document a larger contribution to the vector than other words. For example, weighting is performed using the tfidf (Term Frequency Inverse Document Frequency) method. This improves the precision with which the closeness of segment content is determined.
  • the tfidf weight of a word that is used to the same degree in every document is small, and the tfidf weight of a word whose frequency of use differs greatly from document to document is large.
  • the determination of the closeness of the content may be performed based on the presence or absence of a word included in the condition, for example.
  • a determination is made based on whether the keyword is included or not included for each segment.
  • a logical expression is formed, and each segment is determined by a binary value indicating whether or not the logical expression is met.
  • the partial extraction unit 13 extracts a segment having a vector close to a predetermined condition from a plurality of segments.
  • the segment to be extracted may be a predetermined number of segments, or may be a segment whose vector is within a predetermined proximity range. In this way, by extracting segments with similar vectors, it is possible to extract only portions that are close to the concept constituted by the search conditions.
  • clustering processing may be performed.
  • the partial extraction unit 13 classifies the extracted segments into a plurality of information clusters based on the similarity between the segments using the similarity between the condition vector and the segment feature vector.
  • the classification is performed, for example, by grouping segments into a common cluster in order of closest vector distance. By performing the clustering process in this way, the user terminal 30 can be provided with a result in which the contents described in the segments are classified hierarchically.
  • the document in the present invention is not limited to this.
  • the segment is, for example, a time period or point in time, a region or place, or an attribution. If the document includes customer data, the segment is, for example, a time period or point in time, a region or location, an attribution, or an age.
  • the unit of time is arbitrary, for example, it may be a second unit or a year unit.
  • vectorization based on the vector space model is performed as follows.
  • when the document is access log data of users of an online service, $n_{ij}$ is the number of accesses by user $t_j$ between times $d_i$ and $d_i + T$ (time interval $T$).
  • when the document is sensor data, $n_{ij}$ is the output value of sensor $t_j$ between times $d_i$ and $d_i + T$ (time interval $T$).
  • when the document is an image, the image $d_i$ is frequency-converted, and $n_{ij}$ is the value of each frequency component $t_j$ after the conversion.
  • the weighting tfidf is performed as follows.
  • when the document is access log data of users of an online service, the tfidf weight of a user who accesses evenly is small, and the tfidf weight of a user whose access is highly uneven is large.
  • when the document is sensor data, the tfidf weight of a sensor whose output value changes little is small, and the tfidf weight of a sensor whose output value changes greatly is large.
  • when the document is an image, the tfidf weight of a frequency whose component value varies little between images is small, and the tfidf weight of a frequency whose component value varies greatly between images is large.
  • FIG. 3 shows a configuration example of the partial information extraction system according to the present embodiment.
  • the partial information extraction system according to the present embodiment further includes a mapping unit 14 in addition to the configuration of the first embodiment.
  • FIG. 4 shows a sequence of the partial information extraction system according to this embodiment.
  • the partial information extraction method according to the present embodiment further includes a mapping procedure (S107) after the partial extraction procedure (S104) described in the first embodiment.
  • the server 10 transmits the map created by the mapping procedure to the user terminal 30 (S108).
  • the user terminal 30 displays the map received from the server 10 (S109).
  • in the mapping procedure (S107), points representing the condition and the segments extracted by the partial extraction unit 13 are arranged on the map according to the closeness of content between the vectors, based on the vector values created by the vector determination unit 12.
  • the closeness between feature vectors is calculated, and based on the closeness between the vectors, mapping based on the closeness of contents between information, that is, “semantic distance” is performed.
  • the calculation may be a distance between vectors, or an arbitrary calculation such as inner product or outer product.
  • an information cluster including a plurality of segments may be arranged on the map. Based on the closeness of content obtained between the pieces of information $d_i$, a map such as the one shown in FIG. 5 can be created using a mapping algorithm.
  • the system according to the present embodiment can extract segments using concept search and map the distribution of the contents of each segment using a vector calculated using concept search.
  • the present invention can be applied to the information and communication industry.
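The tfidf weighting mentioned in the definitions above can be illustrated on the three-segment word-count example (occurrence counts 0, 1, 0; 2, 1, 0; and 1, 2, 3 for words t_1 to t_3). This is only a sketch: the text does not give a tfidf formula, so the common variant "raw count times log(N / document frequency)" and the name `tfidf_weight` are assumptions.

```python
import math

def tfidf_weight(counts):
    # counts[i][j] = number of occurrences of word t_j in segment d_i.
    n_segments = len(counts)
    n_words = len(counts[0])
    # Document frequency: in how many segments each word appears at least once.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_words)]
    # Assumed idf variant: log(N / df); a word present in every segment gets weight 0.
    idf = [math.log(n_segments / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_words)] for row in counts]

# Rows are segments d_1..d_3, columns are words t_1..t_3.
M = [[0, 1, 0],
     [2, 1, 0],
     [1, 2, 3]]
W = tfidf_weight(M)
```

With this variant, the word t_2, which appears in every segment, contributes nothing to any vector, while t_3, which appears in only one segment, receives the largest weight.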

Abstract

The purpose of this invention is to implement fast, precise partial searches. The partial-information extraction method in this invention includes the following steps, in this order: a vector generation step (S101) in which information being searched is divided into a predetermined plurality of segments and a feature vector is generated for each segment; a vector determination step (S103) in which a condition vector consisting of a feature vector associated with a condition is generated, and for each segment, the degree of similarity between the condition vector and the feature vector for that segment is calculated; and a partial extraction step (S104) in which the degrees of similarity between the condition vector and the feature vectors for the respective segments are used to extract segments that, using predetermined criteria, are close to the aforementioned condition.

Description

Partial information extraction system
The present invention relates to a partial information extraction system that further divides a plurality of pieces of information into partial information and extracts the partial information close to target information.
A document is taken as an example of information. Systems that search a large number of documents for documents with similar content have been proposed (see, for example, Patent Document 1). Patent Document 1 calculates, for each paragraph, the appearance frequency of keywords included in the documents to be searched, and extracts paragraphs with a high appearance frequency.
JP 2013-30089 A
The description content that one wants to find is used as the search condition, and partial description content close to that text is extracted from the group of texts to be searched. In the invention of Patent Document 1, words for creating an index are extracted from the conditional sentence, the appearance frequency of each index word is calculated for every page of the documents to be searched, and the document pages are weighted. In this method, however, the index differs depending on the conditional sentence, so the word appearance frequencies in the target documents must be recalculated every time the conditional sentence is changed, which takes calculation time. Furthermore, the conditional sentence is used only for index extraction, and the appearance frequency of words within the conditional sentence is not calculated. As a result, an important word used repeatedly in the conditional sentence carries the same weight as a word that appears only once. That is, describing the conditional sentence in more detail does not change the search accuracy, or even lowers it. In addition, because the index is created only from the conditional sentence, the number of words is limited and the accuracy of calculating the similarity between extracted partial documents drops, so a person ultimately has to read all of the extraction results to find the truly desired information, which takes effort and time.
Thus, in the invention of Patent Document 1, the index changes every time the conditional sentence is changed, so the appearance frequency of words in the documents based on the index must be recalculated each time, and making the condition more detailed does not improve search accuracy. Furthermore, finding the truly desired information among the extraction results takes effort.
The object of the present invention is to realize a highly accurate partial search in a short time.
In conventional keyword-based search methods, text that uses synonyms or other words in place of the keywords cannot be retrieved even when its content is important. Various methods, such as using a synonym dictionary, have been proposed to prevent this, but because dictionary creation and the like differ from developer to developer, problems remain, such as search results not being reproducible.
The inventors found that, rather than a keyword-based search method, it is effective to generate feature vectors for the condition and for the unit documents of the document group to be searched, based on the appearance frequency of words, and to compare the two. That is, they found that when the condition is made more detailed, many words related to the keywords, including general-purpose words, come to be used; as a result, the keyword variation caused by synonyms and the like is mitigated and search accuracy improves.
Furthermore, if the index that serves as the basis for calculating word appearance frequencies is extracted from the condition, the index changes every time the condition changes. To solve this problem, the index is extracted from the entire set of documents to be searched. The feature vectors of the condition and of the partial documents (hereinafter referred to as document segments) are both generated from that index, and the similarity between the two is calculated. With this method, the feature vector of a document segment does not change even if the conditional sentence is changed, so the feature vectors of the document segments need to be calculated only once at the outset and never regenerated. Similar document segments can therefore be extracted at high speed for a variety of conditional sentences.
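As a minimal sketch of this idea (not the patent's implementation), the following Python builds the index once from all segments, vectorizes each segment once, and then vectorizes any number of conditions against the same index. Whitespace tokenization, raw term counts, the sample texts, and the names `tokenize`, `build_index`, and `vectorize` are assumptions made for illustration.

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real word extraction.
    return text.lower().split()

def build_index(segments):
    # The index (word list) comes from the whole search target, never from the condition.
    words = sorted({w for seg in segments for w in tokenize(seg)})
    return {w: j for j, w in enumerate(words)}

def vectorize(text, index):
    # Count occurrences of each indexed word; words outside the index are ignored.
    vec = [0] * len(index)
    for word, count in Counter(tokenize(text)).items():
        if word in index:
            vec[index[word]] = count
    return vec

segments = [
    "vector space model for search",
    "keyword search with synonyms",
    "sensor data log",
]
index = build_index(segments)
segment_vectors = [vectorize(s, index) for s in segments]  # computed once

# Different conditions reuse the same index and the same segment vectors.
condition_a = vectorize("search search model", index)
condition_b = vectorize("sensor log", index)
```

Because a word repeated in the condition is counted as many times as it occurs, a more detailed condition changes the condition vector, which is the behaviour the paragraph above contrasts with index-only use of the condition.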
Furthermore, using the document segment feature vectors generated in this way, the similarity between the document segments included in the results of a condition-based search can also be calculated, so the search results can be clustered by content.
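Clustering search results by content could be sketched as follows. The text does not fix a similarity measure or a clustering algorithm, so cosine similarity, merging the closest pairs first down to a cut-off (single-linkage grouping), and the names `cluster_segments` and `min_similarity` are all assumptions for illustration.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors; 0.0 if either vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_segments(vectors, min_similarity):
    # Union-find over segment indices; each root identifies one cluster.
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Visit segment pairs from most similar to least similar.
    pairs = sorted(
        ((cosine_similarity(vectors[i], vectors[j]), i, j)
         for i, j in combinations(range(len(vectors)), 2)),
        reverse=True,
    )
    for similarity, i, j in pairs:
        if similarity < min_similarity:
            break  # remaining pairs are even less similar
        parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Hypothetical feature vectors of four extracted segments.
extracted = [[1, 0, 0], [2, 1, 0], [0, 0, 3], [0, 1, 4]]
clusters = cluster_segments(extracted, min_similarity=0.8)
```

Here segments 0 and 1 share their dominant word, as do segments 2 and 3, so the sketch groups them into two clusters.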
Specifically, the partial information extraction method according to the present invention is
a partial information extraction method for extracting partial information close to the concept of a condition from a plurality of pieces of information, the method including, in order:
a vector generation procedure in which a feature vector generation unit generates an index from a group of information to be searched, divides the information into a plurality of predetermined segments, and generates, for each segment, a feature vector based on a vector space model using the index;
a vector determination procedure in which a vector determination unit generates a feature vector of the condition as a condition vector and calculates the similarity between the condition vector and the feature vector of each segment; and
a partial extraction procedure in which a partial extraction unit uses the similarity between the condition vector and the feature vector of each segment to extract, according to a predetermined criterion, the segments close to the condition.
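The vector determination and partial extraction procedures could be sketched as follows. Cosine similarity is an assumed choice of similarity measure, and the "predetermined criterion" is modelled here as either a fixed number of segments (`top_k`) or a similarity cut-off (`min_similarity`); both parameter names and the sample vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors; 0.0 if either vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def extract_segments(condition_vec, segment_vecs, top_k=None, min_similarity=None):
    # Vector determination: score every segment against the condition vector.
    scored = [(cosine_similarity(condition_vec, v), i) for i, v in enumerate(segment_vecs)]
    # Most similar first; ties are broken by segment position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Partial extraction: apply the predetermined criterion.
    if min_similarity is not None:
        scored = [pair for pair in scored if pair[0] >= min_similarity]
    if top_k is not None:
        scored = scored[:top_k]
    return [(i, similarity) for similarity, i in scored]

# Hypothetical precomputed segment vectors and one condition vector.
segment_vecs = [[0, 1, 0], [2, 1, 0], [1, 2, 3]]
condition_vec = [1, 1, 0]
best = extract_segments(condition_vec, segment_vecs, top_k=2)
```

Only the condition vector is computed per query; the segment vectors are passed in as already generated, matching the order of the three procedures.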
The partial information extraction method according to the present invention may further include, after the partial extraction procedure, a clustering procedure in which a clustering unit calculates the similarity between segments using the feature vectors of the segments extracted in the partial extraction procedure and, based on that similarity, classifies the extracted segments into a plurality of information clusters.
The partial information extraction method according to the present invention may further include, after the partial extraction procedure, a mapping procedure in which a mapping unit arranges the segments extracted in the partial extraction procedure on a map according to the similarity between the segments.
Specifically, the partial information extraction system according to the present invention is
a partial information extraction system that extracts partial information close to the concept of a condition from a plurality of documents, the system including:
a feature vector generation unit that generates an index from the information group to be searched, divides the information into a plurality of predetermined segments, and generates, for each segment, a feature vector based on a vector space model using the index;
a vector determination unit that generates a feature vector of the condition as a condition vector and calculates the similarity between the condition vector and the feature vector of each segment; and
a partial extraction unit that uses the similarity between the condition vector and the feature vector of each segment to extract, according to a predetermined criterion, the segments close to the condition.
The partial information extraction system according to the present invention may further include a clustering unit that calculates the similarity between segments using the feature vectors of the segments extracted by the partial extraction unit and, based on that similarity, classifies the extracted segments into a plurality of information clusters.
The partial information extraction system according to the present invention may further include a mapping unit that arranges the segments extracted by the partial extraction unit on a map according to the similarity between the segments.
According to the present invention, a highly accurate partial search can be realized in a short time.
FIG. 1 shows a configuration example of the partial information extraction system according to Embodiment 1.
FIG. 2 shows a sequence of the partial information extraction system according to Embodiment 1.
FIG. 3 shows a configuration example of the partial information extraction system according to Embodiment 2.
FIG. 4 shows a sequence of the partial information extraction system according to Embodiment 2.
FIG. 5 shows an example of a map.
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. The present invention is not limited to the embodiments shown below. These embodiments are merely examples, and the present invention can be implemented in forms with various modifications and improvements based on the knowledge of those skilled in the art. In this specification and the drawings, components with the same reference numerals are identical to one another.
(Embodiment 1)
FIG. 1 shows a configuration example of the partial information extraction system according to this embodiment. The system includes a server 10, a storage 20, and a user terminal 30. The storage 20 is any storage medium accessible from the server 10. The server 10 and the user terminal 30 are computers having computing resources such as a CPU (Central Processing Unit) and a storage medium, with a program installed on the storage medium. Any number of servers 10, storages 20, and user terminals 30 may be employed; this embodiment describes the case of one server 10, two storages 20, and one user terminal 30.
The storage 20 holds an information group. The information group includes any data transmitted and received via a communication network, for example, text, numerical data, log data, and customer information. Examples of text include patents, papers, books, reports, and web pages. Examples of numerical data include sensor data, measurement data, and POS (Point Of Sales) data. Examples of log data include online access data and status data of various devices. This embodiment describes, as an example, the case where the information is a document.
FIG. 2 shows a sequence of the partial information extraction system according to this embodiment. The server 10 acquires a document from the storage 20, divides it into a plurality of predetermined segments, and generates, for each segment, a feature vector according to the vector space model using the index (S101). The feature vector of each segment is preferably stored in a secondary storage 20, separately from the original information group, and used for the subsequent similarity calculations. The original information group is not used at all in the calculation stage; it is used only in the final stage, when the original information is displayed.
The user terminal 30 transmits a condition via the communication network (S102). On receiving the condition from the user terminal 30, the server 10 acquires the feature vector of each segment from the storage 20 (S103), extracts the segments whose feature vectors are close to the feature vector of the condition (S104), and transmits the extraction result to the user terminal 30 (S105). The user terminal 30 displays the extraction result received from the server 10 (S106).
The server 10 includes a communication function unit (not shown) that exchanges information with the user terminal 30 and the storage 20 via the communication network, and a configuration for extracting segments. The configuration for extracting segments includes, for example, a feature vector generation unit 11, a vector determination unit 12, and a partial extraction unit 13. The server 10 may be realized by causing a computer to function as the feature vector generation unit 11, the vector determination unit 12, and the partial extraction unit 13. In this case, the CPU in the server 10 realizes each unit by executing a computer program stored in a storage unit (not shown).
When extracting segments, the server 10 executes the partial information extraction method according to this embodiment. The method includes, in order, a vector generation procedure (S101), a vector determination procedure (S103), and a partial extraction procedure (S104).
In the vector generation procedure (S101), the feature vector generation unit 11 generates, for each segment, a feature vector based on the vector space model. The elements constituting the feature vector, that is, the index, are not determined by the conditional sentence but are generated from the information group to be searched. Because the index of the feature vector does not depend on the conditional sentence, the feature vector does not degrade depending on how the conditional sentence is written. Moreover, the same segment feature vectors can always be reused even when the conditional sentence changes, so the processing load on the server 10 is small.
When the document contains text, a segment is, for example, a paragraph or a sentence. Paragraphs are identified, for example, by detecting line breaks. Sentences are identified by detecting a full stop (“。” or “.”), a question mark (“?”), or an exclamation mark (“!”). The index is the set of elements of the vector, for example, a word list. This embodiment describes, as an example, the case where a segment is a paragraph and the index is a word list.
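The segmentation described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names `split_paragraphs` and `split_sentences` are hypothetical, and the full-width marks “?” and “!” are added to the terminator set as an assumption.

```python
import re

def split_paragraphs(text: str) -> list[str]:
    """Identify paragraph segments by detecting line breaks; empty lines are dropped."""
    return [p.strip() for p in text.splitlines() if p.strip()]

# Terminators named in the description: "。", ".", "?", "!".
# The full-width "?" and "!" are an assumption added for Japanese text.
_SENTENCE = re.compile(r"[^。.?!?!]*[。.?!?!]+|[^。.?!?!]+$")

def split_sentences(paragraph: str) -> list[str]:
    """Identify sentence segments by detecting sentence-final punctuation."""
    return [s.strip() for s in _SENTENCE.findall(paragraph) if s.strip()]
```

A splitter this simple also cuts at decimal points and abbreviations, so a practical system would likely need a more careful tokenizer.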
In the vector determination procedure (S103), the vector determination unit 12 determines, for each segment $d_i$, how close its content is to the condition $d_k$. For example, the vector determination unit 12 vectorizes the condition $d_k$ based on the vector space model and then determines the closeness of the condition vector and each feature vector.
When information $d_i$ can be expressed in matrix form with respect to elements $t_j$, it can be described by the vector space model $d_i = (t_1, t_2, t_3, \ldots)$. The condition can therefore be described by a condition vector whose elements are the words contained in the condition, and a segment can likewise be described by a segment vector whose elements are the words contained in the segment.
Letting $n_{ij}$ be the frequency with which element $t_j$ appears in segment $d_i$, segment $d_i$ can be represented by the concept vector $d_i = (n_{i1}, n_{i2}, n_{i3}, \ldots)$. For example, if words $t_1$, $t_2$, $t_3$ appear 0, 1, and 0 times in segment $d_1$, 2, 1, and 0 times in segment $d_2$, and 1, 2, and 3 times in segment $d_3$, the segment matrix $M$ is expressed as follows.
[Formula 1: segment matrix M (formula image not reproduced)]
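The formula image is not available in this text. From the appearance counts stated for segments $d_1$ to $d_3$, the matrix can be written out as below. The layout, with one row per segment and one column per word, is an assumption; the original figure may use the transposed arrangement.

```latex
M =
\begin{pmatrix}
n_{11} & n_{12} & n_{13} \\
n_{21} & n_{22} & n_{23} \\
n_{31} & n_{32} & n_{33}
\end{pmatrix}
=
\begin{pmatrix}
0 & 1 & 0 \\
2 & 1 & 0 \\
1 & 2 & 3
\end{pmatrix}
```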
The closeness in content between segment $d_i$ and condition $d_k$ can be quantified by an operation on the feature vector $d_i$ and the condition vector $d_k$. The operation used for the quantification may be the distance between the vectors, or any other operation such as an inner product or outer product.
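Two common ways to quantify this closeness are sketched below: the Euclidean distance between the vectors, and the cosine similarity derived from the inner product. The description leaves the operation open, so both choices and the function names are illustrative assumptions.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Inner product normalized by vector lengths; 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two vectors; smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For a hypothetical condition vector (0, 1, 0), the segment vector (0, 1, 0) has cosine similarity 1, while (2, 1, 0) lies at Euclidean distance 2 from it.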
Words that are used in common across all segments do not affect how close the contents are. In calculating the vectors, it is therefore preferable to differentiate the contribution of words characteristic of each document from that of other words. For example, weighting is performed using the tfidf (Term Frequency Inverse Document Frequency) method, which improves the accuracy of the closeness between segment contents. A word used similarly in every document receives a small tfidf weight, whereas a word whose frequency of use differs greatly between documents receives a large tfidf weight.
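The description does not give a formula for the weighting. The sketch below assumes one common variant, in which the count $n_{ij}$ is multiplied by $\log(N / df_j)$, where $N$ is the number of segments and $df_j$ is the number of segments containing word $t_j$. The function name `tfidf` is hypothetical.

```python
import math

def tfidf(matrix):
    """Weight a segment-by-word count matrix (rows = segments) by tf * log(N / df)."""
    n_segments = len(matrix)
    n_terms = len(matrix[0])
    # df[j]: number of segments in which word j appears at least once
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    # A word absent from every segment gets weight 0 to avoid division by zero.
    idf = [math.log(n_segments / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]
```

Under this variant, a word that appears in every segment receives weight zero, which matches the statement that words common to all segments should not influence closeness.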
Closeness of content may also be determined, for example, by the presence or absence of words contained in the condition. When the condition contains a single word, each segment is judged as a binary value: it either contains the keyword or does not. When the condition contains multiple words, a logical expression is constructed, and each segment is judged as a binary value: it either satisfies the logical expression or does not.
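The binary judgment can be sketched as follows. The representation of a logical expression as nested tuples such as `("and", "vector", ("or", "map", "index"))` is a hypothetical encoding chosen for this example; the description does not specify one.

```python
def matches(expr, words: set[str]) -> bool:
    """Return True if the set of words in a segment satisfies the expression.

    expr is either a single keyword (str) or a tuple whose first item is
    "and", "or" or "not", followed by sub-expressions.
    """
    if isinstance(expr, str):
        return expr in words
    op, *args = expr
    if op == "and":
        return all(matches(a, words) for a in args)
    if op == "or":
        return any(matches(a, words) for a in args)
    if op == "not":
        return not matches(args[0], words)
    raise ValueError(f"unknown operator: {op!r}")
```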
In the partial extraction procedure (S104), the partial extraction unit 13 extracts, from the plurality of segments, those whose vectors are close to the predetermined condition. The extracted segments may be a predetermined number of segments, or the segments whose vectors lie within a predetermined range of closeness. Extracting segments with close vectors in this way makes it possible to extract only the parts close to the concept expressed by the search condition.
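The two extraction criteria mentioned above, a fixed number of segments or a closeness range, can be sketched as follows. The function names and the use of a similarity score where larger means closer are assumptions for illustration.

```python
def extract_top_n(similarities: dict[str, float], n: int) -> list[str]:
    """Return the n segment ids with the highest similarity to the condition."""
    return sorted(similarities, key=similarities.get, reverse=True)[:n]

def extract_within(similarities: dict[str, float], min_similarity: float) -> list[str]:
    """Return every segment id whose similarity is at least min_similarity."""
    return [seg for seg, s in similarities.items() if s >= min_similarity]
```

With a distance measure instead of a similarity, the sort order and the comparison would simply be reversed.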
Clustering may also be performed in the partial extraction procedure (S104). In that case, the partial extraction unit 13 uses the similarity between the condition vector and the segment feature vectors and, based on the similarity between segments, classifies the extracted segments into a plurality of information clusters. For example, segments are assigned to a common cluster in order, starting from those with the smallest vector distance. Clustering in this way allows the contents of the segments to be classified hierarchically and the result provided to the user terminal 30.
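One way to group segments in order of increasing vector distance is single-linkage merging with a distance cutoff, sketched below. The description does not name a clustering algorithm, so this choice, the `max_distance` parameter, and the function names are assumptions.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster(vectors: dict[str, list[float]], max_distance: float) -> list[set[str]]:
    """Merge segments pairwise, closest pairs first, until pairs exceed max_distance."""
    parent = {name: name for name in vectors}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = sorted(combinations(vectors, 2),
                   key=lambda p: euclidean(vectors[p[0]], vectors[p[1]]))
    for a, b in pairs:
        if euclidean(vectors[a], vectors[b]) > max_distance:
            break  # all remaining pairs are farther apart
        parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for name in vectors:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())
```

Running the merge with a series of increasing cutoffs yields nested clusters, which is one way to obtain a hierarchical classification.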
Although this embodiment has described an example in which the document is text, the document in the present invention is not limited to this. When the document contains numerical data or log data, a segment is, for example, a point in time or a time period, a region or place, or an affiliation. When the document contains customer data, a segment is, for example, a point in time or a time period, a region or place, an affiliation, or an age. The unit of time is arbitrary; it may be, for example, seconds or years.
When the document contains numerical data or log data, vectorization based on the vector space model is performed as follows.
When the document is user access log data of an online service, let $n_{ij}$ be the number of accesses by user $t_j$ between times $d_i$ and $d_i + T$ (time interval $T$). Time $d_i$ can then be expressed as the vector $d_i = (n_{i1}, n_{i2}, n_{i3}, \ldots)$.
When the document is sensor data, let $n_{ij}$ be the output value of sensor $t_j$ between times $d_i$ and $d_i + T$ (time interval $T$). Time $d_i$ can then be expressed as the vector $d_i = (n_{i1}, n_{i2}, n_{i3}, \ldots)$.
When the document is image data, image $d_i$ is frequency-transformed, and $n_{ij}$ is the value of frequency component $t_j$ after the transform. Image $d_i$ can then be expressed as the vector $d_i = (n_{i1}, n_{i2}, n_{i3}, \ldots)$.
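The access log case can be sketched as follows: each time window of length $T$ becomes one segment, and each user becomes one vector element. The event format `(timestamp, user)` and the function name `window_vectors` are assumptions made for this example.

```python
def window_vectors(events, users, start, interval, n_windows):
    """Count accesses per user in consecutive windows [start + i*T, start + (i+1)*T).

    events: iterable of (timestamp, user) pairs
    users:  ordered list of user ids; position j corresponds to element t_j
    Returns one count vector per window (rows = time segments d_i).
    """
    column = {user: j for j, user in enumerate(users)}
    vectors = [[0] * len(users) for _ in range(n_windows)]
    for timestamp, user in events:
        i = int((timestamp - start) // interval)
        # Ignore events outside the covered time range or from unknown users.
        if 0 <= i < n_windows and user in column:
            vectors[i][column[user]] += 1
    return vectors
```

Sensor data could be handled the same way by storing an aggregate of the output values per window in place of a count.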
When the document contains numerical data or log data, the tfidf weighting behaves as follows.
When the document is user access log data of an online service, a user who accesses at a steady rate throughout receives a small tfidf weight, and a user whose access is highly uneven receives a large tfidf weight.
When the document is sensor data, a sensor whose output value changes little receives a small tfidf weight, and a sensor whose output value changes greatly receives a large tfidf weight.
When the document is image data, a frequency whose component value varies little between images receives a small tfidf weight, and a frequency whose component value varies greatly between images receives a large tfidf weight.
(Embodiment 2)
FIG. 3 shows a configuration example of the partial information extraction system according to this embodiment. In addition to the configuration of Embodiment 1, the system further includes a mapping unit 14.
FIG. 4 shows a sequence of the partial information extraction system according to this embodiment. The partial information extraction method according to this embodiment further includes a mapping procedure (S107) after the partial extraction procedure (S104) described in Embodiment 1. The server 10 transmits the map created in the mapping procedure to the user terminal 30 (S108). The user terminal 30 displays the map received from the server 10 (S109).
In the mapping procedure (S107), points representing the segments extracted by the partial extraction unit 13 and the condition are arranged on a map, based on the vector values created by the vector determination unit 12, according to the closeness in content between the vectors.
The closeness between feature vectors is calculated, and mapping is performed based on the closeness in content between pieces of information, that is, the "semantic distance". The operation may be the distance between vectors, or any other operation such as an inner product or outer product. When the partial extraction unit 13 has performed clustering, information clusters each containing a plurality of segments may be arranged on the map. Based on the resulting closeness in content between pieces of information $d_i$, a map such as the one shown in FIG. 5 can be created using a mapping algorithm.
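The mapping algorithm itself is not specified. As one simple illustration, the sketch below places every point on a plane so that its distances to two chosen anchor points, for example the condition and one segment, are preserved; distances between other pairs are only approximated. The anchor-based layout and the function names are assumptions, not the algorithm behind FIG. 5.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def map_to_plane(vectors, anchor_a, anchor_b):
    """Place each vector at 2-D coordinates that keep its distances to two anchors.

    anchor_a is mapped to the origin and anchor_b onto the positive x-axis.
    """
    base = euclidean(vectors[anchor_a], vectors[anchor_b])
    if base == 0:
        raise ValueError("the two anchors must be distinct points")
    points = {}
    for name, v in vectors.items():
        da = euclidean(v, vectors[anchor_a])
        db = euclidean(v, vectors[anchor_b])
        # Intersection of two circles centred on the anchors (law of cosines).
        x = (da * da - db * db + base * base) / (2 * base)
        y = math.sqrt(max(da * da - x * x, 0.0))
        points[name] = (x, y)
    return points
```

A practical system would more likely use a layout method that balances all pairwise distances, such as multidimensional scaling or a self-organizing map.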
The system according to this embodiment can extract segments using concept search and map the distribution of the contents of the segments using the vectors calculated in the concept search.
The present invention can be applied to the information and communication industry.
10: Server
11: Feature vector generation unit
12: Vector determination unit
13: Partial extraction unit
14: Mapping unit
20: Storage
30: User terminal
31: Clustering unit

Claims (6)

  1.  A partial information extraction method for extracting, from a plurality of pieces of information, partial information close to the concept of a condition, the method comprising, in order:
     a vector generation procedure in which a feature vector generation unit generates an index from an information group to be searched, divides the information into a plurality of predetermined segments, and generates, for each segment, a feature vector according to a vector space model based on the index;
     a vector determination procedure in which a vector determination unit generates a feature vector of the condition as a condition vector and calculates the similarity between the condition vector and the feature vector of each segment; and
     a partial extraction procedure in which a partial extraction unit extracts, using the similarity between the condition vector and the feature vector of each segment, the segments close to the condition according to a predetermined criterion.
  2.  The partial information extraction method according to claim 1, wherein, in the partial extraction procedure, the extracted segments are classified into a plurality of information clusters based on the similarity between the segments, using the similarity between the condition vector and the feature vector of each segment.
  3.  The partial information extraction method according to claim 1 or 2, further comprising, after the partial extraction procedure, a mapping procedure in which a mapping unit arranges the segments extracted in the partial extraction procedure on a map according to the similarity between the segments.
  4.  A partial information extraction system for extracting, from a plurality of documents, partial information close to the concept of a condition, the system comprising:
     a feature vector generation unit that generates an index from an information group to be searched, divides the information into a plurality of predetermined segments, and generates, for each segment, a feature vector according to a vector space model based on the index;
     a vector determination unit that generates a feature vector of the condition as a condition vector and calculates the similarity between the condition vector and the feature vector of each segment; and
     a partial extraction unit that extracts, using the similarity between the condition vector and the feature vector of each segment, the segments close to the condition according to a predetermined criterion.
  5.  The partial information extraction system according to claim 4, wherein the partial extraction unit classifies the extracted segments into a plurality of information clusters based on the similarity between the segments, using the similarity between the condition vector and the feature vector of each segment.
  6.  The partial information extraction system according to claim 4 or 5, further comprising a mapping unit that arranges the segments extracted by the partial extraction unit on a map according to the similarity between the segments.
PCT/JP2015/060087 2014-04-14 2015-03-31 Partial-information extraction system WO2015159702A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2014082779A JP2015203960A (en) 2014-04-14 2014-04-14 partial information extraction system
JP2014-082779 2014-04-14

Publications (1)

Publication Number Publication Date
WO2015159702A1 true WO2015159702A1 (en) 2015-10-22

Family

ID=54323913

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2015/060087 WO2015159702A1 (en) 2014-04-14 2015-03-31 Partial-information extraction system

Country Status (2)

Country Link
JP (1) JP2015203960A (en)
WO (1) WO2015159702A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294733B (en) * 2016-08-10 2019-05-07 成都轻车快马网络科技有限公司 Page detection method based on text analyzing
JP7068106B2 (en) * 2018-08-28 2022-05-16 株式会社日立製作所 Test plan formulation support device, test plan formulation support method and program

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10207911A (en) * 1996-11-25 1998-08-07 Fuji Xerox Co Ltd Document retrieving device
JP2004213626A (en) * 2002-11-27 2004-07-29 Sony United Kingdom Ltd Storage and retrieval of information
JP2004295712A (en) * 2003-03-28 2004-10-21 Hitachi Ltd Method and device for retrieving similar document
JP2013182466A (en) * 2012-03-02 2013-09-12 Kurimoto Ltd Web search system and web search method

Also Published As

Publication number Publication date
JP2015203960A (en) 2015-11-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15779391

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15779391

Country of ref document: EP

Kind code of ref document: A1