JP4350026B2

JP4350026B2 - Topic scope extraction device, control method thereof, and program

Info

Publication number: JP4350026B2
Application number: JP2004328283A
Authority: JP
Inventors: 克人別所; 雅博奥
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-11-11
Filing date: 2004-11-11
Publication date: 2009-10-21
Anticipated expiration: 2024-11-11
Also published as: JP2006139520A

Description

The present invention relates to a topic scope extraction device that extracts, from a message sequence in which a plurality of topics are mixed, a set of topic scopes, each of which is a set of messages forming a semantic unit. In particular, it targets message sequences, such as those on bulletin boards, that are arranged in chronological order as a flow of conversation and that consist of a plurality of topics.

A portal site such as a bulletin board is a set of threads (a thread consists of an original post and the comments on it), and each thread is a message sequence. This message sequence is a flow of conversation arranged in chronological order and consists of a plurality of topics.

One kind of search over such a set of threads takes a natural-language sentence as input and retrieves the set of messages that match the input sentence.

One conceivable method retrieves matching messages according to the semantic similarity between each individual message and the input sentence.

However, messages on bulletin boards and the like are characteristically very short and contain few words. A single message does not constitute a complete topic on its own, and it carries too little information to be treated as a document in the sense of general document retrieval.

A single topic in a message sequence is made up of multiple messages, and each message is merely one member of a group of messages that forms a semantic unit. Only such a group carries an amount of information comparable to that of a document in general document retrieval.

With the above method, merely retrieving messages that are highly similar to the input sentence does not retrieve the semantically coherent group of messages that contains them. As a result, many messages that should be retrieved are missing from the search results.

It is therefore appropriate to extract topic scopes, which are sets of messages forming semantic units, from the message sequence and to search for the topic scopes that match the input sentence.

As a technique for extracting sets of messages that form semantic units, it is known to represent each message by a feature vector whose coordinates are words and whose coordinate values are the importance of each word in the message, and to cluster the set of messages using these feature vectors so that messages on the same topic fall into the same cluster (see, for example, Non-Patent Document 1).
Non-Patent Document 1: 余東明, 石川孝, "Topic Extraction for Active Information Search in Community Web", The 17th Annual Conference of the Japanese Society for Artificial Intelligence, pp. 1-4, 2003

In the method described in Non-Patent Document 1, a message is represented by a feature vector. However, as noted above, messages on bulletin boards and the like are generally short and contain few words. Messages on the same topic therefore often share few words, so their feature vectors frequently end up far apart.

Conversely, messages that belong to different topics often share many words, so their feature vectors frequently end up close together. For these reasons, clustering that places messages on the same topic in the same cluster cannot be performed properly.

An object of the present invention is to provide a topic scope extraction device that can appropriately extract, from a message sequence in which a plurality of topics are mixed, a set of topic scopes, each of which is a set of messages forming a semantic unit.

The invention described in claim 1 is a topic scope extraction device that extracts, from a message sequence in which a plurality of topics are mixed, a set of topic scopes, each being a set of messages forming a semantic unit, the device comprising: message morphological analysis means for dividing each message into words by morphological analysis; a concept base, which is storage means storing pairs of a word and a vector expressing the meaning of that word; in-message word vector acquisition means for acquiring, by searching the concept base, the vector corresponding to each word obtained by the message morphological analysis means; intra-message vector set distance calculation means for calculating, for an arbitrary pair of messages, the intra-message vector set distance between the messages of the pair on the basis of the set of word vectors in each message of the pair; reference distance calculation means for calculating, for an arbitrary pair of messages, the reference distance between the messages of the pair on the basis of the reference relationship between them; topic segmentation means for dividing the message sequence into a set of topic segments, each being a section on a single topic, on the basis of the sequence of word vectors obtained by the in-message word vector acquisition means; intra-topic-segment vector set distance calculation means for calculating, for an arbitrary pair of topic segments, the intra-topic-segment vector set distance between the topic segments of the pair on the basis of the set of word vectors in each topic segment of the pair; inter-message distance calculation means for calculating, for an arbitrary pair of messages, the final distance between the messages of the pair on the basis of the intra-message vector set distance between the messages obtained by the intra-message vector set distance calculation means, the reference distance between the messages obtained by the reference distance calculation means, and the intra-topic-segment vector set distance, obtained by the intra-topic-segment vector set distance calculation means, between the topic segments to which the messages of the pair belong; and fuzzy clustering means for fuzzy-clustering the message sequence on the basis of the distances between arbitrary messages obtained by the inter-message distance calculation means and deriving, for each resulting cluster, the set of messages whose degree of membership in that cluster is equal to or greater than a threshold as a topic scope.

The invention described in claim 2 is the topic scope extraction device according to claim 1, further comprising: topic scope vector information generation means for generating information on the set of word vectors in each topic scope obtained by the fuzzy clustering means; input sentence morphological analysis means for dividing an input sentence into words by morphological analysis; input sentence word vector acquisition means for acquiring, by searching the concept base, the vector corresponding to each word obtained by the input sentence morphological analysis means; input sentence vector information generation means for generating information on the set of word vectors in the input sentence obtained by the input sentence word vector acquisition means; and input sentence/topic scope distance calculation means for calculating the distance between the input sentence and each topic scope by computing the distance between the information on the set of word vectors in the input sentence obtained by the input sentence vector information generation means and the information on the set of word vectors in each topic scope obtained by the topic scope vector information generation means.

According to the present invention, a set of topic scopes, each of which is a set of messages forming a semantic unit, can be appropriately extracted from a message sequence in which a plurality of topics are mixed.

The best mode for carrying out the invention is described in the following embodiments.

FIG. 1 is a block diagram showing a topic scope extraction device 100 according to Embodiment 1 of the present invention.

The topic scope extraction device 100 extracts, from a message sequence in which a plurality of topics are mixed, a set of topic scopes, each of which is a set of messages forming a semantic unit. It comprises message morphological analysis means 11, a concept base 12, in-message word vector acquisition means 13, intra-message vector set distance calculation means 14, reference distance calculation means 15, topic segmentation means 16, intra-topic-segment vector set distance calculation means 17, inter-message distance calculation means 18, and fuzzy clustering means 19.

FIG. 2 is a diagram showing an example of an input message sequence in Embodiment 1.

A message is a unit of posting, and the message with number i is denoted M_i. In FIG. 2, the ">>4" in the content of message M_9 indicates a link to message M_4; that is, M_9 refers to M_4.

The set of topic scopes that should be extracted from this message sequence is C_1 = {M_1, M_2, M_3, M_4, M_9, M_10} and C_2 = {M_1, M_5, M_6, M_7, M_8}; M_1 belongs to two topic scopes.

The message morphological analysis means 11 morphologically analyzes each message and divides it into words. Of the resulting words, only content words are kept, by referring to part-of-speech information and the like. However, links to other messages, such as the ">>4" in M_9, are left as they are. After this division into words, each message is as shown in FIG. 3.
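The filtering step described above can be sketched as follows. This is a minimal illustration, not the device's implementation: it assumes a morphological analyzer has already produced (surface, part-of-speech) pairs, and the tag names in `CONTENT_POS` as well as the function name `keep_content_words` are hypothetical.

```python
import re

# Hypothetical part-of-speech tags treated as content words (an assumption;
# a real morphological analyzer has its own tag set).
CONTENT_POS = {"noun", "verb", "adjective"}

# Links to other messages look like ">>4" in the example of FIG. 2.
LINK_PATTERN = re.compile(r">>\d+")


def keep_content_words(tokens):
    """Keep content words and ">>N" links from (surface, pos) pairs."""
    kept = []
    for surface, pos in tokens:
        if LINK_PATTERN.fullmatch(surface) or pos in CONTENT_POS:
            kept.append(surface)
    return kept


tokens = [(">>4", "symbol"), ("the", "determiner"), ("camera", "noun"),
          ("is", "auxiliary"), ("cheap", "adjective")]
print(keep_content_words(tokens))
```

Keeping the link token in the word list lets a later step recover the reference relationship between messages without re-reading the raw text.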

FIG. 3 is a diagram showing the result of morphological analysis of the messages M shown in FIG. 2.

FIG. 4 is a diagram showing an example of the concept base 12 in Embodiment 1.

The concept base 12 is stored in storage means such as a hard disk, and an f-dimensional vector is assigned to each word. The words in the concept base 12 are independent words such as nouns, verbs, and adjectives. The word vectors in the concept base 12 are set so that the more semantically similar two words are, the closer their vectors, and the less semantically similar they are, the farther apart their vectors.

Examples of concept bases are the databases disclosed in Japanese Patent Laid-Open No. 6-103315, "Similarity Discriminating Device", and Japanese Patent Laid-Open No. 7-302265, "Similarity Discriminating Data Refinement Method and Device for Implementing this Method".

In the paper by Deerwester et al. (S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, pp. 391-407, 1990.), a word-document co-occurrence matrix recording the frequency of each word in each document is converted by singular value decomposition into a matrix of reduced dimensionality. This converted matrix is also an example of a concept base.

In the paper by Schutze (H. Schutze, Dimensions of Meaning, Proc. of Supercomputing '92, pp. 786-796, 1992.), a word-word co-occurrence matrix recording the co-occurrence frequency between words in a corpus is converted by singular value decomposition into a matrix of reduced dimensionality. This converted matrix is also an example of a concept base.

The in-message word vector acquisition means 13 acquires the vector corresponding to each word obtained by the message morphological analysis means 11 by searching the concept base 12.
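As a rough illustration of the lookup, the concept base can be modeled as a mapping from words to vectors. The toy vectors, the dimension f = 3, and the function name `word_vectors` below are hypothetical, and skipping words that are absent from the concept base (such as link tokens) is an assumption of this sketch rather than something the text specifies.

```python
# Hypothetical toy concept base: word -> f-dimensional vector (f = 3 here).
# Semantically similar words ("camera", "lens") are given nearby vectors.
CONCEPT_BASE = {
    "camera": (0.9, 0.1, 0.0),
    "lens": (0.8, 0.2, 0.1),
    "ramen": (0.0, 0.1, 0.9),
}


def word_vectors(words, concept_base=CONCEPT_BASE):
    """Return the vectors of the words found in the concept base, in order.

    Words without an entry (for example a ">>4" link) are skipped; this
    handling is an assumption of the sketch.
    """
    return [concept_base[w] for w in words if w in concept_base]


print(word_vectors([">>4", "camera", "ramen"]))
```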

The intra-message vector set distance calculation means 14 calculates, for an arbitrary pair of messages, the intra-message vector set distance between the messages of the pair on the basis of the set of word vectors in each message of the pair.

Let the sequence of word vectors in the input text (vectors of words at different positions of occurrence are treated as distinct even if their values are equal) be

\[ X = \{x_1, x_2, \ldots, x_{|X|}\}. \]

The set of word vectors in each message is a set of consecutive elements of X. Hereinafter, M_i also denotes the set of word vectors in message M_i.

The centroid D(M_i) of message M_i is computed as

\[ D(M_i) = \frac{1}{|M_i|} \sum_{x \in M_i} x, \]

where |M_i| is the number of elements in M_i.

The cost E(M_i) of message M_i is defined as

\[ E(M_i) = \sum_{x \in M_i} \| x - D(M_i) \|^2. \]

The intra-message vector set distance ΔE(M_i, M_j) between arbitrary messages M_i and M_j is defined as the increase in cost when M_i and M_j are merged. ΔE(M_i, M_j) is given by

\[ \Delta E(M_i, M_j) = \frac{|M_i|\,|M_j|}{|M_i| + |M_j|} \, \| D(M_i) - D(M_j) \|^2. \qquad (1) \]

However, for a message containing one word or fewer, the intra-message vector set distance to other messages is not calculated.
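The merge-cost distance can be sketched in a few lines. The sketch assumes that the cost of a vector set is the sum of squared distances from its centroid; under that assumption the cost increase on merging equals the closed form |A||B| / (|A| + |B|) times the squared distance between the centroids, which the example checks numerically. All function names are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def cost(vectors):
    """Assumed cost: sum of squared distances from the centroid."""
    c = centroid(vectors)
    return sum(sum((a - b) ** 2 for a, b in zip(v, c)) for v in vectors)


def merge_distance(mi, mj):
    """Cost increase when two vector sets are merged.

    Returns None when either set has one vector or fewer, mirroring the
    rule that such messages get no distance to other messages.
    """
    if len(mi) <= 1 or len(mj) <= 1:
        return None
    return cost(mi + mj) - cost(mi) - cost(mj)


def closed_form(mi, mj):
    """Closed form of the same increase, expressed with the two centroids."""
    di, dj = centroid(mi), centroid(mj)
    squared = sum((a - b) ** 2 for a, b in zip(di, dj))
    return len(mi) * len(mj) / (len(mi) + len(mj)) * squared


mi = [[0.0, 0.0], [2.0, 0.0]]
mj = [[0.0, 4.0], [2.0, 4.0]]
print(merge_distance(mi, mj), closed_form(mi, mj))
```

Two single-word messages with the same word would get a merge cost of zero, which is one way to see why the text excludes such messages from this distance.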

By the above equation (1), the intra-message vector set distances between arbitrary messages are as shown in FIG. 5.

FIG. 5 is a diagram showing the intra-message vector set distances between the messages M shown in FIG. 2.

The distance between M_i and M_j (i < j) is entered only in the component in row i, column j.

For any message, the distance from itself is 0.0.

For M_3 and M_6, which each contain one word, distances to other messages are not calculated. M_3 and M_6 have identical word sets, but because their distance is not set to 0.0, M_3 and M_6 are prevented from being assigned to the same topic.

The distance between M_1 and M_2, which responds to its content, is 0.2, and the distance between M_1 and M_5, which also responds to its content, is likewise 0.2. The distance between M_9 and M_10 is 0.3. The distance between all other pairs of distinct messages is 0.5.

The reference distance calculation means 15 calculates, for an arbitrary pair of messages, the reference distance between the messages of the pair on the basis of the reference relationship between them.

For two messages M_i and M_j (i < j), when M_j refers to M_i, the reference distance between M_i and M_j is set to 0.0. Even if M_i refers to M_j, a message does not necessarily share a topic with a message that appears later in time, so that alone does not make the reference distance between M_i and M_j 0.0.

As a result, the reference distances between arbitrary messages are as shown in FIG. 6.

FIG. 6 is a diagram showing the reference distances between the messages M shown in FIG. 2.

Since M_9 refers to M_4, the reference distance between M_4 and M_9 is 0.0. Reference distances between other messages are not calculated.
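The reference-distance rule can be illustrated with a short sketch. It assumes links are written as ">>N" in the message text, as in FIG. 2; the message contents, the dictionary layout, and the function name `reference_distances` are hypothetical.

```python
import re

# Links to other messages are assumed to be written as ">>N".
LINK = re.compile(r">>(\d+)")


def reference_distances(messages):
    """Map (i, j) with i < j to 0.0 when the later message M_j links to M_i.

    `messages` maps message numbers to text. A link from an earlier message
    to a later one does not produce an entry, following the rule above.
    """
    distances = {}
    for j, text in messages.items():
        for match in LINK.finditer(text):
            i = int(match.group(1))
            if i in messages and i < j:
                distances[(i, j)] = 0.0
    return distances


messages = {
    3: ">>9 see below",
    4: "Which lens should I buy?",
    9: ">>4 The 50mm one is a good start.",
    10: "Thanks!",
}
print(reference_distances(messages))
```

Pairs without an entry correspond to reference distances that are "not calculated" in the text.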

The topic segmentation means 16 divides the message sequence into a set of topic segments, each being a section on a single topic, on the basis of the sequence of word vectors obtained by the in-message word vector acquisition means 13.

Topic segmentation methods are disclosed in Japanese Patent Laid-Open No. 2002-342324 and Japanese Patent Laid-Open No. 2004-234512.

The topic segmentation method described in Japanese Patent Laid-Open No. 2002-342324 takes, before and after an arbitrary word boundary, a word string consisting of a predetermined number of words, calculates for each word string the centroid of the vectors of the words that make it up, uses the cosine measure between the centroids of the preceding and following word strings as the cohesion of that word boundary, and recognizes word boundaries at which the cohesion is a local minimum as boundaries between topic sections.
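The cohesion-based idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of the cited publication in full: the window size, the restriction to boundaries with full windows on both sides, and the strict local-minimum test are choices of this sketch, and all names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def cosine(a, b):
    """Cosine measure between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cohesion_scores(X, window):
    """Cohesion at each boundary b (between X[b-1] and X[b]) that has a
    full window of `window` word vectors on both sides."""
    return {
        b: cosine(centroid(X[b - window:b]), centroid(X[b:b + window]))
        for b in range(window, len(X) - window + 1)
    }


def topic_boundaries(X, window):
    """Boundaries whose cohesion is strictly lower than both neighbors."""
    scores = cohesion_scores(X, window)
    bs = sorted(scores)
    return [b for prev, b, nxt in zip(bs, bs[1:], bs[2:])
            if scores[b] < scores[prev] and scores[b] < scores[nxt]]


# Four word vectors on one topic followed by four on another.
X = [(1.0, 0.0)] * 4 + [(0.0, 1.0)] * 4
print(topic_boundaries(X, window=2))
```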

The topic segmentation method described in Japanese Patent Laid-Open No. 2004-234512 defines the cost of an arbitrary section as the sum of the squared distances between the centroid of the word vectors in the section and each of those word vectors, defines the cost of an arbitrary sequence of sections as the sum of the costs of the sections it contains, and recognizes the sequence of sections with the minimum cost under certain conditions as the sequence of topic sections.
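A cost-minimizing segmentation of this kind can be sketched with dynamic programming. The text does not state the "certain conditions", so this sketch assumes one simple condition, a fixed number of segments k; that choice, like the function names, is hypothetical and may differ from the cited publication.

```python
def segment_cost(X, start, end):
    """Sum of squared distances from the centroid for the section X[start:end]."""
    seg = X[start:end]
    n, dim = len(seg), len(seg[0])
    c = [sum(v[k] for v in seg) / n for k in range(dim)]
    return sum(sum((a - b) ** 2 for a, b in zip(v, c)) for v in seg)


def min_cost_segmentation(X, k):
    """Split X into exactly k consecutive sections with minimum total cost.

    Returns (total_cost, [(start, end), ...]) with half-open index ranges.
    Fixing the number of sections is an assumption of this sketch.
    """
    n = len(X)
    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for s in range(1, k + 1):
        for end in range(s, n + 1):
            for start in range(s - 1, end):
                if best[s - 1][start] == inf:
                    continue
                total = best[s - 1][start] + segment_cost(X, start, end)
                if total < best[s][end]:
                    best[s][end] = total
                    back[s][end] = start
    # Walk the back-pointers from the end to recover the sections.
    bounds, end = [], n
    for s in range(k, 0, -1):
        start = back[s][end]
        bounds.append((start, end))
        end = start
    return best[k][n], bounds[::-1]


X = [(0.0,), (0.1,), (5.0,), (5.1,), (9.0,)]
total, sections = min_cost_segmentation(X, k=3)
print(total, sections)
```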

FIG. 7 is a diagram showing the result of topic segmentation of the message sequence shown in FIG. 2.

As shown in FIG. 7, the topic segmentation means 16 divides the message sequence into the set of topic segments T_1 = {M_1, M_2, M_3, M_4}, T_2 = {M_5, M_6, M_7, M_8}, and T_3 = {M_9, M_10}.

Some pairs of messages within a topic segment have a distance as large as 0.5, but topic segmentation makes it possible to detect that the messages within a topic segment share the same topic.

The intra-topic-segment vector set distance calculation means 17 calculates, for an arbitrary pair of topic segments, the intra-topic-segment vector set distance between the topic segments of the pair on the basis of the set of word vectors in each topic segment of the pair.

The set of word vectors in each topic segment is a set of consecutive elements of X. Hereinafter, T_i also denotes the set of word vectors in T_i.

The intra-topic-segment vector set distance is defined as follows, in the same way as the intra-message vector set distance.

Letting D(T_i) be the centroid of topic segment T_i and |T_i| be the number of elements in T_i, the intra-topic-segment vector set distance ΔE(T_i, T_j) between arbitrary topic segments T_i and T_j is defined as

\[ \Delta E(T_i, T_j) = \frac{|T_i|\,|T_j|}{|T_i| + |T_j|} \, \| D(T_i) - D(T_j) \|^2. \qquad (2) \]

By the above equation (2), the intra-topic-segment vector set distances between arbitrary topic segments are as shown in FIG. 8.

FIG. 8 is a diagram showing the intra-topic-segment vector set distances between the topic segments T shown in FIG. 7.

The distance between T_i and T_j (i < j) is entered only in the component in row i, column j.

For any topic segment, the distance from itself is 0.0. The distance between different topic segments is 0.5.

The inter-message distance calculation means 18 takes, for an arbitrary pair of messages, the minimum of the following three values as the final distance between the messages of the pair: the intra-message vector set distance between the messages obtained by the intra-message vector set distance calculation means 14, the reference distance between the messages obtained by the reference distance calculation means 15, and the intra-topic-segment vector set distance, obtained by the intra-topic-segment vector set distance calculation means 17, between the topic segments to which the messages belong.

As a result, the distances between arbitrary messages are as shown in FIG. 9.

FIG. 9 is a diagram showing the distances between the messages M shown in FIG. 2.

The distance between any two messages in the same topic segment is 0.0. The intra-message vector set distance between M_1 and M_5 is 0.2, and the distance between the topic segments to which they belong is 0.5, so the inter-message distance is 0.2.

The intra-message vector set distance between M_4 and M_9 is 0.5, the reference distance is 0.0, and the distance between the topic segments to which they belong is 0.5, so the inter-message distance is 0.0.

The distance between all other pairs of messages belonging to different topic segments is 0.5.
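The combination rule can be written compactly. In this sketch a distance that was not calculated is represented by `None` and simply left out of the minimum; that representation and the function name `final_distance` are assumptions, chosen to be consistent with the worked values in the text.

```python
def final_distance(vector_set_distance, reference_distance, segment_distance):
    """Minimum of the distances that were actually calculated.

    Each argument is a float, or None when that distance was not calculated
    for the pair (for example, no reference between the two messages).
    """
    candidates = [d for d in (vector_set_distance, reference_distance,
                              segment_distance) if d is not None]
    return min(candidates) if candidates else None


# Worked values from the text.
print(final_distance(0.2, None, 0.5))   # M_1 and M_5
print(final_distance(0.5, 0.0, 0.5))    # M_4 and M_9
print(final_distance(None, None, 0.0))  # single-word M_3 with M_4, same segment
```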

The fuzzy clustering means 19 fuzzy-clusters the message sequence on the basis of the distances between arbitrary messages obtained by the inter-message distance calculation means 18 and derives, for each resulting cluster, the set of messages whose degree of membership in that cluster is equal to or greater than a predetermined threshold as a "topic scope".

The fuzzy clustering method used is the one described in "E. H. Ruspini, A new approach to clustering, Information and Control, Vol. 15, pp. 22-32, 1969." Ruspini's method is described below.

Suppose that a set of messages {M_1, M_2, ..., M_n} and the distance d(M_i, M_j) between arbitrary messages M_i and M_j are given. Let the set of fuzzy clusters of messages be {C_1, C_2, ..., C_r}. Each component u_pi of U = (u_pi) (1 ≤ p ≤ r, 1 ≤ i ≤ n) denotes the degree of membership of M_i in C_p, and U = (u_pi) satisfies the conditions

\[ u_{pi} \ge 0 \quad (1 \le p \le r,\ 1 \le i \le n), \qquad \sum_{p=1}^{r} u_{pi} = 1 \quad (1 \le i \le n). \]

Consider the evaluation function

\[ J_R(U, \omega) = \sum_{i=1}^{n} \sum_{j=1}^{n} \Bigl( \sum_{p=1}^{r} \omega \, (u_{pi} - u_{pj})^2 - d(M_i, M_j)^2 \Bigr)^2. \]

The weight ω is also a real-valued variable.

Ruspini's method finds the U and ω that minimize the evaluation function J_R(U, ω), that is, it solves

\[ \min_{U,\,\omega} J_R(U, \omega). \]

Each component u_pi of the resulting U = (u_pi) is the degree of membership of M_i in C_p.
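To make the minimization target concrete, the sketch below evaluates one commonly quoted form of Ruspini's objective, the sum over all message pairs of (ω Σ_p (u_pi − u_pj)² − d(M_i, M_j)²)². That form is an assumption of this sketch, as are the names; the sketch only evaluates the function and does not perform the minimization over U and ω.

```python
def ruspini_objective(U, omega, d):
    """Evaluate the assumed objective for memberships U, weight omega, distances d.

    U[p][i] is the membership of message i in cluster p; d[i][j] is the
    distance between messages i and j (0-based indices).
    """
    r, n = len(U), len(U[0])
    total = 0.0
    for i in range(n):
        for j in range(n):
            spread = sum((U[p][i] - U[p][j]) ** 2 for p in range(r))
            total += (omega * spread - d[i][j] ** 2) ** 2
    return total


# Two messages at distance 1.0, each fully in its own cluster.
U = [[1.0, 0.0],
     [0.0, 1.0]]
d = [[0.0, 1.0],
     [1.0, 0.0]]
print(ruspini_objective(U, 0.5, d), ruspini_objective(U, 1.0, d))
```

In this toy case ω = 0.5 reproduces the distances exactly, so the objective is zero, while ω = 1.0 leaves a residual for each ordered pair.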

A predetermined membership threshold α is chosen, and for each fuzzy cluster C_p the message set {M_i : u_pi ≥ α} is taken as the corresponding topic scope.
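The thresholding step can be sketched as follows. The membership values and the threshold in the example are made up for illustration, and the function name `topic_scopes` is hypothetical.

```python
def topic_scopes(U, alpha):
    """For each cluster p, collect the message numbers i with U[p][i-1] >= alpha.

    U[p][i] is the membership of message number i + 1 in cluster p + 1.
    """
    return [{i + 1 for i, u in enumerate(row) if u >= alpha} for row in U]


# Illustrative memberships for 4 messages in 2 clusters; each column sums to 1.
U = [[0.5, 0.9, 0.1, 0.8],
     [0.5, 0.1, 0.9, 0.2]]
print(topic_scopes(U, alpha=0.4))
```

Because a message is kept in every cluster where its membership reaches α, a message such as the first one here can belong to more than one topic scope.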

Taking the distances between arbitrary messages shown in FIG. 9 as input and executing the above fuzzy clustering yields the set of topic scopes C_1 = {M_1, M_2, M_3, M_4, M_9, M_10} and C_2 = {M_1, M_5, M_6, M_7, M_8}.

Because the inter-message distance between M_1 and M_5 is relatively small at 0.2, M_1 also belongs to the same topic scope as the messages in topic segment T_2.

In addition, because the inter-message distance between M_4 and M_9 is 0.0, the messages in T_1 and T_3 belong to the same topic scope even though the distance between topic segments T_1 and T_3 was as large as 0.5.

図１０は、実施例１の動作を示すフローチャートである。 FIG. 10 is a flowchart illustrating the operation of the first embodiment.

まず、Ｓ１では、メッセージ形態素解析手段１１が、各メッセージを形態素解析することによって、単語単位に分割し、記憶装置に記憶する。 First, in S1, the message morphological analysis unit 11 divides each message into words by morphological analysis and stores them in the storage device.

なお、本明細書および請求の範囲において、「記憶装置」は、メモリに限らず、ハードディスク等の外部記憶装置を含む概念である。 In the present specification and claims, the “storage device” is a concept including not only a memory but also an external storage device such as a hard disk.

Ｓ２で、メッセージ内単語ベクトル取得手段１３が、概念ベース１２を検索することによって、メッセージ形態素解析手段１１で得られた各単語に対応するベクトルを取得し、記憶装置に記憶する。 In S2, the message word vector acquisition unit 13 searches the concept base 12 to acquire a vector corresponding to each word obtained by the message morpheme analysis unit 11, and stores it in the storage device.

Ｓ３で、メッセージ内ベクトル集合間距離算出手段１４が、任意のメッセージ対に対し、上記対の各メッセージ内の単語ベクトル集合を基に、上記対のメッセージ間のメッセージ内ベクトル集合間距離を算出し、記憶装置に記憶する。 In S3, the inter-message vector set distance calculating means 14 calculates the inter-message vector set distance between the paired messages based on the word vector set in each paired message for an arbitrary message pair. And store it in the storage device.

Ｓ４で、参照距離算出手段１５が、任意のメッセージ対に対し、上記対のメッセージ間の参照関係を基に、上記対のメッセージ間の参照距離を算出し、記憶装置に記憶する。 In S4, the reference distance calculation means 15 calculates the reference distance between the pair of messages for an arbitrary message pair based on the reference relationship between the pair of messages, and stores it in the storage device.

Ｓ５で、トピックセグメンテーション手段１６が、メッセージ内単語ベクトル取得手段１３で得られた単語ベクトルの系列から、メッセージ列を同一話題の区間であるトピックセグメントの集合へ分割し、記憶装置に記憶する。 In S5, the topic segmentation means 16 divides the message string from the word vector series obtained by the in-message word vector acquisition means 13 into a set of topic segments that are sections of the same topic, and stores them in the storage device.

Ｓ６で、トピックセグメント内ベクトル集合間距離算出手段１７が、任意のトピックセグメント対に対し、上記対の各トピックセグメント内の単語ベクトル集合を基に、上記対のトピックセグメント間のトピックセグメント内ベクトル集合間距離を算出し、記憶装置に記憶する。 In S6, the inter-topic-segment vector-set distance calculating means 17 calculates, for any topic segment pair, the intra-topic-segment vector set between the paired topic segments based on the word vector set in each pair of topic segments. The distance is calculated and stored in the storage device.

Ｓ７で、メッセージ間距離算出手段１８が、任意のメッセージ対に対し、メッセージ内ベクトル集合間距離算出手段１４で得られた上記対のメッセージ間のメッセージ内ベクトル集合間距離と、参照距離算出手段１５で得られた上記対のメッセージ間の参照距離と、トピックセグメント内ベクトル集合間距離算出手段１７で得られた上記対の各メッセージが属するトピックセグメント間のトピックセグメント内ベクトル集合間距離を基に、上記対のメッセージ間の最終的な距離を算出し、記憶装置に記憶する。 In S7, the inter-message distance calculating means 18 for any message pair, the inter-message vector set distance between the pair of messages obtained by the intra-message vector set distance calculating means 14 and the reference distance calculating means 15 Based on the reference distance between the pair of messages obtained in the above and the distance between topic segment vector sets between the topic segments to which each message of the pair belongs obtained by the distance calculation unit 17 within the topic segment vector set, The final distance between the pair of messages is calculated and stored in the storage device.

In S8, the fuzzy clustering means 19 performs fuzzy clustering on the message sequence on the basis of the distances between messages obtained by the inter-message distance calculation means 18. For each resulting cluster, it derives the set of messages whose degree of membership in that cluster is at or above a threshold as a topic scope, and stores it in the storage device.
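The thresholding part of S8 can be illustrated with a minimal Python sketch. It assumes the fuzzy clustering has already produced a membership table; the function name `derive_topic_scopes`, the membership values, and the threshold 0.4 are hypothetical and do not come from the patent. The sketch only shows how one message can end up in more than one topic scope.

```python
def derive_topic_scopes(membership, threshold):
    """Return one topic scope (a set of message indices) per cluster.

    membership[k][i] is the degree to which message i belongs to cluster k,
    as produced by some fuzzy clustering step (not implemented here).
    """
    scopes = []
    for row in membership:
        # Keep every message whose membership degree reaches the threshold.
        scopes.append({i for i, degree in enumerate(row) if degree >= threshold})
    return scopes


# Hypothetical membership degrees for four messages and two clusters.
membership = [
    [0.9, 0.8, 0.4, 0.1],
    [0.1, 0.2, 0.6, 0.9],
]
scopes = derive_topic_scopes(membership, threshold=0.4)
print(scopes)  # message 2 appears in both scopes
```

Because membership is a degree rather than a hard assignment, a message whose degree reaches the threshold in two clusters is placed in both topic scopes.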

FIG. 11 is a block diagram showing the configuration of a topic scope extraction device 200 according to Embodiment 2 of the present invention.

The topic scope extraction device 200 is the topic scope extraction device 100 with the following added: topic-scope vector information generation means 20, input sentence morphological analysis means 21, input sentence word vector acquisition means 22, input sentence vector information generation means 23, and input sentence/topic scope distance calculation means 24.

That is, the topic scope extraction device 200 has message morphological analysis means 11, a concept base 12, intra-message word vector acquisition means 13, intra-message vector set distance calculation means 14, reference distance calculation means 15, topic segmentation means 16, intra-topic-segment vector set distance calculation means 17, inter-message distance calculation means 18, fuzzy clustering means 19, topic-scope vector information generation means 20, input sentence morphological analysis means 21, input sentence word vector acquisition means 22, input sentence vector information generation means 23, and input sentence/topic scope distance calculation means 24.

The topic-scope vector information generation means 20 generates information on the word vector set in each topic scope obtained by the fuzzy clustering means 19.

The word vector set in a topic scope C_p is the union of the word vector sets contained in the messages in C_p.

Hereafter, C_p also denotes the word vector set in C_p. For each topic scope C_p, the centroid D(C_p) of C_p and the number of elements |C_p| in C_p are calculated and retained. The centroid D(C_p) and the number of elements |C_p| constitute the information on the word vector set in C_p.
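As an illustration, the information retained for a word vector set can be sketched in Python as follows. The sketch assumes that the centroid is the component-wise arithmetic mean of the vectors, and it receives the vectors of a scope as a plain list; how the list is assembled from the messages is left out. The function name `vector_set_info` and the sample vectors are hypothetical.

```python
def vector_set_info(vectors):
    """Return (centroid, element_count) for a non-empty list of word vectors.

    Assumption: the centroid is the component-wise arithmetic mean.
    """
    count = len(vectors)
    dim = len(vectors[0])
    centroid = [sum(v[j] for v in vectors) / count for j in range(dim)]
    return centroid, count


# Two hypothetical 2-dimensional word vectors belonging to one topic scope.
centroid, count = vector_set_info([[1.0, 0.0], [3.0, 2.0]])
print(centroid, count)
```

Only the centroid and the element count need to be kept per topic scope, so the full word vector set does not have to be revisited when an input sentence arrives.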

The input sentence morphological analysis means 21 performs morphological analysis on the input sentence and divides it into words. Of the resulting words, only content words are retained, by referring to part-of-speech information and the like.

For example, the input sentence Q: 「おいしいラーメン店」 ("delicious ramen shop") becomes 「おいしい／ラーメン／店」 ("delicious / ramen / shop").

The input sentence word vector acquisition means 22 acquires, by searching the concept base 12, the vector corresponding to each word obtained by the input sentence morphological analysis means 21.

Hereafter, Q also denotes the word vector set in Q.

The input sentence vector information generation means 23 generates information on the word vector set in the input sentence obtained by the input sentence word vector acquisition means 22. For the input sentence Q, the centroid D(Q) of Q and the number of elements |Q| in Q are calculated and retained. The centroid D(Q) and the number of elements |Q| constitute the information on the word vector set in the input sentence Q.

The input sentence/topic scope distance calculation means 24 calculates the distance between the input sentence and each topic scope by performing a distance calculation between the information on the word vector set in the input sentence obtained by the input sentence vector information generation means 23 and the information on the word vector set in each topic scope obtained by the topic-scope vector information generation means 20.

The distance d(Q, C_p) between the input sentence Q and the topic scope C_p is calculated from this information (the formula itself is not reproduced in this text).

For the set of topic scopes C_1 = {M_1, M_2, M_3, M_4, M_9, M_10} and C_2 = {M_1, M_5, M_6, M_7, M_8}, sorting the topic scopes by their distance from the input sentence Q gives the order C_2, C_1. The topic scopes can be presented in this order for the input sentence Q. Presentation can also be limited to topic scopes whose distance is at or below a predetermined threshold, in which case only topic scope C_2 is presented.
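The ranking and threshold filtering described here can be sketched in Python as follows. The distance values 0.8 and 0.3 and the threshold 0.5 are hypothetical, chosen only so that the result matches the order C_2, C_1 given in the text; the function name `rank_topic_scopes` is likewise an assumption.

```python
def rank_topic_scopes(distances, max_distance=None):
    """Sort topic scope names by distance from the input sentence.

    If max_distance is given, only scopes at or below it are returned.
    """
    ranked = sorted(distances.items(), key=lambda item: item[1])
    if max_distance is not None:
        ranked = [(name, d) for name, d in ranked if d <= max_distance]
    return [name for name, _ in ranked]


# Hypothetical values of d(Q, C_p) for the two topic scopes in the example.
distances = {"C_1": 0.8, "C_2": 0.3}
all_scopes = rank_topic_scopes(distances)
near_scopes = rank_topic_scopes(distances, max_distance=0.5)
print(all_scopes, near_scopes)
```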

Note that the intra-message vector set distance calculation means 14 is configured not to calculate the intra-message vector set distance to other messages for a message whose word count is at or below a certain number (hereinafter, a small message). However, a message in a different topic segment whose word count is at or above a certain number (hereinafter, a large message) may respond to the content of a small message without any reference relationship.

For example, a small message that is a call such as "Let's talk" may be answered by a large message in a different topic segment.

Therefore, only when the content of a small message is of a kind that prompts a new topic, the intra-message vector set distance between the small message and a large message that comes after it may be calculated. This calculation allows the two messages to belong to the same topic.

FIGS. 12 and 13 show examples of messages that have links to multiple other messages in Embodiment 2.

When one message M contains references to a set N of two or more messages, as shown in FIGS. 12 and 13, the reference distance calculation means 15 may set the reference distance between M and every element of N to a sufficiently small value, for example 0.0.

In the example shown in FIG. 12, the reference distance between M_9 and each of M_4, M_5, and M_6 is 0.0. In the example shown in FIG. 13, the reference distance between M_13 and each of the messages M_8 and M_10 is 0.0.
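A minimal Python sketch of this rule follows. It assumes that links are written as ">>" followed by a message number, that unlinked pairs receive a hypothetical default distance of 1.0, and that the message texts are invented stand-ins for FIG. 12, which is not reproduced here.

```python
import re

LINK = re.compile(r">>(\d+)")


def reference_distances(messages, linked=0.0, unlinked=1.0):
    """Map each message pair (a, b) with a < b to a reference distance.

    messages maps a message number to its text. The unlinked default of 1.0
    is an assumption made for this sketch.
    """
    ids = sorted(messages)
    dist = {(a, b): unlinked for a in ids for b in ids if a < b}
    for number, text in messages.items():
        # A message may link to several targets; each pair gets the small distance.
        for target in {int(n) for n in LINK.findall(text)}:
            if target in messages and target != number:
                dist[(min(number, target), max(number, target))] = linked
    return dist


# Hypothetical texts: message 9 links to messages 4, 5 and 6.
messages = {4: "first", 5: "second", 6: "third", 9: ">>4 >>5 >>6 I agree"}
dist = reference_distances(messages)
print(dist[(4, 9)], dist[(5, 9)], dist[(6, 9)], dist[(4, 5)])
```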

FIG. 14 shows the result of splitting the message shown in FIG. 13, which has links to multiple other messages.

In particular, a message such as that in FIG. 13, in which a separate statement is written for each link it references, can be split per link as shown in FIG. 14, using format information such as blank lines in the message. This splitting may be applied to the input message sequence in advance, before the sequence is input to the topic scope extraction device 200. As in the example of FIG. 14, if a given message contains a link to message M_13 such as ">>13", that message is taken to reference both M_13.1 and M_13.2.
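The blank-line splitting can be sketched in Python as below. The sketch assumes that each blank-line-separated block carries its own ">>" link and that sub-messages are numbered by appending ".1", ".2", and so on; the message text is a hypothetical stand-in for FIG. 13, and the function name `split_by_blank_lines` is an assumption.

```python
import re


def split_by_blank_lines(message_id, text):
    """Split a message at blank lines into (sub_id, links, body) tuples."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    parts = []
    for k, block in enumerate(blocks, start=1):
        # Collect the message numbers linked from this block only.
        links = [int(n) for n in re.findall(r">>(\d+)", block)]
        parts.append((f"{message_id}.{k}", links, block))
    return parts


# Hypothetical message 13 that answers messages 8 and 10 in separate blocks.
text = ">>8\nI agree about the noodles.\n\n>>10\nThe soup was too salty."
parts = split_by_blank_lines(13, text)
for sub_id, links, _ in parts:
    print(sub_id, links)
```

After such a split, each sub-message has a single link target, so its reference distance and word vectors relate to one conversation thread rather than several.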

Note that, regardless of whether link information is present, when a single message contains multiple distinct contents, the message may be split and the resulting messages input to the topic scope extraction device 200. The methods described in JP 2002-342324 A and JP 2004-234512 A can be used to split messages.

Alternatively, instead of splitting each message individually, the entire text of the message sequence may be segmented, and the set of intersections between each resulting topic segment and each message may be used as input. In this case, the topic segmentation means 16 does not perform topic segmentation; the topic segment set already obtained is used in the processing from the intra-topic-segment vector set distance calculation means 17 onward.

FIG. 15 shows an example, in Embodiment 2, of a message in which the number of an earlier message is written in the poster name field.

As in FIG. 15, a message may have a poster name field, and a poster may enter the number of a message they posted earlier as the poster name of a newly posted message. When the poster name field of a message contains the number of an earlier message in this way, the two messages may share the same topic, so the reference distance calculation means 15 may set the reference distance between the two messages to a sufficiently small value, for example 0.0. In the example shown in FIG. 15, the reference distance between messages M_15 and M_19 is 0.0.
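Detecting this case can be sketched in Python as follows. The sketch assumes that the poster name field is available as a string and that a name consisting only of digits which matches the number of an earlier message counts as such a reference; the poster names other than "15" are invented.

```python
def name_field_references(posts):
    """Return (earlier_number, later_number) pairs found via poster names.

    posts is a list of (message_number, poster_name) in posting order.
    """
    seen = set()
    pairs = []
    for number, name in posts:
        name = name.strip()
        # A purely numeric name that matches an earlier message is a reference.
        if name.isdigit() and int(name) in seen:
            pairs.append((int(name), number))
        seen.add(number)
    return pairs


# Hypothetical posts: message 19 uses "15" as its poster name.
posts = [(15, "anonymous"), (16, "anonymous"), (19, "15")]
pairs = name_field_references(posts)
print(pairs)
```

Each returned pair could then be given the small reference distance, for example 0.0, in the same way as an explicit link.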

Next, the operation of Embodiment 2 will be described.

FIG. 16 is a flowchart showing the operation of Embodiment 2.

Steps S1 to S8 are the same as steps S1 to S8 in the flowchart of FIG. 10.

After S8, in S9, the topic-scope vector information generation means 20 generates information on the word vector set in each topic scope obtained by the fuzzy clustering means 19, and stores it in the storage device.

In S11, the input sentence morphological analysis means 21 performs morphological analysis on the input sentence and divides it into words.

In S12, the input sentence word vector acquisition means 22 acquires, by searching the concept base 12, the vector corresponding to each word obtained by the input sentence morphological analysis means 21, and stores it in the storage device.

In S13, the input sentence vector information generation means 23 generates information on the word vector set in the input sentence obtained by the input sentence word vector acquisition means 22, and stores it in the storage device.

In S14, the input sentence/topic scope distance calculation means 24 calculates the distance between the input sentence and each topic scope by performing a distance calculation between the information on the word vector set in the input sentence obtained by the input sentence vector information generation means 23 and the information on the word vector set in each topic scope obtained by the topic-scope vector information generation means 20, and stores it in the storage device.

Incidentally, when distances are measured between messages within a single topic segment, the distances are often large because individual messages do not contain enough words. This means that, even when distances measured message by message are large, taking the word vector set of a run of consecutive messages can bring out the difference from the adjacent run of messages. By dividing the message sequence into a set of topic segments, the topic segmentation means 16 makes it possible to detect topic identity between messages within a topic segment, which is hard to detect from message-level similarity alone.

The reference distance calculation means 15 allows the reference distance between messages in a reference relationship to be set unconditionally small.

The inter-message distance calculation means 18 takes the minimum of the intra-message vector set distance, the reference distance, and the intra-topic-segment vector set distance as the final distance between messages.
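The minimum rule can be written out as a short Python sketch. All numbers below are hypothetical: three messages, of which the first two share a topic segment and the last two are linked by a reference. The sketch also assumes that the distance of a topic segment to itself is 0.0 and that unlinked pairs have a reference distance of 1.0.

```python
def final_distances(vec, ref, seg, seg_of):
    """Combine three distances into the final inter-message distance.

    vec[i][j]: intra-message vector set distance between messages i and j
    ref[i][j]: reference distance between messages i and j
    seg[s][t]: distance between topic segments s and t
    seg_of[i]: index of the topic segment that message i belongs to
    """
    n = len(vec)
    return [
        [min(vec[i][j], ref[i][j], seg[seg_of[i]][seg_of[j]]) for j in range(n)]
        for i in range(n)
    ]


vec = [[0.0, 0.9, 0.4], [0.9, 0.0, 0.8], [0.4, 0.8, 0.0]]
ref = [[0.0, 1.0, 1.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
seg = [[0.0, 0.7], [0.7, 0.0]]
seg_of = [0, 0, 1]
final = final_distances(vec, ref, seg, seg_of)
print(final[0][1], final[0][2], final[1][2])
```

In this example, messages 0 and 1 are brought together by their shared topic segment, messages 1 and 2 by their reference relationship, and messages 0 and 2 by their word vectors, which shows how each of the three distances can be the deciding one.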

Consider what determines whether a message whose word count is at or below a certain number (hereinafter, a small message) belongs to the same topic as another message. Except for the occasional case in which a message in a different topic segment whose word count is at or above a certain number (hereinafter, a large message) responds to the content of the small message without any reference relationship, the determining factors are whether the two messages are in a reference relationship and the distance between the topic segments to which they belong. The intra-message vector set distance calculation means 14 therefore does not calculate the intra-message vector set distance between a small message and other messages. As a result, the distance between a small message and another message is determined by the reference distance and the intra-topic-segment vector set distance.
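The treatment of small messages can be sketched in Python as follows. The word-count limit of 3 is hypothetical, and the intra-message vector set distance is passed in as a callable so that it is evaluated only when both messages exceed the limit. The exception for small messages that prompt a new topic is not modeled.

```python
def final_distance(count_a, count_b, vector_set_distance, ref_dist, seg_dist, limit=3):
    """Final distance between two messages, skipping small messages.

    count_a, count_b: word counts of the two messages
    vector_set_distance: zero-argument callable returning the intra-message
        vector set distance; called only when neither message is small
    limit: hypothetical word-count limit at or below which a message is small
    """
    candidates = [ref_dist, seg_dist]
    if count_a > limit and count_b > limit:
        candidates.append(vector_set_distance())
    return min(candidates)


calls = []


def hypothetical_distance():
    calls.append(1)  # record that the message-level distance was computed
    return 0.2


small_case = final_distance(2, 10, hypothetical_distance, ref_dist=1.0, seg_dist=0.7)
large_case = final_distance(8, 10, hypothetical_distance, ref_dist=1.0, seg_dist=0.7)
print(small_case, large_case, len(calls))
```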

In contrast, the factors that determine whether two large messages belong to the same topic are whether they are in a reference relationship, the distance between the topic segments to which they belong, and the intra-message vector set distance. In particular, in response to a large message A, a topic different from the one A belonged to may be started, without any reference relationship, by a large message B in a different topic segment from A. In this case, message A should belong to multiple topics. The intra-message vector set distance between A and B is smaller than the distance between the topic segments to which A and B belong. As a result, even though A and B belong to different topic segments that are far apart, fuzzy clustering places A in the same topic as the set of messages in B's topic segment as well.

The present invention is applicable to techniques for extracting, from a message sequence in which a plurality of topics are mixed, a set of topic scopes, each being a set of messages that forms a semantic unit.

FIG. 1 is a block diagram showing a topic scope extraction device 100 according to Embodiment 1 of the present invention.
FIG. 2 shows an example of an input message sequence in Embodiment 1.
FIG. 3 shows the result of morphological analysis of the messages M shown in FIG. 2.
FIG. 4 shows an example of the concept base 12 in Embodiment 1.
FIG. 5 shows the intra-message vector set distances between the messages M shown in FIG. 2.
FIG. 6 shows the reference distances between the messages M shown in FIG. 2.
FIG. 7 shows the result of topic segmentation of the message sequence shown in FIG. 2.
FIG. 8 shows the intra-topic-segment vector set distances between the topic segments T shown in FIG. 7.
FIG. 9 shows the distances between the messages M shown in FIG. 2.
FIG. 10 is a flowchart showing the operation of Embodiment 1.
FIG. 11 is a block diagram showing the configuration of a topic scope extraction device 200 according to Embodiment 2 of the present invention.
FIG. 12 shows an example of a message with links to multiple other messages in Embodiment 2.
FIG. 13 shows an example of a message with links to multiple other messages in Embodiment 2.
FIG. 14 shows the result of splitting the message shown in FIG. 13, which has links to multiple other messages.
FIG. 15 shows an example, in Embodiment 2, of a message in which the number of an earlier message is written in the poster name field.
FIG. 16 is a flowchart showing the operation of Embodiment 2.

Claims

A topic scope extraction device that extracts, from a message sequence in which a plurality of topics are mixed, a set of topic scopes, each being a set of messages forming a semantic unit, the device comprising:
message morphological analysis means for dividing each message into words by morphological analysis;
a concept base, which is storage means storing pairs of a word and a vector expressing the meaning of the word;
intra-message word vector acquisition means for acquiring, by searching the concept base, a vector corresponding to each word obtained by the message morphological analysis means;
intra-message vector set distance calculation means for calculating, for an arbitrary message pair, an intra-message vector set distance between the messages of the pair based on the word vector set in each message of the pair;
reference distance calculation means for calculating, for an arbitrary message pair, a reference distance between the messages of the pair based on the reference relationship between the messages of the pair;
topic segmentation means for dividing the message sequence into a set of topic segments, each being a section of a single topic, from the sequence of word vectors obtained by the intra-message word vector acquisition means;
intra-topic-segment vector set distance calculation means for calculating, for an arbitrary topic segment pair, an intra-topic-segment vector set distance between the topic segments of the pair based on the word vector set in each topic segment of the pair;
inter-message distance calculation means for calculating, for an arbitrary message pair, a final distance between the messages of the pair based on the intra-message vector set distance between the messages of the pair obtained by the intra-message vector set distance calculation means, the reference distance between the messages of the pair obtained by the reference distance calculation means, and the intra-topic-segment vector set distance, obtained by the intra-topic-segment vector set distance calculation means, between the topic segments to which the messages of the pair belong; and
fuzzy clustering means for performing fuzzy clustering on the message sequence based on the distances between arbitrary messages obtained by the inter-message distance calculation means and, for each resulting cluster, deriving as a topic scope the set of messages whose degree of membership in the cluster is equal to or greater than a threshold.

The topic scope extraction device according to claim 1, further comprising:
topic-scope vector information generation means for generating information on the word vector set in each topic scope obtained by the fuzzy clustering means;
input sentence morphological analysis means for performing morphological analysis on an input sentence and dividing it into words;
input sentence word vector acquisition means for acquiring, by searching the concept base, a vector corresponding to each word obtained by the input sentence morphological analysis means;
input sentence vector information generation means for generating information on the word vector set in the input sentence obtained by the input sentence word vector acquisition means; and
input sentence/topic scope distance calculation means for calculating a distance between the input sentence and each topic scope by performing a distance calculation between the information on the word vector set in the input sentence obtained by the input sentence vector information generation means and the information on the word vector set in each topic scope obtained by the topic-scope vector information generation means.

A method for controlling a topic scope extraction device that extracts, from a message sequence in which a plurality of topics are mixed, a set of topic scopes, each being a set of messages forming a semantic unit, the method comprising:
a message morphological analysis step of dividing each message into words by morphological analysis and storing the result in a storage device;
an intra-message word vector acquisition step of acquiring, by searching a concept base, which is storage means storing pairs of a word and a vector expressing the meaning of the word, a vector corresponding to each word obtained in the message morphological analysis step, and storing the vector in a storage device;
an intra-message vector set distance calculation step of calculating, for an arbitrary message pair, an intra-message vector set distance between the messages of the pair based on the word vector set in each message of the pair, and storing the distance in a storage device;
a reference distance calculation step of calculating, for an arbitrary message pair, a reference distance between the messages of the pair based on the reference relationship between the messages of the pair, and storing the reference distance in a storage device;
a topic segmentation step of dividing the message sequence into a set of topic segments, each being a section of a single topic, from the sequence of word vectors obtained in the intra-message word vector acquisition step;
an intra-topic-segment vector set distance calculation step of calculating, for an arbitrary topic segment pair, an intra-topic-segment vector set distance between the topic segments of the pair based on the word vector set in each topic segment of the pair, and storing the distance in a storage device;
an inter-message distance calculation step of calculating, for an arbitrary message pair, a final distance between the messages of the pair based on the intra-message vector set distance between the messages of the pair obtained in the intra-message vector set distance calculation step, the reference distance between the messages of the pair obtained in the reference distance calculation step, and the intra-topic-segment vector set distance, obtained in the intra-topic-segment vector set distance calculation step, between the topic segments to which the messages of the pair belong, and storing the final distance in a storage device; and
a fuzzy clustering step of performing fuzzy clustering on the message sequence based on the distances between arbitrary messages obtained in the inter-message distance calculation step, deriving, for each resulting cluster, as a topic scope the set of messages whose degree of membership in the cluster is equal to or greater than a threshold, and storing the topic scope in a storage device.

The method for controlling a topic scope extraction device according to claim 3, further comprising:
a topic-scope vector information generation step of generating information on the word vector set in each topic scope obtained in the fuzzy clustering step, and storing the information in a storage device;
an input sentence morphological analysis step of performing morphological analysis on an input sentence and dividing it into words;
an input sentence word vector acquisition step of acquiring, by searching the concept base, a vector corresponding to each word obtained in the input sentence morphological analysis step;
an input sentence vector information generation step of generating information on the word vector set in the input sentence obtained in the input sentence word vector acquisition step, and storing the information in a storage device; and
an input sentence/topic scope distance calculation step of calculating a distance between the input sentence and each topic scope by performing a distance calculation between the information on the word vector set in the input sentence obtained in the input sentence vector information generation step and the information on the word vector set in each topic scope obtained in the topic-scope vector information generation step, and storing the distance in a storage device.

A program for extracting, from a message sequence in which a plurality of topics are mixed, a set of topic scopes, each being a set of messages forming a semantic unit, the program causing a computer to execute:
a message morphological analysis procedure of dividing each message into words by morphological analysis and storing the result in a storage device;
an intra-message word vector acquisition procedure of acquiring, by searching a concept base, which is storage means storing pairs of a word and a vector expressing the meaning of the word, a vector corresponding to each word obtained by the message morphological analysis procedure, and storing the vector in a storage device;
an intra-message vector set distance calculation procedure of calculating, for an arbitrary message pair, an intra-message vector set distance between the messages of the pair based on the word vector set in each message of the pair, and storing the distance in a storage device;
a reference distance calculation procedure of calculating, for an arbitrary message pair, a reference distance between the messages of the pair based on the reference relationship between the messages of the pair, and storing the reference distance in a storage device;
a topic segmentation procedure of dividing the message sequence into a set of topic segments, each being a section of a single topic, from the sequence of word vectors obtained by the intra-message word vector acquisition procedure, and storing the set in a storage device;
an intra-topic-segment vector set distance calculation procedure of calculating, for an arbitrary topic segment pair, an intra-topic-segment vector set distance between the topic segments of the pair based on the word vector set in each topic segment of the pair, and storing the distance in a storage device;
an inter-message distance calculation procedure of calculating, for an arbitrary message pair, a final distance between the messages of the pair based on the intra-message vector set distance between the messages of the pair obtained by the intra-message vector set distance calculation procedure, the reference distance between the messages of the pair obtained by the reference distance calculation procedure, and the intra-topic-segment vector set distance, obtained by the intra-topic-segment vector set distance calculation procedure, between the topic segments to which the messages of the pair belong, and storing the final distance in a storage device; and
a fuzzy clustering procedure of performing fuzzy clustering on the message sequence based on the distances between arbitrary messages obtained by the inter-message distance calculation procedure, deriving, for each resulting cluster, as a topic scope the set of messages whose degree of membership in the cluster is equal to or greater than a threshold, and storing the topic scope in a storage device.

The program according to claim 5, further causing the computer to execute:
a topic-scope vector information generation procedure of generating information on the word vector set in each topic scope obtained by the fuzzy clustering procedure, and storing the information in a storage device;
an input sentence morphological analysis procedure of performing morphological analysis on an input sentence and dividing it into words;
an input sentence word vector acquisition procedure of acquiring, by searching the concept base, a vector corresponding to each word obtained by the input sentence morphological analysis procedure, and storing the vector in a storage device;
an input sentence vector information generation procedure of generating information on the word vector set in the input sentence obtained by the input sentence word vector acquisition procedure, and storing the information in a storage device; and
an input sentence/topic scope distance calculation procedure of calculating a distance between the input sentence and each topic scope by performing a distance calculation between the information on the word vector set in the input sentence obtained by the input sentence vector information generation procedure and the information on the word vector set in each topic scope obtained by the topic-scope vector information generation procedure, and storing the distance in a storage device.