JP2012079121A - Micro blog text classification device, method and program

Info

Publication number: JP2012079121A
Application number: JP2010224166A
Authority: JP
Inventors: Kyosuke Nishida (西田 京介); Takashi Fujimura (藤村 考)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-10-01
Filing date: 2010-10-01
Publication date: 2012-04-19
Anticipated expiration: 2030-10-01
Also published as: JP5389764B2

Abstract

PROBLEM TO BE SOLVED: To classify an input text (a microblog text) according to whether it is related to an arbitrary designated character string given by a user.
SOLUTION: The invention determines whether the designated character string is included in the input text. It outputs four texts to a concatenated text storage means: a designated concatenated text obtained by concatenating the text group stored in a designated text storage means; the designated concatenated text followed by the input text; a comparison concatenated text obtained by concatenating the text group stored in a comparison text storage means; and the comparison concatenated text followed by the input text. Each stored concatenated text is compressed and its size after compression is calculated. A relevance score of the input text to the designated character string is calculated from these sizes, and the input text is classified as related or not related to the designated character string based on that score.

Description

The present invention relates to a microblog text classification apparatus, method, and program, and more particularly to a microblog text classification apparatus, method, and program that use the characteristics of data compression to classify whether a text is related to an arbitrary designated character string given by a user.

In recent years, microblogs (also called miniblogs), websites on which users mainly record their current situation and passing thoughts in short texts, have become widespread. Because microblogs are easy to update and highly real-time, they are an important medium that can serve as a primary source of information on the Internet.

Extracting only the texts related to the information a user wants from the large set of microblog texts posted by many users is a major technical challenge. As a conventional technique, keyword search, which extracts texts containing a specified character string (keyword), is available. For example, to look up information on the livestock infectious disease “foot-and-mouth disease”, entering “foot-and-mouth disease” as the keyword extracts the texts that contain that character string. However, keyword search cannot extract a text that is related to the desired information but does not contain the specified character string, so a search with high coverage is not possible on a medium of short texts such as microblogs.

Tag search, which extracts texts containing a specified tag, is also available. Users post texts with a short character string called a tag attached, which groups texts on the same topic (including other users' texts) and helps with searching and browsing. For example, “#kouteieki” is one tag related to foot-and-mouth disease, and using this tag makes it possible to extract texts such as
・ "I transcribed the video of yesterday's House of Councillors Committee on Agriculture, Forestry and Fisheries. → http://sample.url/blGD2H #kouteieki"
・ "Prefectural governor holds a press conference. Breeding bulls to be culled. #kouteieki"
which do not contain “foot-and-mouth disease”. However, because not all users use tags, the coverage of tag search is also insufficient.
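The coverage gap of keyword and tag search can be illustrated with a minimal sketch. The `keyword_search` helper and the three sample posts are hypothetical and chosen only for illustration; the third post is on topic but contains neither the keyword nor the tag, so both searches miss it.

```python
def keyword_search(texts, keyword):
    """Return the texts that contain `keyword` as a substring."""
    return [t for t in texts if keyword in t]

# Hypothetical sample posts (not taken from the patent's data).
posts = [
    "Foot-and-mouth disease confirmed at a second farm.",
    "Prefectural governor holds a press conference. Breeding bulls to be culled. #kouteieki",
    "Breeding bulls to be culled after the governor's announcement.",
]

by_keyword = keyword_search(posts, "Foot-and-mouth disease")
by_tag = keyword_search(posts, "#kouteieki")

# Texts found by neither search, although the last post is on topic.
missed = [p for p in posts if p not in by_keyword and p not in by_tag]
```

Here `missed` contains exactly the kind of text that the classification approach described next is meant to recover.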

Here, search coverage can be increased by classifying whether an input text is related to a specified character string (any character string, including a tag), based on the set of texts that contain that character string.

In ordinary blog text classification (see, for example, Patent Document 1 and Non-Patent Document 1), a text is split into features (morphemes) by morphological analysis, and the relationship between each feature and the classification target is then learned by a text classification algorithm (see, for example, Non-Patent Document 2). However, microblogs contain more colloquial Japanese expressions than other media, and novel words not registered in any dictionary appear daily, so the accuracy of morphological analysis is low and, as a result, classification accuracy is also low. Furthermore, when morphological analysis is used, classification accuracy drops sharply for texts written in languages other than Japanese. N-grams, which use n consecutive characters as features (n is usually chosen to be about 1 to 3), handle colloquial expressions and novel words more easily, but are known to be less accurate than features obtained by morphological analysis.
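For reference, character n-gram features of the kind mentioned above can be extracted without any dictionary. The following is a small sketch; the helper name `char_ngrams` and the sample string are illustrative assumptions, not part of the patent.

```python
def char_ngrams(text, n):
    """Return every run of n consecutive characters in `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Bigrams (n = 2) of a short Japanese phrase; no word segmentation is needed.
features = char_ngrams("口蹄疫の記者会見", 2)
```

The invention described below avoids even this feature extraction step and works on the raw concatenated text.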

In addition, because microblogs are highly real-time, users tend to post texts about what is happening now. As a result, even among texts containing the same tag, the content changes greatly over time. For example, the character string “press conference” may be strongly related to the “#kouteieki” (foot-and-mouth disease) tag at one point in time, yet one hour earlier it did not appear at all in the set of texts containing “#kouteieki” and was strongly related to a different tag. A text classifier must adapt quickly to such changes. Moreover, because a very large number of texts are posted to microblogs in a short time, the method must allow fast classification and learning.

Patent Document 1: JP 2008-310626 A

Non-Patent Document 1: T. Ohkura, Y. Kiyota, and H. Nakagawa. "Browsing system for weblog articles based on automated folksonomy." In Proc. WWW 2006 Workshop on Weblogging Ecosystem: Aggregation, Analysis, and Dynamics, 2006.
Non-Patent Document 2: M. Dredze, K. Crammer, and F. Pereira. "Confidence-Weighted Linear Classification." Proceedings of the 25th International Conference on Machine Learning, pp. 264-271, 2008.

As described above, no existing technology can classify, at high speed and with high accuracy, only the texts related to the information a user wants on microblogs, a medium to which short, colloquial, highly real-time texts are posted.

In view of the above problems of the prior art, an object of the present invention is to provide a microblog text classification apparatus, method, and program that need no morphological analysis or n-gram splitting and that achieve high separation accuracy for text written in any language while adapting quickly to changes in the tendency of the designated text set.

In order to solve the above problem, the present invention (claim 1) is a microblog text classification device that classifies whether an input text of a microblog is related to an arbitrary designated character string given by a user, the device comprising:
designated text storage means for storing designated texts that include the designated character string;
comparison text storage means for storing comparison texts, which exclude the designated texts;
concatenated text storage means for storing concatenated texts;
text analysis means for determining whether the designated character string is included in the input text;
text concatenation means for outputting, to the concatenated text storage means, a designated concatenated text obtained by concatenating the text set stored in the designated text storage means, a text obtained by concatenating the designated concatenated text and the input text, a comparison concatenated text obtained by concatenating the text set stored in the comparison text storage means, and a text obtained by concatenating the comparison concatenated text and the input text;
text compression means for compressing each of the concatenated texts stored in the concatenated text storage means and obtaining the data size after compression;
score output means for obtaining a relevance score of the input text to the designated character string based on the compressed data sizes obtained by the text compression means; and
text classification means for classifying whether the input text is related to the designated character string based on the relevance score from the score output means.

In the present invention (claim 2), the score output means includes means for obtaining the relevance score using the differences in data size after compression and a smoothing parameter that keeps the score of a short text from becoming too small.

The present invention (claim 3) is a microblog text classification method for classifying whether an input text of a microblog is related to an arbitrary designated character string given by a user, the method being performed in a device having
designated text storage means for storing designated texts that include the designated character string,
comparison text storage means for storing comparison texts, which exclude the designated texts, and
concatenated text storage means for storing concatenated texts,
and comprising:
a text analysis step in which text analysis means determines whether the designated character string is included in the input text;
a text concatenation step in which text concatenation means outputs, to the concatenated text storage means, a designated concatenated text obtained by concatenating the text set stored in the designated text storage means, a text obtained by concatenating the designated concatenated text and the input text, a comparison concatenated text obtained by concatenating the text set stored in the comparison text storage means, and a text obtained by concatenating the comparison concatenated text and the input text;
a text compression step in which text compression means compresses each of the concatenated texts stored in the concatenated text storage means and obtains the data size after compression;
a score output step in which score output means obtains a relevance score of the input text to the designated character string based on the compressed data sizes obtained in the text compression step; and
a text classification step in which text classification means classifies whether the input text is related to the designated character string based on the relevance score obtained in the score output step.

In the present invention (claim 4), in the score output step, the relevance score is obtained using the differences in data size after compression and a smoothing parameter that keeps the score of a short text from becoming too small.

The present invention (claim 5) is a microblog text classification program for causing a computer to function as each of the means constituting the microblog text classification device according to claim 1 or 2.

According to the microblog text classification device of claim 1 configured as described above, data compression is used to classify whether the input text is more strongly related to the set of designated texts, which contain the character string designated by the user, or to the set of comparison texts, which exclude the designated texts. Therefore no morphological analysis or n-gram splitting is needed, and text classification with high separation accuracy can be realized for text written in any language while adapting quickly to changes in the tendency of the designated text set.

FIG. 1 is a block diagram of the microblog text classification device according to an embodiment of the present invention.
FIG. 2 is a flowchart of the operation of the microblog text classification device according to an embodiment of the present invention.
FIG. 3 shows the data structure and an example of score output according to an embodiment of the present invention.
FIG. 4 shows an example of microblog text classification according to an embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 shows the configuration of a microblog text classification device according to an embodiment of the present invention.

The microblog text classification device shown in FIG. 1 comprises a text analysis unit 10, a text concatenation unit 20, a text compression unit 30, a score output unit 40, a text classification unit 50, a designated text storage unit 60, a comparison text storage unit 70, and a concatenated text storage unit 80. It receives a microblog text and a character string designated by the user as input and classifies whether the input text is related to the designated character string. Of these components, the designated text storage unit 60, the comparison text storage unit 70, and the concatenated text storage unit 80 are storage media such as a hard disk or memory.

The designated text storage unit 60 stores input texts that contain the designated character string.

The comparison text storage unit 70 stores input texts that do not contain the designated character string.

The concatenated text storage unit 80 stores the texts concatenated by the text concatenation unit 20.

FIG. 2 is a flowchart of the operation of the microblog text classification device according to an embodiment of the present invention.

Step 1) The text analysis unit 10 determines whether the input text (microblog text) contains the designated character string specified by the user. The designated character string may be any character string, such as a tag (a character string beginning with #), a user name, or an arbitrary keyword other than a tag or user name.

If the input text does not contain the designated character string, the process proceeds to step 2. If the input text contains the designated character string, the process proceeds to step 8.

Step 2) The text concatenation unit 20 concatenates the text set stored in the designated text storage unit 60 and outputs the designated concatenated text A. The texts are concatenated in chronological order, so that older texts come earlier in the concatenated text.

Step 3) The text concatenation unit 20 concatenates the text set stored in the comparison text storage unit 70 and outputs the comparison concatenated text B. The texts are concatenated in chronological order, so that older texts come earlier in the concatenated text.

Step 4) The text concatenation unit 20 outputs to the concatenated text storage unit 80 the text Ax, in which the input text x is appended to the designated concatenated text A, and the text Bx, in which the input text x is appended to the comparison concatenated text B.
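Steps 2 to 4 can be sketched as follows. This is a minimal illustration under stated assumptions: the `(timestamp, text)` tuple layout, the `concatenate` helper, the sample posts, and the choice to join texts with no separator are all hypothetical, since the description does not specify them.

```python
def concatenate(posts):
    """Join texts oldest first. `posts` is an iterable of (timestamp, text)."""
    ordered = sorted(posts, key=lambda p: p[0])
    return "".join(text for _, text in ordered)

# Hypothetical contents of the designated and comparison text stores.
designated_store = [
    (2, "culling begins #kouteieki"),
    (1, "governor press conference #kouteieki"),
]
comparison_store = [(1, "ramen for lunch"), (3, "sunny today")]

x = "press conference at noon"          # input text

A = concatenate(designated_store)       # step 2
B = concatenate(comparison_store)       # step 3
Ax, Bx = A + x, B + x                   # step 4
```

Placing x at the end matters for the later steps, because the compressor can then reuse strings seen earlier in A or B when encoding x.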

Step 5) The text compression unit 30 compresses each of the texts A, B, Ax, and Bx read from the concatenated text storage unit 80 and obtains their compressed sizes Z(A), Z(B), Z(Ax), and Z(Bx). The text compression unit 30 may use any known algorithm, such as deflate as described in Reference 1 (Peter Deutsch, "RFC 1951 DEFLATE Compressed Data Format Specification version 1.3", Network Working Group, Request for Comments: 1951, May 1996).
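As a hedged illustration of step 5, the compressed size Z(.) can be measured with Python's standard `zlib` module, which wraps a DEFLATE stream in a small zlib header and checksum. The function name `compressed_size`, the UTF-8 encoding, and compression level 9 are assumptions made for this sketch; the description allows any known compressor.

```python
import zlib

def compressed_size(text: str, level: int = 9) -> int:
    """Return the size in bytes of the zlib (DEFLATE) compressed UTF-8 text."""
    data = text.encode("utf-8")
    return len(zlib.compress(data, level))

# Hypothetical texts standing in for A and Ax.
A = "governor press conference #kouteieki" * 3
x = "governor press conference"
sizes = {name: compressed_size(t) for name, t in {"A": A, "Ax": A + x}.items()}
```

Only the sizes are kept; the compressed bytes themselves are not needed by the later steps.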

Step 6) The score output unit 40 uses the outputs Z(A), Z(B), Z(Ax), and Z(Bx) of the text compression unit 30 to calculate the relevance score S(x) of the input text x by the following formula.

Here, γ is a smoothing parameter used to keep the score of a short input text from becoming too small and to prevent division by zero (for example, γ = 5). If the input text x contains the designated character string, S(x) = 0.
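The formula itself is not reproduced in this text. The sketch below therefore assumes one form that is consistent with the surrounding description (a ratio of compressed-size differences, a smoothing parameter that also guards against division by zero, and smaller scores meaning stronger relevance): S(x) = (Z(Ax) - Z(A) + γ) / (Z(Bx) - Z(B) + γ). This form, the helper names, and the sample texts are assumptions for illustration and should not be read as the patent's exact definition.

```python
import zlib

def Z(text: str) -> int:
    """Compressed size in bytes of the UTF-8 text (zlib/DEFLATE, level 9)."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def relevance_score(A: str, B: str, x: str, designated: str, gamma: float = 5.0) -> float:
    """Assumed form: (Z(Ax) - Z(A) + gamma) / (Z(Bx) - Z(B) + gamma)."""
    if designated in x:
        return 0.0  # the text assigns S(x) = 0 when x contains the designated string
    cost_given_a = Z(A + x) - Z(A)   # extra bytes needed for x after A
    cost_given_b = Z(B + x) - Z(B)   # extra bytes needed for x after B
    return (cost_given_a + gamma) / (cost_given_b + gamma)

# Hypothetical concatenated texts.
A = ("The prefectural governor held a press conference about culling the breeding bulls. #kouteieki"
     "Foot-and-mouth disease confirmed at a second farm, movement restrictions imposed. #kouteieki")
B = ("Had ramen for lunch, it was delicious."
     "Sunny weather today, going for a walk in the park."
     "Watching a movie tonight with friends.")

s_related = relevance_score(
    A, B, "The prefectural governor held a press conference about culling the breeding bulls.",
    "#kouteieki")
s_unrelated = relevance_score(A, B, "Had ramen for lunch, it was delicious.", "#kouteieki")
```

Under this assumed form, a text that compresses about equally well after A and after B scores close to 1, which fits the threshold value of about 1.0 mentioned in step 8.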

FIG. 3 shows the data structure of the microblog text classification device and an example of score output. When the designated concatenated text A and the input text x share many matching character strings, so that using A as prior information makes x easier to compress, the value of Z(Ax) - Z(A) approaches zero. Conversely, when the comparison concatenated text B and the input text x are unrelated, x is hard to compress, so Z(Bx) - Z(B) becomes larger than Z(Ax) - Z(A). Classification by data compression is known from, for example, Reference 2 (Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto, "Language Trees and Zipping", Physical Review Letters, 88:4, 2002), but the present invention, by using the ratio of the differences in compressed size together with a smoothing parameter, makes it possible to classify whether a short microblog text is related to the designated character string specified by the user.

Here, data compression requires no morphological analysis, so text written in any language can be handled without loss of accuracy. The method is also very simple and allows faster classification than conventional text classifiers. Furthermore, because texts are concatenated in chronological order, when many users post texts with the same content at once, those texts compress easily, which makes high classification accuracy easier to achieve.

Step 7) If the input text was found in step 1 to contain the designated character string, the text analysis unit 10 stores the input text in the designated text storage unit 60. If the input text does not contain the designated character string, the text analysis unit 10 stores it in the comparison text storage unit 70.

The texts stored in the comparison text storage unit 70 are all texts other than those containing the designated character string. Alternatively, when a tag is given as the designated character string, the comparison texts may be the texts that contain any tag, excluding the texts that contain the designated tag. Likewise, when a user name is given as the designated character string, the comparison texts may be the texts that contain any user name, excluding the texts that contain the designated user name.

To reduce the amount of computation and to reflect the most recent trends strongly, the texts stored in the designated text storage unit 60 and the comparison text storage unit 70 may be limited to the latest N texts, to the texts from the past T hours, or to the latest texts up to a total size M.
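The three limiting policies can be sketched as simple filters over a time-ordered list. The function names, the `(timestamp_in_seconds, text)` layout, and measuring size M in UTF-8 bytes are assumptions for this illustration.

```python
def keep_latest_n(posts, n):
    """Keep the newest n posts. `posts` is a list of (timestamp, text), oldest first."""
    return posts[max(len(posts) - n, 0):]

def keep_within_hours(posts, now, hours):
    """Keep posts whose timestamp (seconds) is within the past `hours` hours."""
    cutoff = now - hours * 3600
    return [(ts, text) for ts, text in posts if ts >= cutoff]

def keep_latest_bytes(posts, max_bytes):
    """Keep the newest posts whose total UTF-8 size does not exceed max_bytes."""
    kept, total = [], 0
    for ts, text in reversed(posts):          # walk from newest to oldest
        size = len(text.encode("utf-8"))
        if total + size > max_bytes:
            break
        kept.append((ts, text))
        total += size
    kept.reverse()                            # restore oldest-first order
    return kept

# Hypothetical store contents, one post per hour.
posts = [(0, "aaaa"), (3600, "bbbb"), (7200, "cccc")]
```

Any of these filters would be applied to a store before concatenation in steps 2 and 3, so that older texts stop influencing the score.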

The learning process of the present invention consists only of the processing in this step, so very fast learning can be achieved.

Step 8) Based on the output of the score output unit 40, the text classification unit 50 classifies the input text x as related to the designated character string when the score S(x) is smaller than a threshold θ.

Setting the threshold θ to a small value allows classification with high precision, while setting θ to a large value allows classification with high coverage. A value such as 1.0 is used for θ.
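The effect of the threshold can be shown with a few made-up scores. The `classify` helper and the score values are hypothetical; only the rule "related when S(x) is smaller than θ" comes from the description.

```python
def classify(score: float, theta: float = 1.0) -> bool:
    """Return True when the text is classified as related (S(x) < theta)."""
    return score < theta

# Hypothetical relevance scores for three posts.
scores = {"post_a": 0.2, "post_b": 0.9, "post_c": 1.4}

strict = [name for name, s in scores.items() if classify(s, theta=0.5)]
default = [name for name, s in scores.items() if classify(s)]
loose = [name for name, s in scores.items() if classify(s, theta=1.5)]
```

A smaller θ keeps only the most confident matches, while a larger θ admits more texts at the cost of more false positives.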

FIG. 4 shows an example of microblog text classification results when θ = 1.0. It can be seen that only texts related to foot-and-mouth disease are extracted.

As described above, suppose there is a sample text containing the designated character string, and let A be the compressed size of the sample text and Ax the compressed size of the sample text concatenated with the input text. If the text x is closely related to the designated character string, it is compressed efficiently and the difference between A and Ax (the degree of relevance) is small. The smaller this value, the stronger the relevance. Therefore, even for a short text x such as a microblog post, it is easy to determine whether it is related to the designated character string.
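Putting the steps together, an online classifier could look like the sketch below. The class name `CompressionClassifier`, the `max_texts` limit, joining texts without a separator, and the score formula (the same assumed ratio of smoothed size differences, since the formula is not reproduced in this text) are all assumptions for illustration.

```python
import zlib
from collections import deque

class CompressionClassifier:
    """Hypothetical sketch of the described flow for one designated string."""

    def __init__(self, designated, gamma=5.0, theta=1.0, max_texts=1000):
        self.designated = designated
        self.gamma = gamma
        self.theta = theta
        # Bounded stores: the oldest texts are dropped automatically.
        self.designated_texts = deque(maxlen=max_texts)
        self.comparison_texts = deque(maxlen=max_texts)

    @staticmethod
    def _size(text):
        return len(zlib.compress(text.encode("utf-8")))

    def score(self, x):
        if self.designated in x:
            return 0.0
        a = "".join(self.designated_texts)   # oldest first
        b = "".join(self.comparison_texts)
        num = self._size(a + x) - self._size(a) + self.gamma
        den = self._size(b + x) - self._size(b) + self.gamma
        return num / den

    def process(self, x):
        """Classify x, then store it; storing is the only learning step."""
        related = self.score(x) < self.theta
        if self.designated in x:
            self.designated_texts.append(x)
        else:
            self.comparison_texts.append(x)
        return related

clf = CompressionClassifier("#kouteieki")
stream = [  # hypothetical posts in time order
    "The prefectural governor held a press conference about culling the breeding bulls. #kouteieki",
    "Had ramen for lunch, it was delicious.",
    "Foot-and-mouth disease confirmed at a second farm, movement restrictions imposed. #kouteieki",
    "Sunny weather today, going for a walk in the park.",
    "Watching a movie tonight with friends.",
]
for post in stream:
    clf.process(post)

hit = clf.process("The prefectural governor held a press conference about culling the breeding bulls.")
miss = clf.process("Had ramen for lunch, it was delicious.")
```

Note that an untagged text is added to the comparison store even when it is classified as related, which follows the storage rule in step 7.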

The operation of the components shown in FIG. 1 can also be built as a program, which can be installed and executed on a computer used as the microblog text classification device, or distributed via a network.

The program so built can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and installed on a computer or distributed.

The present invention is a technique that can classify whether a microblog text is related to a character string designated by a user, and it can be used to support information retrieval.

DESCRIPTION OF SYMBOLS
10 Text analysis unit
20 Text concatenation unit
30 Text compression unit
40 Score output unit
50 Text classification unit
60 Designated text storage unit
70 Comparison text storage unit
80 Concatenated text storage unit

Claims

1. A microblog text classification device that classifies whether an input text of a microblog is related to an arbitrary designated character string given by a user, the device comprising:
designated text storage means for storing designated texts that include the designated character string;
comparison text storage means for storing comparison texts, which exclude the designated texts;
concatenated text storage means for storing concatenated texts;
text analysis means for determining whether the designated character string is included in the input text;
text concatenation means for outputting, to the concatenated text storage means, a designated concatenated text obtained by concatenating the text set stored in the designated text storage means, a text obtained by concatenating the designated concatenated text and the input text, a comparison concatenated text obtained by concatenating the text set stored in the comparison text storage means, and a text obtained by concatenating the comparison concatenated text and the input text;
text compression means for compressing each of the concatenated texts stored in the concatenated text storage means and obtaining the data size after compression;
score output means for obtaining a relevance score of the input text to the designated character string based on the compressed data sizes obtained by the text compression means; and
text classification means for classifying whether the input text is related to the designated character string based on the relevance score from the score output means.

2. The microblog text classification device according to claim 1, wherein the score output means includes means for obtaining the relevance score using the differences in data size after compression and a smoothing parameter that keeps the score of a short text from becoming too small.

3. A microblog text classification method for classifying whether an input text of a microblog is related to an arbitrary designated character string given by a user, the method being performed in a device having
designated text storage means for storing designated texts that include the designated character string,
comparison text storage means for storing comparison texts, which exclude the designated texts, and
concatenated text storage means for storing concatenated texts,
and comprising:
a text analysis step in which text analysis means determines whether the designated character string is included in the input text;
a text concatenation step in which text concatenation means outputs, to the concatenated text storage means, a designated concatenated text obtained by concatenating the text set stored in the designated text storage means, a text obtained by concatenating the designated concatenated text and the input text, a comparison concatenated text obtained by concatenating the text set stored in the comparison text storage means, and a text obtained by concatenating the comparison concatenated text and the input text;
a text compression step in which text compression means compresses each of the concatenated texts stored in the concatenated text storage means and obtains the data size after compression;
a score output step in which score output means obtains a relevance score of the input text to the designated character string based on the compressed data sizes obtained in the text compression step; and
a text classification step in which text classification means classifies whether the input text is related to the designated character string based on the relevance score obtained in the score output step.

4. The microblog text classification method according to claim 3, wherein, in the score output step, the relevance score is obtained using the differences in data size after compression and a smoothing parameter that keeps the score of a short text from becoming too small.

5. A microblog text classification program for causing a computer to function as each of the means constituting the microblog text classification device according to claim 1 or 2.