JP6279354B2 - Topic identification device and topic identification method

Info

Publication number: JP6279354B2
Application number: JP2014042168A
Authority: JP
Inventors: 恭輔吉田; 大和三竹; 邦興加藤; 嶋田　貴夫; 貴夫嶋田; 章人片桐
Original assignee: Ｎｔｔコムオンライン・マーケティング・ソリューション株式会社
Priority date: 2014-03-04
Filing date: 2014-03-04
Publication date: 2018-02-14
Anticipated expiration: 2034-03-04
Also published as: JP2015169969A

Description

The present invention relates to a topic identification device and a topic identification method.

The following kind of topic word extraction device is known. It extracts important words from posts obtained from a social media server, excluding demonstrative pronouns, words used in greetings, and words related to the season, and then extracts topic words based on an importance score calculated for each extracted important word (see, for example, Patent Document 1).

Patent Document 1: JP 2013-69246 A

However, a text such as a post contains multiple words, and these words often relate to one another to form a single topic. Extracting only the important words in a text may therefore fail to yield, with high accuracy, the topic words needed to identify the topic of that text.

A topic identification device according to the present invention comprises: posted information acquisition means for acquiring posted information in which text information posted by users is accumulated; word extraction means for analyzing the posted information acquired by the posted information acquisition means and extracting the words contained in it; appearance frequency calculation means for calculating, for each word extracted by the word extraction means, its appearance frequency in the posted information; top frequent word identification means for identifying, based on the per-word appearance frequencies calculated by the appearance frequency calculation means, a preset number of words with the highest appearance frequencies as top frequent words; connection strength calculation means for calculating, for each top frequent word identified by the top frequent word identification means, a two-word connection strength indicating the strength of its connection with each of the other top frequent words; top frequent word pair extraction means for extracting, as top frequent word pairs, combinations of top frequent words whose connection strength calculated by the connection strength calculation means is equal to or greater than a preset threshold; core word identification means for identifying the two words of the top frequent word pair with the greatest connection strength, among the pairs extracted by the top frequent word pair extraction means, as the words forming the core of a topic contained in the posted information (hereinafter "core words"); related word identification means for identifying, for each of the two core words identified by the core word identification means, the top frequent words whose connection strength calculated by the connection strength calculation means is equal to or greater than a predetermined threshold as related words associated with that core word; and topic constituent word group identification means for identifying the two core words identified by the core word identification means and the related words identified by the related word identification means as one topic constituent word group that constitutes a topic.
A topic identification method according to the present invention causes a computer to execute: a posted information acquisition procedure of acquiring posted information in which text information posted by users is accumulated; a word extraction procedure of analyzing the posted information acquired in the posted information acquisition procedure and extracting the words contained in it; an appearance frequency calculation procedure of calculating, for each word extracted in the word extraction procedure, its appearance frequency in the posted information; a top frequent word identification procedure of identifying, based on the per-word appearance frequencies calculated in the appearance frequency calculation procedure, a preset number of words with the highest appearance frequencies as top frequent words; a connection strength calculation procedure of calculating, for each top frequent word identified in the top frequent word identification procedure, a two-word connection strength indicating the strength of its connection with each of the other top frequent words; a top frequent word pair extraction procedure of extracting, as top frequent word pairs, combinations of top frequent words whose connection strength calculated in the connection strength calculation procedure is equal to or greater than a preset threshold; a core word identification procedure of identifying the two words of the top frequent word pair with the greatest connection strength, among the pairs extracted in the top frequent word pair extraction procedure, as the words forming the core of a topic contained in the posted information (hereinafter "core words"); a related word identification procedure of identifying, for each of the two core words identified in the core word identification procedure, the top frequent words whose connection strength calculated in the connection strength calculation procedure is equal to or greater than a predetermined threshold as related words associated with that core word; and a topic constituent word group identification procedure of identifying the two core words identified in the core word identification procedure and the related words identified in the related word identification procedure as one topic constituent word group that constitutes a topic.

According to the present invention, the two words of the top frequent word pair with the greatest two-word connection strength are identified as core words; for each core word, the top frequent words whose connection strength is equal to or greater than a predetermined threshold are identified as related words; and the two core words and the related words are identified as one topic constituent word group that constitutes a topic. Words for identifying a topic can therefore be extracted with high accuracy, taking into account the connection strength among the multiple words contained in the texts posted by users.

FIG. 1 is a block diagram showing the configuration of one embodiment of the topic identification device 100.
FIG. 2 is a diagram showing an example calculation of the connection strength between two words.
FIG. 3 is a diagram showing a specific example of topic text extraction result information.
FIG. 4 is a flowchart showing the flow of processing executed by the topic identification device 100.

FIG. 1 is a block diagram showing the configuration of one embodiment of the topic identification device 100. An information processing device such as a server or a personal computer is used as the topic identification device 100; FIG. 1 shows the configuration of an embodiment in which a server is used. The topic identification device 100 includes an operation member 101, a connection IF (interface) 102, a control device 103, and a recording device 104.

The operation member 101 includes various devices operated by the operator of the topic identification device 100, such as a keyboard and a mouse.

The connection IF 102 is an interface for connecting the topic identification device 100 to a communication line such as a LAN or the Internet. For example, a wired LAN module for a wired connection to the LAN or a wireless LAN module for a wireless connection to the LAN is used.

The control device 103 consists of a CPU, memory, and other peripheral circuits, and controls the topic identification device 100 as a whole. The memory of the control device 103 is volatile memory such as SDRAM. It is used as working memory into which the CPU loads a program at execution time and as buffer memory for temporarily holding data.

The recording device 104 records the various data stored by the topic identification device 100, the program data executed by the control device 103, and the like; for example, an HDD (Hard Disk Drive) or SSD (Solid State Drive) is used. The program data recorded in the recording device 104 is provided on a recording medium such as a CD-ROM or DVD-ROM, or via a network. When the operator installs the obtained program data in the recording device 104, the control device 103 becomes able to execute the program.

The topic identification device 100 of this embodiment analyzes texts that users have posted and published through a service for entering and publishing text on the web, and executes processing to identify what content is currently being posted as the main topics. The service assumed here is one in which users post short texts and publish them on the web, such as Twitter (registered trademark).

To analyze the texts posted by users, in this embodiment the control device 103 first acquires posted information, in which the text information posted by users is accumulated, from an external server operated by the provider of such a service. For example, the topic identification device 100 implements an API (application programming interface) for acquiring posted information from the external server operated by the service provider. The control device 103 invokes the API at a preset time interval, acquires the posted information from the external server via the connection IF 102, and records it in the recording device 104.

The control device 103 filters the posted information recorded in the recording device 104 to remove text information that is not needed for identifying the topic constituent word groups in the processing described later. This removes text information that is likely to be unrelated to topic identification. In this embodiment, the filtering applies the determinations described below and removes, as unnecessary text information, posts that are news, posts that are advertisements, retweets, posts submitted mechanically in bulk (bots), posts unsuited to the purpose of topic identification, posts whose content duplicates other texts, and posts whose hiragana ratio is at or below a predetermined value.

To determine whether a post is news, a news pattern list enumerating text patterns typical of news posts is prepared in advance. The control device 103 refers to the news pattern list and applies regular-expression pattern matching to each item of text information in the posted information; if a matching character string is found in an item, that item is judged to be news and removed from the posted information. In addition, a news account list enumerating the accounts of posters who post news is prepared in advance. The control device 103 refers to the news account list, and any text information in the posted information that was posted by an account on the list is judged to be news and removed from the posted information.
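The two news checks described above (pattern list and account list) can be sketched roughly as follows. This is only an illustration: the patterns, the account name, and the `account`/`text` field names are hypothetical, since the patent does not disclose the actual list contents or data layout.

```python
import re

# Hypothetical list contents; the patent does not give the real patterns or accounts.
NEWS_PATTERNS = [re.compile(r"^【速報】"), re.compile(r"ニュース\s*[:：]")]
NEWS_ACCOUNTS = {"example_news"}


def is_news(post: dict) -> bool:
    """True if the post comes from a listed account or its text matches a listed pattern."""
    if post["account"] in NEWS_ACCOUNTS:
        return True
    return any(p.search(post["text"]) for p in NEWS_PATTERNS)


def remove_news(posts: list) -> list:
    """Return the posted information with news items removed."""
    return [p for p in posts if not is_news(p)]
```

The same shape would apply to the advertisement and bot checks, with their own pattern and account lists substituted.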

To determine whether a post is an advertisement, an advertisement pattern list enumerating text patterns typical of advertisement posts is prepared in advance. The control device 103 refers to the advertisement pattern list and applies regular-expression pattern matching to each item of text information in the posted information; if a matching character string is found in an item, that item is judged to be an advertisement and removed from the posted information. In addition, an advertisement account list enumerating the accounts of posters who post advertisements is prepared in advance. The control device 103 refers to the advertisement account list, and any text information in the posted information that was posted by an account on the list is judged to be an advertisement and removed from the posted information.

To determine whether a post is a retweet, each item of text information is checked: if its body begins with "RT", the item is judged to be a retweet and removed from the posted information.

To determine whether a post was submitted mechanically in bulk (a bot), a bot pattern list enumerating text patterns typical of bot posts is prepared in advance. The control device 103 refers to the bot pattern list and applies regular-expression pattern matching to each item of text information in the posted information; if a matching character string is found in an item, that item is judged to be a bot post and removed from the posted information. In addition, a bot account list enumerating the accounts of posters who post as bots is prepared in advance. The control device 103 refers to the bot account list, and any text information in the posted information that was posted by an account on the list is judged to be a bot post and removed from the posted information. Further, a bot profile text list enumerating text patterns likely to be used in the profile text of bot posters is prepared in advance. The control device 103 refers to the bot profile text list and applies regular-expression pattern matching to each poster's profile text; if a matching character string is found in the profile text, the text information posted by that poster is judged to be bot posts and removed from the posted information. Finally, if the screen name set as an alias of a poster's account name contains the character string "bot", the control device 103 judges the text information posted by that poster to be bot posts and removes it from the posted information.
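The two poster-level bot checks (profile text and screen name) might look like the sketch below. The profile pattern is a hypothetical example, and matching "bot" case-insensitively is an assumption; the text only says the screen name contains the string "bot".

```python
import re

# Hypothetical profile patterns; the actual bot profile text list is not disclosed.
BOT_PROFILE_PATTERNS = [re.compile(r"自動(投稿|返信)")]


def is_bot_author(screen_name: str, profile: str) -> bool:
    """True if the screen name contains "bot" or the profile matches a listed pattern."""
    # Assumption: the comparison ignores case ("BOT", "Bot", "bot" all count).
    if "bot" in screen_name.lower():
        return True
    return any(p.search(profile) for p in BOT_PROFILE_PATTERNS)
```

Every post by an author for whom `is_bot_author` returns `True` would then be dropped from the posted information.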

To determine whether a post is unsuited to the topic identification purpose of this embodiment, the control device 103 removes URLs, screen names, and e-mail addresses from the body of each item of text information. If the remaining body contains none of the preset keywords, the item is judged to be unsuited to the topic identification purpose and removed from the posted information.
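A minimal sketch of this suitability check follows. The regular expressions for URLs, screen names (assumed here to be written as `@name`), and e-mail addresses are simplified assumptions, not patterns taken from the patent, and the keyword list is supplied by the caller.

```python
import re

# Simplified, assumed patterns for the three kinds of strings to strip.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", re.ASCII)
MENTION_RE = re.compile(r"@\w+", re.ASCII)


def strip_noise(text: str) -> str:
    """Remove URLs, e-mail addresses, and @screen_names from the body."""
    text = URL_RE.sub("", text)
    text = EMAIL_RE.sub("", text)    # before mentions, so "@domain" is not half-removed
    return MENTION_RE.sub("", text)


def is_on_topic(text: str, keywords: list) -> bool:
    """True if any preset keyword survives in the stripped body."""
    body = strip_noise(text)
    return any(k in body for k in keywords)
```

Stripping first matters: a keyword that appears only inside a URL or a screen name does not make the post count as relevant.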

To determine whether a post duplicates another text, the control device 103 converts the body of each item of text information to a hash value, one item at a time. Items with the same hash value are regarded as having duplicate wording; one of them is kept and the others are removed from the posted information.
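Hash-based deduplication of this kind can be sketched as below. The choice of SHA-256 and of keeping the first occurrence are assumptions; the text specifies neither the hash function nor which duplicate is kept.

```python
import hashlib


def deduplicate(bodies: list) -> list:
    """Keep one item per distinct body, comparing bodies by hash value."""
    seen = set()
    kept = []
    for body in bodies:
        # Assumption: SHA-256 over the UTF-8 bytes of the body.
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(body)
    return kept
```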

To determine whether a post has a hiragana ratio at or below the predetermined value, the control device 103 calculates, for each item of text information, the ratio of the number of hiragana characters in the body to the total number of characters in the body. If the calculated ratio is at or below a predetermined ratio, for example 10%, the item is judged to have a hiragana ratio at or below the predetermined value and is removed from the posted information.
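The hiragana ratio check can be sketched as follows. Treating U+3041 to U+3096 as the hiragana characters and counting every character of the body (including spaces and symbols) in the denominator are assumptions made for this illustration.

```python
def hiragana_ratio(text: str) -> float:
    """Share of characters in the body that are hiragana (assumed range U+3041..U+3096)."""
    if not text:
        return 0.0
    count = sum(1 for ch in text if "\u3041" <= ch <= "\u3096")
    return count / len(text)


def keep_post(text: str, threshold: float = 0.10) -> bool:
    """A post is removed when its ratio is at or below the threshold (10% in the example)."""
    return hiragana_ratio(text) > threshold
```

For instance, "ゴールは近いようだ" has 5 hiragana among 9 characters, so it is kept, whereas a body made only of kanji, such as an address, is removed.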

The control device 103 then processes the posted information that remains after the filtering described above, as follows.

The control device 103 performs morphological analysis on the text information contained in the posted information, extracts only words of specific parts of speech, and removes preset characters or character strings from the extracted words. In this embodiment, the control device 103 uses a known morphological analysis engine to extract nouns, adjectives, and adjectival nouns (keiyō-dōshi) from the posted information, and then identifies and removes non-independent words, single digits, and single alphabetic characters from the extracted words. Whether an extracted word is an independent word or a non-independent word is decided during morphological analysis: each extracted word carries information indicating which of the two it is, and the control device 103 identifies and removes the non-independent words based on that information.

The extraction of specific parts of speech and the removal of the words to be removed are explained using the text "ゴールは近いようだ" ("The goal seems near") as an example. Morphological analysis splits this text into five words, "ゴール", "は", "近い", "よう", and "だ", and analyzes "ゴール" as a noun, "は" as a particle (binding particle), "近い" as an adjective, "よう" as a noun, and "だ" as an auxiliary verb. The nouns, adjectives, and adjectival nouns, namely "ゴール", "近い", and "よう", are extracted as the result. Each extracted word carries information indicating that "ゴール" and "近い" are independent words and "よう" is a non-independent word, so the control device 103 removes the non-independent word "よう". As a result, the two words "ゴール" (goal) and "近い" (near) are finally extracted from this text as the words to be processed in the topic constituent word identification described below.
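The selection rules in this example can be sketched on already-analyzed morphemes. The `Morpheme` structure and the part-of-speech labels are hypothetical stand-ins for the output of whichever morphological analysis engine is used; no real engine API is assumed here.

```python
import unicodedata
from typing import NamedTuple


class Morpheme(NamedTuple):
    surface: str        # the word as it appears in the text
    pos: str            # part of speech reported by the analyzer (hypothetical labels)
    independent: bool   # True for independent words, False for non-independent words


TARGET_POS = {"noun", "adjective", "adjectival noun"}


def is_single_digit_or_letter(surface: str) -> bool:
    """True for one digit or one alphabetic character, half-width or full-width."""
    normalized = unicodedata.normalize("NFKC", surface)
    return len(surface) == 1 and normalized.isascii() and normalized.isalnum()


def select_words(morphemes: list) -> list:
    """Keep target parts of speech, then drop non-independent words and single digits/letters."""
    return [
        m.surface
        for m in morphemes
        if m.pos in TARGET_POS and m.independent and not is_single_digit_or_letter(m.surface)
    ]
```

Feeding in the five morphemes of "ゴールは近いようだ" with the tags given above leaves "ゴール" and "近い".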

The control device 103 calculates the appearance frequency of each word extracted from the posted information as a processing target above, and identifies the ten words with the highest appearance frequencies as the top frequent words.
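Selecting the top frequent words could be sketched as below. Counting every occurrence of a word (rather than the number of posts containing it) is an assumption, as the text does not say which is meant by appearance frequency.

```python
from collections import Counter


def top_frequent_words(docs: list, n: int = 10) -> list:
    """docs: one list of extracted words per post. Returns the n most frequent words."""
    # Assumption: frequency = total occurrences across all posts.
    counts = Counter(word for words in docs for word in words)
    return [word for word, _ in counts.most_common(n)]
```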

For each of the identified top frequent words, the control device 103 calculates the two-word connection strength indicating the strength of its connection with each of the other top frequent words. Specifically, the control device 103 uses Equation (1) to calculate an index value R(A, B) indicating the strength of the connection between two words, word A and word B. For the ten top frequent words, this yields index values R(A, B) for every two-word combination, 45 in total. In Equation (1), U is the number of items of text information contained in the posted information, df(A) is the number of items of text information containing word A, df(B) is the number of items of text information containing word B, and df(A∩B) is the number of items of text information containing both word A and word B.
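The quantities that enter the index value, U, df(A), df(B), and df(A∩B), can be gathered as sketched below. Equation (1) itself is not available in this text, so the index function is passed in by the caller; the Jaccard coefficient shown is only a hypothetical stand-in for illustration and should not be read as the patent's formula.

```python
from itertools import combinations


def document_frequencies(docs: list, words: list):
    """docs: one set of words per post. Returns U, df per word, and df per word pair."""
    U = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in words}
    df_pair = {
        (a, b): sum(1 for d in docs if a in d and b in d)
        for a, b in combinations(words, 2)
    }
    return U, df, df_pair


def jaccard_stand_in(U: int, df_a: int, df_b: int, df_ab: int) -> float:
    """Hypothetical placeholder index; NOT Equation (1) of the patent."""
    union = df_a + df_b - df_ab
    return df_ab / union if union else 0.0


def pair_indices(docs: list, words: list, index_fn) -> dict:
    """Evaluate index_fn(U, df(A), df(B), df(A∩B)) for every unordered word pair."""
    U, df, df_pair = document_frequencies(docs, words)
    return {(a, b): index_fn(U, df[a], df[b], n) for (a, b), n in df_pair.items()}
```

With ten words, `combinations(words, 2)` produces the 45 unordered pairs mentioned above.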

The control device 103 calculates the two-word connection strengths by converting the 45 index values R(A, B) calculated with Equation (1) into deviation values.
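The conversion to deviation values can be sketched as below, assuming the conventional definition of a deviation value, 50 + 10 × (x − mean) / standard deviation, computed with the population standard deviation over all 45 index values. The text does not spell out the formula, so this definition is an assumption.

```python
from statistics import fmean, pstdev


def deviation_scores(index_values: dict) -> dict:
    """Map each pair's index value x to 50 + 10 * (x - mean) / population std dev."""
    xs = list(index_values.values())
    mu = fmean(xs)
    sigma = pstdev(xs)
    if sigma == 0:
        # All index values are identical: every pair sits exactly at the mean.
        return {pair: 50.0 for pair in index_values}
    return {pair: 50.0 + 10.0 * (x - mu) / sigma for pair, x in index_values.items()}
```

Under this definition, a threshold such as 49.0 means "at most slightly below the average connection among the top frequent words".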

The control device 103 extracts, as top frequent word pairs, the combinations of top frequent words whose two-word connection strength is equal to or greater than a preset threshold. Among the extracted top frequent word pairs, it identifies the two words of the pair with the greatest connection strength as the words forming the core of a main topic contained in the posted information (hereinafter "core words"). For each of the two core words, it further identifies the top frequent words whose connection strength with that core word is equal to or greater than a predetermined threshold as related words associated with the core word. The control device 103 identifies the two core words and their related words found in this way as one topic constituent word group that constitutes a topic.

After identifying one topic constituent word group, the control device 103 excludes the top frequent words already identified as core words and the words already identified as related words, and identifies the two words of the pair with the greatest connection strength among the remaining top frequent word pairs as new core words. It then identifies the related words for the new core words from among the top frequent words excluding those already identified as core words, and in this way identifies the second and subsequent topic constituent word groups. The control device 103 repeats this processing until no further top frequent word pair can be identified.
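The repeated identification of topic constituent word groups can be sketched as follows. Using a single threshold for both pair extraction and related-word identification, the representation of strengths as a dictionary keyed by `frozenset` pairs, and the alphabetical ordering of words within a group are assumptions made for this illustration.

```python
def identify_topic_groups(strength: dict, threshold: float) -> list:
    """strength maps frozenset({word_a, word_b}) to the connection strength of that pair."""
    words = set().union(*strength) if strength else set()
    used_cores = set()   # words already identified as core words
    used_any = set()     # words already identified as core or related words
    groups = []
    while True:
        # Remaining top frequent word pairs: at or above the threshold, no used word.
        candidates = [
            (s, tuple(sorted(pair)))
            for pair, s in strength.items()
            if s >= threshold and not (pair & used_any)
        ]
        if not candidates:
            return groups
        _, (a, b) = max(candidates)          # pair with the greatest connection strength
        used_cores |= {a, b}
        related = []
        for core in (a, b):
            # Related words may reuse earlier related words, but never a core word.
            for w in sorted(words - used_cores):
                s = strength.get(frozenset((core, w)), float("-inf"))
                if s >= threshold and w not in related:
                    related.append(w)
        group = [a, b] + related
        groups.append(group)
        used_any |= set(group)
```

Because related words are drawn from all non-core words, a word that was a related word in an earlier group can appear again as a related word in a later group, but it can no longer become a core word.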

An example calculation of the two-word connection strength, and an example of identifying topic constituent word groups from it, are described with reference to FIG. 2. In the example of FIG. 2, ten words are identified as the top frequent words: "Japan" (日本), "political party" (政党), "citizens" (国民), "Secrecy Protection Act" (秘密保護法), "Diet member" (議員), "prime minister" (首相), "administration" (政権), "politics" (政治), "opposition" (反対), and "bill" (法案). For each pair, the deviation value of the index value calculated with Equation (1) is given as the two-word connection strength. The identification of topic constituent word groups is explained here for the case in which a deviation value of 49.0 is set both as the threshold for extracting top frequent word pairs and as the threshold for identifying related words.

In the example of FIG. 2, the combination of top frequent words whose connection strength is at or above the threshold of 49.0 and is the greatest is "political party" and "administration", indicated by reference sign 2a. The control device 103 extracts this combination as the first top frequent word pair and identifies "political party" and "administration" as core words.

For each of the core words "political party" and "administration", the control device 103 identifies the related words whose connection strength is at or above the threshold of 49.0. In the example of FIG. 2, "prime minister", with a connection strength of 55.1, is identified as a related word of "political party", and "citizens", with a connection strength of 55.2, is identified as a related word of "administration". Based on this result, the control device 103 identifies the four words "political party", "administration", "prime minister", and "citizens" as the first topic constituent word group.

Next, among the top frequent words excluding "political party", "administration", "prime minister", and "citizens", which have already been identified as core words or related words, the combination whose connection strength is at or above the threshold of 49.0 and is the greatest is "Secrecy Protection Act" and "opposition", indicated by reference sign 2b. The control device 103 extracts this combination as the second top frequent word pair and identifies "Secrecy Protection Act" and "opposition" as core words.

For each of the core words "Secret Protection Law" (秘密保護法) and "opposition" (反対), the control device 103 identifies related words whose connection strength is at least the threshold of 49.0 from among the top frequent words, excluding those already identified as core words ("political party", "government", "Secret Protection Law", and "opposition"). In the example of FIG. 2, "prime minister" (首相), with a connection strength of 51.3, and "bill" (法案), with 49.0, are identified as related words for "Secret Protection Law", and "citizens" (国民), with a connection strength of 49.8, and "bill", with 49.3, are identified as related words for "opposition". Based on this result, the control device 103 identifies the five words "Secret Protection Law", "opposition", "prime minister", "bill", and "citizens" as the second topic constituent word group.

In the example shown in FIG. 2, no further top frequent word pair can be extracted at this stage, so the control device 103 ends the identification of topic constituent word groups. The first topic constituent word group thus shows that the four words "political party", "government", "prime minister", and "citizens" form the first group of words that are the main subject of the sentence information contained in the posted information. Likewise, the second topic constituent word group shows that the five words "Secret Protection Law", "opposition", "prime minister", "bill", and "citizens" form the second such group of words.
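The walkthrough of FIG. 2 amounts to a greedy loop, which can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `find_topic_groups`, `TOP_WORDS`, and `STRENGTH` are hypothetical; the strengths 55.1, 55.2, 51.3, 49.0, 49.8, and 49.3 are the values quoted from FIG. 2, whereas the two core-pair strengths (60.0 and 58.0) are placeholders assumed here because the text does not state them; and every pair not listed is assumed to fall below the threshold.

```python
from itertools import combinations

THRESHOLD = 49.0  # threshold quoted in the description

# Only seven of the ten top frequent words are named in the text.
TOP_WORDS = ["政党", "政権", "首相", "国民", "秘密保護法", "反対", "法案"]

STRENGTH = {
    frozenset(("政党", "政権")): 60.0,        # ASSUMED placeholder (not given in the text)
    frozenset(("秘密保護法", "反対")): 58.0,  # ASSUMED placeholder (not given in the text)
    frozenset(("政党", "首相")): 55.1,
    frozenset(("政権", "国民")): 55.2,
    frozenset(("秘密保護法", "首相")): 51.3,
    frozenset(("秘密保護法", "法案")): 49.0,
    frozenset(("反対", "国民")): 49.8,
    frozenset(("反対", "法案")): 49.3,
}


def strength(table, a, b):
    """Connection strength of an unordered pair; unlisted pairs count as 0."""
    return table.get(frozenset((a, b)), 0.0)


def find_topic_groups(top_words, table, threshold=THRESHOLD):
    core_used = set()  # every word chosen as a core word so far
    used = set()       # core words and related words of earlier groups
    groups = []
    while True:
        # Candidate pairs: words not yet used, strength at or above threshold.
        free = [w for w in top_words if w not in used]
        pairs = [(strength(table, a, b), a, b)
                 for a, b in combinations(free, 2)
                 if strength(table, a, b) >= threshold]
        if not pairs:
            break
        _, a, b = max(pairs, key=lambda p: p[0])  # strongest pair becomes the core
        core_used.update((a, b))
        group = [a, b]
        for core in (a, b):
            # Related words may repeat across groups, but core words may not.
            related = [w for w in top_words
                       if w not in core_used and strength(table, core, w) >= threshold]
            related.sort(key=lambda w: -strength(table, core, w))
            for w in related:
                if w not in group:
                    group.append(w)
        used.update(group)
        groups.append(group)
    return groups


groups = find_topic_groups(TOP_WORDS, STRENGTH)
```

With these inputs the sketch yields the same two word groups as the description: related words such as "prime minister" and "citizens" appear in both groups because only core words are excluded when related words are searched.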

From the posted information recorded in the recording device 104, the control device 103 extracts, as topic sentence information, the sentence information that matches the topic corresponding to a topic constituent word group, based on the core words and related words contained in the topic constituent word group identified by the processing described above. The processing that the control device 103 executes to extract topic sentence information is described below.

The control device 103 calculates, as the coverage rate of the topic constituent words, the ratio of the number of distinct core words and related words contained in a piece of sentence information (duplicates excluded) to the total number of core words and related words in the topic constituent word group. A calculation example follows, using two pieces of sentence information contained in the posted information. The first is 「秘密保護法が成立した。この法案は国民の多くが反対している」 ("The Secret Protection Law has been enacted. Many citizens oppose this bill."). The second is 「秘密保護法には反対だ。これだけ多くの国民が判定しているのに可決されるのはおかしい。もっと国民の意見を尊重すべきだ。」 ("I am against the Secret Protection Law. It is strange that it is passed when this many citizens are passing judgment on it. The opinions of the citizens should be respected more.").

Consider the first topic constituent word group. The sentence information "The Secret Protection Law has been enacted. Many citizens oppose this bill." contains one of the four words "political party" (政党), "government" (政権), "prime minister" (首相), and "citizens" (国民), namely one occurrence of "citizens". In this case 1/4 = 0.25, so the coverage rate is 25%. The sentence information "I am against the Secret Protection Law. It is strange that it is passed when this many citizens are passing judgment on it. The opinions of the citizens should be respected more." contains "citizens" twice. Because "citizens" is duplicated, the duplicate is excluded and the word is counted once, giving 1/4 = 0.25 and a coverage rate of 25%.

Next, consider the second topic constituent word group. The sentence information "The Secret Protection Law has been enacted. Many citizens oppose this bill." contains four of the five words "Secret Protection Law" (秘密保護法), "opposition" (反対), "prime minister" (首相), "bill" (法案), and "citizens" (国民), namely "Secret Protection Law", "opposition", "bill", and "citizens", once each. In this case 4/5 = 0.8, so the coverage rate is 80%. The sentence information "I am against the Secret Protection Law. It is strange that it is passed when this many citizens are passing judgment on it. The opinions of the citizens should be respected more." contains "Secret Protection Law" and "opposition" once each and "citizens" twice. Because "citizens" is duplicated, the duplicate is excluded and the word is counted once, giving 3/5 = 0.6 and a coverage rate of 60%.
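The four coverage rates worked out above can be reproduced with a short Python sketch. It is a simplification under stated assumptions: `coverage_rate` is a hypothetical name, and plain substring matching stands in for the morphological analysis that the device actually performs, which happens to be sufficient for these two example sentences.

```python
def coverage_rate(group, text):
    """Share of the group's distinct words that occur at least once in text."""
    present = {w for w in group if w in text}  # a set, so duplicates count once
    return len(present) / len(group)


GROUP_1 = ["政党", "政権", "首相", "国民"]
GROUP_2 = ["秘密保護法", "反対", "首相", "法案", "国民"]

SENTENCE_1 = "秘密保護法が成立した。この法案は国民の多くが反対している"
SENTENCE_2 = ("秘密保護法には反対だ。これだけ多くの国民が判定しているのに"
              "可決されるのはおかしい。もっと国民の意見を尊重すべきだ。")

rates = [coverage_rate(g, s)
         for g in (GROUP_1, GROUP_2)
         for s in (SENTENCE_1, SENTENCE_2)]
# rates holds 0.25, 0.25, 0.8, 0.6, matching the 25%, 25%, 80%, 60% in the text
```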

The control device 103 also scores each piece of sentence information that contains at least one of the core words and related words in the topic constituent word group, as follows. In the present embodiment, the control device 103 calculates three scores for each piece of sentence information (a word frequency score, a noun content rate score, and a hiragana content rate score) and scores the sentence information based on these three scores.

First, the method for calculating the word frequency score will be described. The control device 103 treats as target sentence information each piece of sentence information in the posted information that contains at least one of the core words and related words in the topic constituent word group. As in the processing described above, it extracts nouns, adjectives, and adjectival nouns (形容動詞) from the target sentence information and removes non-independent words, single digits, and single alphabetic characters from the extracted words. For the remaining words, it calculates the appearance frequency of each word across all of the target sentence information and sets a weight value for each word such that a word with a higher appearance frequency receives a higher weight. For each piece of target sentence information, the control device 103 then calculates the word frequency score by adding up the weight values set for the words it contains.
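A rough Python sketch of the word frequency score follows. The description only requires that more frequent words receive higher weights, so two choices here are assumptions: the weight of a word is taken to be its raw count across all target sentences, and each distinct word in a sentence contributes its weight once. The function name `word_frequency_scores` is hypothetical, and the input is assumed to be already tokenized and filtered.

```python
from collections import Counter


def word_frequency_scores(tokenized_sentences):
    """Return one word frequency score per target sentence.

    tokenized_sentences: list of word lists, already restricted to the
    parts of speech kept by the device (tokenization is out of scope here).
    """
    # ASSUMPTION: weight = number of occurrences across all target sentences.
    weights = Counter(w for sentence in tokenized_sentences for w in sentence)
    # ASSUMPTION: each distinct word in a sentence is added once.
    return [sum(weights[w] for w in set(sentence))
            for sentence in tokenized_sentences]


scores = word_frequency_scores([["a", "b"], ["a", "c"], ["a"]])
```

Any other monotonically increasing weighting (for example, a rank-based weight) would satisfy the description equally well.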

Next, the noun content rate score will be described. For each piece of target sentence information, the control device 103 calculates the noun content rate by the following equation (2), based on the number of words and the number of nouns contained in that piece of target sentence information.
Noun content rate = number of nouns / total number of words ... (2)
As shown in the following equation (3), the control device 103 calculates the reciprocal of the noun content rate obtained by equation (2) as the noun content rate score.
Noun content rate score = 1 / noun content rate ... (3)

Next, the hiragana content rate score will be described. For each piece of target sentence information, the control device 103 calculates the hiragana content rate score by the following equation (4), based on the number of characters and the number of hiragana characters contained in that piece of target sentence information.
Hiragana content rate score = number of hiragana characters / total number of characters ... (4)

Based on the word frequency score, the noun content rate score, and the hiragana content rate score calculated by the processing described above, the control device 103 calculates the score Score(di) of each piece of sentence information by the following equation (5). In equation (5), Score_freq(di) is the word frequency score, Score_noun(di) is the noun content rate score, and Score_hira(di) is the hiragana content rate score. α, β, and γ are mixing ratios; here α = β = γ = 1/3, so that the three scores are multiplied together in equal proportion.
Score(di) = Score_freq(di)^α · Score_noun(di)^β · Score_hira(di)^γ ... (5)
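Equations (2) to (5) can be sketched in Python as below. The function names are hypothetical. Two points are assumptions rather than statements from the description: hiragana is detected by the Unicode range U+3041 to U+309F, and the caller is expected to pass non-empty text containing at least one noun, since the description does not say how a zero denominator is handled.

```python
def noun_content_score(n_nouns, n_words):
    """Equations (2) and (3): reciprocal of the noun content rate."""
    rate = n_nouns / n_words   # equation (2)
    return 1.0 / rate          # equation (3); requires n_nouns > 0


def hiragana_content_score(text):
    """Equation (4): hiragana characters divided by all characters."""
    # ASSUMPTION: hiragana means code points U+3041..U+309F.
    n_hira = sum(1 for ch in text if "\u3041" <= ch <= "\u309f")
    return n_hira / len(text)  # requires non-empty text


def combined_score(freq, noun, hira, alpha=1/3, beta=1/3, gamma=1/3):
    """Equation (5): weighted product of the three scores."""
    return freq ** alpha * noun ** beta * hira ** gamma


noun_score = noun_content_score(n_nouns=2, n_words=5)
hira_score = hiragana_content_score("あいう漢字")
total = combined_score(8.0, 1.0, 1.0)
```

With α = β = γ = 1/3 the combined score is the geometric mean of the three component scores, so a sentence scoring near zero on any one component receives a low overall score.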

Based on the coverage rate and score described above, the control device 103 extracts from the posted information, for each topic constituent word group, the sentence information that matches the topic represented by that group. The number of pieces of sentence information to extract is set in advance; for example, three pieces are extracted for each topic constituent word group.

To extract from the posted information the sentence information that matches the topic represented by each topic constituent word group, the control device 103 first extracts sentence information whose coverage rate is higher than a preset threshold, up to the set extraction count. If the set number of pieces of sentence information has been extracted at this stage, extraction is complete.

If the extracted sentence information falls short of the set extraction count, the control device 103 additionally extracts sentence information whose score calculated by equation (5) is higher than a preset threshold, until the extraction count is met.
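The two-stage selection can be sketched as follows. The function name `select_topic_sentences`, the tuple layout, and both threshold values are hypothetical; the description only says that the thresholds are preset. Taking the highest values first within each stage is also an assumption, made because the stated effect is to extract sentences with high coverage rates and high scores.

```python
def select_topic_sentences(candidates, quota=3, coverage_min=0.5, score_min=1.0):
    """candidates: list of (text, coverage_rate, score) tuples.

    coverage_min and score_min are ASSUMED placeholder thresholds.
    """
    # Stage 1: sentences whose coverage rate exceeds the threshold.
    by_coverage = sorted((c for c in candidates if c[1] > coverage_min),
                         key=lambda c: c[1], reverse=True)
    picked = by_coverage[:quota]
    # Stage 2: only if stage 1 fell short, fill up using the equation (5) score.
    if len(picked) < quota:
        rest = sorted((c for c in candidates
                       if c not in picked and c[2] > score_min),
                      key=lambda c: c[2], reverse=True)
        picked += rest[:quota - len(picked)]
    return [text for text, _, _ in picked]


selected = select_topic_sentences([
    ("a", 0.8, 0.5),
    ("b", 0.6, 2.0),
    ("c", 0.2, 3.0),
    ("d", 0.2, 1.5),
    ("e", 0.1, 0.2),
])
```

Here "a" and "b" pass the coverage stage, and "c" fills the remaining slot because it has the highest score among the sentences left over.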

After extracting sentence information for each topic constituent word group based on the coverage rate and score, the control device 103 generates topic sentence extraction result information that lists the extracted topic sentence information for each topic constituent word group. As shown in FIG. 3, for example, the topic sentence extraction result information is generated as a text file that enumerates the words in a topic constituent word group and then displays the topic sentence information extracted as matching the topic those words represent. In FIG. 3, the topic constituent word group political party / government / prime minister / citizens is listed as the first topic (topic 1), with three pieces of topic sentence information matching topic 1 displayed below it. The topic constituent word group Secret Protection Law / opposition / prime minister / bill / citizens is listed as the second topic (topic 2), with three pieces of topic sentence information matching topic 2 displayed below it. If no topic constituent word group can be extracted, or if not even one piece of sentence information can be extracted based on the coverage rate and score, the control device 103 may generate topic sentence extraction result information containing a text message indicating that no main topic was present in the posted information.

If the topic sentence extraction result information generated by the processing described above is provided to companies or individuals who want to know what topics are currently being posted, the information generated by the topic identification device 100 of the present embodiment can be put to effective use. In addition, if the control device 103 creates the topic sentence extraction result information at predetermined time intervals and sends it to e-mail addresses registered in advance, the latest information on what topics are being posted can always be delivered to those who want it.

FIG. 4 is a flowchart showing the flow of processing executed by the topic identification device 100 of the present embodiment. The processing shown in FIG. 4 is executed by the control device 103 as a program that starts when it is time to create the topic sentence extraction result information. The processing shown in FIG. 4 assumes that the filtering of the posted information described above has already been completed and that the filtered posted information is recorded in the recording device 104.

In step S10, the control device 103 performs morphological analysis on the sentence information contained in the posted information, extracts nouns, adjectives, and adjectival nouns (形容動詞) from the posted information, and then removes the removal target words described above to obtain the words to be processed. The processing then proceeds to step S20.

In step S20, the control device 103 calculates the appearance frequency of each word extracted from the posted information and identifies the ten words with the highest appearance frequency as top frequent words. The processing then proceeds to step S30.
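Step S20 can be sketched with the Python standard library as shown below. The function name `top_frequent_words` is hypothetical, the input is assumed to be the already tokenized and filtered output of step S10, and counting every occurrence of a word (rather than the number of posts containing it) is an assumption, since the description does not specify which is meant.

```python
from collections import Counter


def top_frequent_words(tokenized_posts, n=10):
    """Return the n most frequent words across all posts, most frequent first."""
    # ASSUMPTION: appearance frequency = total number of occurrences.
    counts = Counter(word for post in tokenized_posts for word in post)
    return [word for word, _ in counts.most_common(n)]


top_two = top_frequent_words([["a", "b", "a"], ["b", "a", "c"]], n=2)
```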

In step S30, for each of the top frequent words identified in step S20, the control device 103 calculates the two-word connection strength, which indicates how strongly that word is connected to each of the other top frequent words. The processing then proceeds to step S40.

In step S40, the control device 103 extracts, as top frequent word pairs, the combinations of top frequent words whose two-word connection strength is at least a preset threshold. The processing then proceeds to step S50.

In step S50, the control device 103 identifies as core words the two words that make up the top frequent word pair with the greatest connection strength among the pairs extracted in step S40. For each of the two core words, it identifies the top frequent words whose connection strength is at least a predetermined threshold as related words of that core word. It then identifies the two core words and their related words as one topic constituent word group that makes up a topic. The processing then proceeds to step S60.

In step S60, as described above, the control device 103 determines whether a further top frequent word pair can be identified after excluding the top frequent words already identified as core words. If the determination in step S60 is affirmative, the processing returns to step S40. If the determination in step S60 is negative, the processing proceeds to step S70.

In step S70, as described above, the control device 103 calculates, for each topic constituent word group, the coverage rate of the topic constituent words: the ratio of the number of distinct core words and related words contained in a piece of sentence information (duplicates excluded) to the total number of core words and related words in the topic constituent word group. The processing then proceeds to step S80.

In step S80, as described above, for each topic constituent word group the control device 103 takes the sentence information that contains at least one of the core words and related words in the group, calculates the word frequency score, the noun content rate score, and the hiragana content rate score, and calculates the score Score(di) of each piece of sentence information by equation (5). The processing then proceeds to step S90.

In step S90, as described above, the control device 103 extracts from the posted information, for each topic constituent word group, the sentence information that matches the topic represented by that group, based on the calculated coverage rate and score. The processing then proceeds to step S100.

In step S100, the control device 103 generates, for each topic constituent word group, topic sentence extraction result information that lists the extracted topic sentence information, and records it in the recording device 104. The processing then ends.

According to the present embodiment described above, the following operational effects can be obtained.
(1) The control device 103 acquires posted information in which sentence information posted by users is accumulated, analyzes the acquired posted information to extract the words it contains, calculates the appearance frequency of each extracted word within the posted information, and, based on the calculated appearance frequencies, identifies a preset number of words with the highest appearance frequencies as top frequent words. For each of the identified top frequent words, the control device 103 then calculates the two-word connection strength, which indicates how strongly that word is connected to each of the other top frequent words, and extracts as top frequent word pairs the combinations of top frequent words whose connection strength is at least a preset threshold. Among the extracted pairs, it identifies the two words of the pair with the greatest connection strength as core words. For each of the two core words, it identifies the top frequent words whose connection strength is at least a predetermined threshold as related words of that core word, and it identifies the two core words and the related words as one topic constituent word group that makes up a topic. This makes it possible to extract, with high accuracy, the words that identify the main topics posted by users, taking into account the connection strength among the words contained in the posted sentence information.

(2) After excluding the words already identified as core words or related words, the control device 103 identifies as core words the two words of the pair with the greatest connection strength among the remaining top frequent word pairs, and repeats this until no top frequent word pair remains from which core words can be identified. As a result, when the posted information contains sentence information on a plurality of topics, core words can be identified for each of those topics.

(3) The control device 103 identifies related words from among the top frequent words excluding those already identified as core words. This prevents a top frequent word already identified as a core word from also being identified as a related word of another topic.

(4) The control device 103 extracts words from the posted information by selecting the nouns, adjectives, and adjectival nouns (形容動詞) it contains. This restricts the identification of topic constituent word groups to words whose parts of speech are likely to make up a topic.

(5) After extracting nouns, adjectives, and adjectival nouns (形容動詞) from the posted information, the control device 103 identifies non-independent words, single digits, and single alphabetic characters among the extracted words as removal target words and removes them. This narrows the words to be processed down to those likely to make up a topic and improves the accuracy with which topic words are extracted. Narrowing down the words to be processed also reduces the processing load on the control device 103 and increases processing speed.

(6) Based on the core words and related words in the identified topic constituent word group, the control device 103 extracts from the posted information, as topic sentence information, the sentence information that matches the topic corresponding to the topic constituent word group. This makes it possible to identify, among the sentence information posted by users, the sentence information that matches the main topics posted by many users.

(7) The control device 103 calculates, as the coverage rate of the topic constituent words, the ratio of the number of distinct core words and related words contained in a piece of sentence information (duplicates excluded) to the total number of core words and related words in the topic constituent word group, and extracts a predetermined number of pieces of sentence information with high coverage rates as topic sentence information. Because sentence information with a high coverage rate of core words and related words better reflects the topic represented by the topic constituent word group, sentence information matching the main topics posted by many users can be extracted from the posted information with high accuracy.

(8) The control device 103 treats as target sentence information the sentence information that contains at least one of the core words and related words in the topic constituent word group. It sets a weight value for each word based on the appearance frequency of each word in the target sentence information and calculates the word frequency score by adding up the weight values of the words contained in each piece of target sentence information. It calculates the noun content rate score based on the number of nouns relative to the number of words in the target sentence information, and the hiragana content rate score based on the number of hiragana characters relative to the number of characters in the target sentence information. It then calculates a score based on the word frequency score, the noun content rate score, and the hiragana content rate score, and extracts a predetermined number of pieces of sentence information with high scores as topic sentence information. By taking into account the frequency-based word weights, the proportion of nouns, and the proportion of hiragana in each sentence, the accuracy of extracting sentence information that matches the main topics can be improved.

(9) The control device 103 calculates the word frequency score for the words that remain after removing non-independent words, single digits, and single alphabetic characters from the nouns, adjectives, and adjectival nouns (形容動詞) contained in the target sentence information. This restricts the calculation of the word frequency score to words whose parts of speech are likely to make up a topic. Narrowing down the words used for the word frequency score also reduces the processing load on the control device 103 and increases processing speed.

(10) The control device 103 generates, for each topic constituent word group, topic sentence extraction result information that lists the topic sentence information, and records it in the recording device 104. This makes it possible to list and record, for each topic constituent word group, what sentence information has been posted on the topic identified from that word group.

(11) The control device 103 executes filtering to remove from the posted information the sentence information that is not needed for identifying topic constituent word groups, and identifies the topic constituent word groups from the filtered posted information. Removing such unneeded sentence information from the posted information in advance improves the accuracy with which topic constituent word groups are identified.

-Modifications-
The topic identification device 100 of the embodiment described above can also be modified as follows.
(1) In the embodiment described above, the control device 103 filters the posted information acquired from the external server and removes, as unneeded sentence information, posts whose content is news, posts that are advertisements, retweeted posts, posts submitted mechanically in large volume (bots), posts unsuited to the purpose of topic identification, posts whose content duplicates other sentences, and posts whose hiragana rate is at or below a predetermined value. However, these removal targets are only examples; if other sentence information exists that is not needed for identifying topic constituent word groups, it may also be made subject to the filtering.

(2) In the embodiment described above, the control device 103 executes filtering to remove from the posted information the sentence information that is not needed for identifying topic constituent word groups, and identifies the topic constituent word groups from the filtered posted information. However, the topic constituent word groups may instead be identified using the posted information acquired from the external server as it is, without filtering.

(3) In the embodiment described above, the control device 103 calculates the appearance frequency of each word extracted from the posted information and identifies the 10 words with the highest appearance frequencies as the top frequent words. However, the number of top frequent words is not limited to 10, and the operator of the topic identification device 100 may be allowed to change this number freely.
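The configurable count described in (3) can be pictured with a short sketch. It assumes the words have already been extracted into a flat list; the function name and the parameter `n` are hypothetical, and ties are resolved in whatever order `Counter.most_common` yields, which the source does not address.

```python
from collections import Counter


def top_frequent_words(words, n=10):
    """Return the `n` most frequent words, most frequent first.

    `n` defaults to 10 as in the embodiment, but the operator could
    supply any other value.
    """
    return [word for word, _count in Counter(words).most_common(n)]


# Hypothetical extracted words for illustration only.
extracted = ["a", "b", "a", "c", "b", "a"]
top_two = top_frequent_words(extracted, n=2)
```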

(4) In the embodiment described above, the control device 103 calculates the connection strength between two words by converting the 45 index values R(A,B) calculated with equation (1) into deviation values. However, the index values R(A,B) calculated with equation (1) may be used directly as the connection strength between two words, or an evaluation value other than the deviation value may be calculated from the index values R(A,B) and used as the connection strength between two words.
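The 45 index values correspond to the number of unordered pairs that can be formed from 10 top frequent words (10 × 9 / 2 = 45). The conversion to deviation values might look like the sketch below. It assumes the conventional definition of a deviation value, 50 + 10 × (x − mean) / standard deviation, with the population standard deviation; this section does not state the exact formula, so treat that as an assumption. Equation (1) itself is not reproduced here, so the index values are taken as given inputs.

```python
from statistics import mean, pstdev


def to_deviation_values(index_values):
    """Convert {word pair: index value R} into {word pair: deviation value}."""
    values = list(index_values.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed)
    if sigma == 0:
        # All index values are identical, so every pair sits at the mean.
        return {pair: 50.0 for pair in index_values}
    return {
        pair: 50.0 + 10.0 * (r - mu) / sigma
        for pair, r in index_values.items()
    }


# Hypothetical index values for three word pairs (not taken from the patent).
index_values = {("A", "B"): 1.0, ("A", "C"): 2.0, ("B", "C"): 3.0}
deviation_values = to_deviation_values(index_values)
```

Under this definition a pair whose index value equals the mean gets a deviation value of exactly 50, so a threshold slightly below 50 keeps pairs that are around average or stronger.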

(5) In the embodiment described above, the deviation value threshold for extracting top frequent word pairs and the deviation value threshold for extracting related words are both 49.0. However, this threshold is not limited to 49.0. If experimental or operational results reveal a threshold that improves the accuracy of extracting top frequent word pairs or related words, that value may be set as the threshold. The threshold for extracting top frequent word pairs and the threshold for extracting related words may share a common value or may be individually configurable.

(6) In the embodiment described above, the control device 103 extracts from the posted information, for each topic constituent word group, the sentence information that matches the topic represented by that topic constituent word group, based on both the coverage rate and the score. However, the control device 103 may perform this extraction based on only one of the coverage rate and the score.

(7) In the embodiment described above, the control device 103 performs morphological analysis on the sentence information contained in the posted information, extracts nouns, adjectives, and adjectival verbs from the posted information, and then identifies and removes non-independent words, single numeric characters, and single alphabetic characters from the extracted words as removal target words, thereby extracting the words to be processed when identifying the topic constituent words. In addition, when calculating the word frequency score, the control device 103 takes as target sentence information the sentence information in the posted information that contains at least one of the core words and related words in the topic constituent word group, extracts nouns, adjectives, and adjectival verbs from the target sentence information, removes non-independent words, single numeric characters, and single alphabetic characters from the extracted words, and calculates the appearance frequency of each remaining word across all of the target sentence information. However, in these processes the control device 103 may extract only the nouns, adjectives, and adjectival verbs and omit the removal of non-independent words, single numeric characters, and single alphabetic characters from the extracted words.

The present invention is in no way limited to the configurations of the embodiment described above, as long as the characteristic functions of the present invention are not impaired. Configurations that combine the embodiment described above with a plurality of the modifications are also possible.

100 Topic identification device
101 Operation member
102 Connection IF
103 Control device
104 Recording device

Claims

A topic identification device comprising:
posted information acquisition means for acquiring posted information in which sentence information posted by users is accumulated;
word extraction means for analyzing the posted information acquired by the posted information acquisition means and extracting the words contained in the posted information;
appearance frequency calculation means for calculating, for each word extracted by the word extraction means, its appearance frequency in the posted information;
top frequent word specifying means for specifying, based on the appearance frequency of each word calculated by the appearance frequency calculation means, a preset predetermined number of words with the highest appearance frequencies as top frequent words;
connection strength calculation means for calculating, for each of the top frequent words specified by the top frequent word specifying means, a connection strength between two words that indicates how strongly that word is connected to each of the other top frequent words;
top frequent word pair extraction means for extracting, as top frequent word pairs, combinations of top frequent words whose connection strength calculated by the connection strength calculation means is equal to or greater than a preset predetermined threshold;
core word specifying means for specifying the two words that make up the top frequent word pair with the largest connection strength, among the top frequent word pairs extracted by the top frequent word pair extraction means, as the words forming the core of a topic contained in the posted information (hereinafter referred to as "core words");
related word specifying means for specifying, for each of the two core words specified by the core word specifying means, the top frequent words whose connection strength calculated by the connection strength calculation means is equal to or greater than the predetermined threshold as related words related to that core word; and
topic constituent word group specifying means for specifying the two core words specified by the core word specifying means and the related words specified by the related word specifying means as one topic constituent word group that makes up the topic.

The topic identification device according to claim 1,
wherein the core word specifying means excludes top frequent word pairs that contain a word already specified as a core word or a related word, specifies as core words the two words that make up the top frequent word pair with the largest connection strength among the remaining top frequent word pairs, and repeats this until no top frequent word pair from which core words can be specified remains.

The topic identification device according to claim 2,
wherein the related word specifying means specifies the related words from among the top frequent words excluding words already specified as core words.
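The grouping described in the three claims above can be pictured with the following sketch. It is one possible reading, not the claimed implementation: the function name, the data layout (a dict keyed by `frozenset` word pairs), and the example strengths are hypothetical. In particular, the sketch excludes only core words when collecting related words, so a word may appear as a related word in more than one group; whether that is intended is an assumption.

```python
def identify_topic_groups(strength, threshold=49.0):
    """Greedily build topic constituent word groups.

    `strength` maps frozenset({word_a, word_b}) to a connection strength.
    Returns a list of (core_pair, related_words) tuples.
    """
    # Pairs at or above the threshold are the top frequent word pairs.
    pairs = {p: s for p, s in strength.items() if s >= threshold}
    used = set()        # words already specified as core or related words
    core_words = set()  # words already specified as core words
    groups = []
    while True:
        # Exclude pairs containing an already specified word.
        remaining = {p: s for p, s in pairs.items() if not (p & used)}
        if not remaining:
            break
        # The strongest remaining pair supplies the two core words.
        core_pair = max(remaining, key=remaining.get)
        core_words |= core_pair
        related = set()
        for core in core_pair:
            for pair in pairs:
                if core in pair:
                    (other,) = pair - {core}
                    if other not in core_words:
                        related.add(other)
        groups.append((tuple(sorted(core_pair)), sorted(related)))
        used |= core_pair | related
    return groups


# Hypothetical connection strengths (deviation values) for illustration only.
strength = {
    frozenset({"A", "B"}): 70.0,
    frozenset({"A", "C"}): 55.0,
    frozenset({"B", "D"}): 52.0,
    frozenset({"C", "E"}): 60.0,
    frozenset({"E", "F"}): 65.0,
    frozenset({"D", "E"}): 40.0,
}
groups = identify_topic_groups(strength)
```

With these values the first group has core words A and B with related words C and D. The pair C-E is then skipped because C is already in use, and the second group is built from E and F, with C attached again as a related word of E.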

The topic identification device according to any one of claims 1 to 3,
wherein the word extraction means extracts words from the posted information by selecting the nouns, adjectives, and adjectival verbs contained in the posted information.

The topic identification device according to claim 4,
wherein the word extraction means removes non-independent words, single numeric characters, and single alphabetic characters from the words extracted from the posted information.

The topic identification device according to any one of claims 1 to 5,
further comprising topic sentence information extraction means for extracting from the posted information, as topic sentence information, sentence information that matches the topic corresponding to a topic constituent word group, based on the core words and related words contained in the topic constituent word group specified by the topic constituent word group specifying means.

The topic identification device according to claim 6,
wherein the topic sentence information extraction means calculates, as a coverage rate of topic constituent words, the ratio of the total number of core words and related words contained in the sentence information, counted without duplicates, to the total number of core words and related words contained in the topic constituent word group specified by the topic constituent word group specifying means, and extracts a predetermined number of pieces of sentence information with the highest coverage rates as the topic sentence information.
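The coverage rate in the claim above can be illustrated with a short sketch. It assumes each piece of sentence information has already been split into words; the function names and the parameter `k` (standing in for the "predetermined number") are hypothetical.

```python
def coverage_rate(sentence_words, topic_words):
    """Share of the topic constituent words that occur in the sentence.

    Repeated occurrences in the sentence count once, so the numerator is
    the number of distinct core and related words the sentence contains.
    """
    topic = set(topic_words)
    if not topic:
        return 0.0
    return len(topic & set(sentence_words)) / len(topic)


def top_sentences_by_coverage(tokenized_sentences, topic_words, k):
    """Return the `k` tokenized sentences with the highest coverage rate."""
    ranked = sorted(
        tokenized_sentences,
        key=lambda words: coverage_rate(words, topic_words),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical topic constituent word group and sentences.
topic_words = ["A", "B", "C", "D"]
rate = coverage_rate(["A", "A", "C", "x"], topic_words)
best = top_sentences_by_coverage([["A"], ["A", "B", "C"], ["x"]], topic_words, k=1)
```

Here `rate` is 0.5, because the sentence contains two of the four topic constituent words once duplicates are ignored.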

The topic identification device according to claim 6 or 7,
wherein the topic sentence information extraction means: takes, as target sentence information, sentence information containing at least one of the core words and related words contained in the topic constituent word group specified by the topic constituent word group specifying means; sets a weight value for each word based on the appearance frequency of each word contained in the target sentence information, and calculates a word frequency score by adding up the weight values set for the words contained in the target sentence information; calculates a noun content rate score for the target sentence information based on the number of nouns contained in the target sentence information relative to the number of words it contains; calculates a hiragana content rate score for the target sentence information based on the number of hiragana characters contained in the target sentence information relative to the number of characters it contains; calculates a sentence information extraction score based on the word frequency score, the noun content rate score, and the hiragana content rate score; and extracts a predetermined number of pieces of sentence information with the highest sentence information extraction scores as the topic sentence information.

The topic identification device according to claim 8,
wherein the topic sentence information extraction means calculates the word frequency score over the nouns, adjectives, and adjectival verbs contained in the target sentence information.

The topic identification device according to claim 9,
wherein the topic sentence information extraction means calculates the word frequency score after removing non-independent words, single numeric characters, and single alphabetic characters from the nouns, adjectives, and adjectival verbs contained in the target sentence information.

The topic identification device according to any one of claims 6 to 10, further comprising:
topic sentence extraction result information generation means for generating topic sentence extraction result information that lists, for each topic constituent word group specified by the topic constituent word group specifying means, the topic sentence information extracted by the topic sentence information extraction means; and
topic sentence extraction result information recording means for recording the topic sentence extraction result information generated by the topic sentence extraction result information generation means in a recording device.

The topic identification device according to any one of claims 1 to 11,
further comprising filtering means for executing a filtering process that removes, from the posted information, sentence information that is not needed for specifying the topic constituent word group.

A topic identification method for causing a computer to execute:
a posted information acquisition procedure for acquiring posted information in which sentence information posted by users is accumulated;
a word extraction procedure for analyzing the posted information acquired in the posted information acquisition procedure and extracting the words contained in the posted information;
an appearance frequency calculation procedure for calculating, for each word extracted in the word extraction procedure, its appearance frequency in the posted information;
a top frequent word specifying procedure for specifying, based on the appearance frequency of each word calculated in the appearance frequency calculation procedure, a preset predetermined number of words with the highest appearance frequencies as top frequent words;
a connection strength calculation procedure for calculating, for each of the top frequent words specified in the top frequent word specifying procedure, a connection strength between two words that indicates how strongly that word is connected to each of the other top frequent words;
a top frequent word pair extraction procedure for extracting, as top frequent word pairs, combinations of top frequent words whose connection strength calculated in the connection strength calculation procedure is equal to or greater than a preset predetermined threshold;
a core word specifying procedure for specifying the two words that make up the top frequent word pair with the largest connection strength, among the top frequent word pairs extracted in the top frequent word pair extraction procedure, as the words forming the core of a topic contained in the posted information (hereinafter referred to as "core words");
a related word specifying procedure for specifying, for each of the two core words specified in the core word specifying procedure, the top frequent words whose connection strength calculated in the connection strength calculation procedure is equal to or greater than the predetermined threshold as related words related to that core word; and
a topic constituent word group specifying procedure for specifying the two core words specified in the core word specifying procedure and the related words specified in the related word specifying procedure as one topic constituent word group that makes up the topic.

The topic identification method according to claim 13,
wherein the core word specifying procedure excludes top frequent word pairs that contain a word already specified as a core word or a related word, specifies as core words the two words that make up the top frequent word pair with the largest connection strength among the remaining top frequent word pairs, and repeats this until no top frequent word pair from which core words can be specified remains.

The topic identification method according to claim 14,
wherein the related word specifying procedure specifies the related words from among the top frequent words excluding words already specified as core words.

The topic identification method according to any one of claims 13 to 15,
wherein the word extraction procedure extracts words from the posted information by selecting the nouns, adjectives, and adjectival verbs contained in the posted information.

The topic identification method according to claim 16,
wherein the word extraction procedure removes non-independent words, single numeric characters, and single alphabetic characters from the words extracted from the posted information.

The topic identification method according to any one of claims 13 to 17,
further comprising a topic sentence information extraction procedure for extracting from the posted information, as topic sentence information, sentence information that matches the topic corresponding to a topic constituent word group, based on the core words and related words contained in the topic constituent word group specified in the topic constituent word group specifying procedure.

The topic identification method according to claim 18,
wherein the topic sentence information extraction procedure calculates, as a coverage rate of topic constituent words, the ratio of the total number of core words and related words contained in the sentence information, counted without duplicates, to the total number of core words and related words contained in the topic constituent word group specified in the topic constituent word group specifying procedure, and extracts a predetermined number of pieces of sentence information with the highest coverage rates as the topic sentence information.

The topic identification method according to claim 18 or 19,
wherein the topic sentence information extraction procedure: takes, as target sentence information, sentence information containing at least one of the core words and related words contained in the topic constituent word group specified in the topic constituent word group specifying procedure; sets a weight value for each word based on the appearance frequency of each word contained in the target sentence information, and calculates a word frequency score by adding up the weight values set for the words contained in the target sentence information; calculates a noun content rate score for the target sentence information based on the number of nouns contained in the target sentence information relative to the number of words it contains; calculates a hiragana content rate score for the target sentence information based on the number of hiragana characters contained in the target sentence information relative to the number of characters it contains; calculates a sentence information extraction score based on the word frequency score, the noun content rate score, and the hiragana content rate score; and extracts a predetermined number of pieces of sentence information with the highest sentence information extraction scores as the topic sentence information.

The topic identification method according to claim 20,
wherein the topic sentence information extraction procedure calculates the word frequency score over the nouns, adjectives, and adjectival verbs contained in the target sentence information.

The topic identification method according to claim 21,
wherein the topic sentence information extraction procedure calculates the word frequency score after removing non-independent words, single numeric characters, and single alphabetic characters from the nouns, adjectives, and adjectival verbs contained in the target sentence information.

The topic identification method according to any one of claims 18 to 22, further comprising:
a topic sentence extraction result information generation procedure for generating topic sentence extraction result information that lists, for each topic constituent word group specified in the topic constituent word group specifying procedure, the topic sentence information extracted in the topic sentence information extraction procedure; and
a topic sentence extraction result information recording procedure for recording the topic sentence extraction result information generated in the topic sentence extraction result information generation procedure in a recording device.

The topic identification method according to any one of claims 13 to 23,
further comprising a filtering procedure for executing a filtering process that removes, from the posted information, sentence information that is not needed for specifying the topic constituent word group.