JP6260678B2

JP6260678B2 - Information processing apparatus, information processing method, and information processing program

Info

Publication number: JP6260678B2
Application number: JP2016236549A
Authority: JP
Inventors: 一郎宍戸; 良子 ▲つじ▼
Original assignee: JVCKenwood Corp
Current assignee: JVCKenwood Corp
Priority date: 2016-12-06
Filing date: 2016-12-06
Publication date: 2018-01-17
Anticipated expiration: 2033-03-29
Also published as: JP2017068862A

Description

The present invention relates to a technique for analyzing data.

In recent years, with the spread of the Internet, a growing number of services, such as Internet bulletin boards and social network services (SNS), allow users to easily upload and publish text such as word-of-mouth information. Grasping such word-of-mouth information on the Internet is also attracting attention from the viewpoint of corporate marketing strategy and the like.
However, text uploaded to the Internet by individual users contains many abbreviated words and spelling variations, which makes it difficult to quickly find appropriate keywords in such text. One technique that addresses this problem is described in Japanese Patent Application Laid-Open No. 2011-3157 (Patent Document 1).

Patent Document 1: JP 2011-3157 A

Patent Document 1 describes a technique that analyzes text data, identifies items that are products or services, and summarizes users' word-of-mouth information for each item. However, the accuracy of determining which item the text data under analysis corresponds to was not always sufficient. For example, when the subject of the description is music or a movie, names are extremely diverse and the character strings representing them follow no clear regularity, so the accuracy of identifying the described item was sometimes insufficient. As a result, the item described in the text data could sometimes not be identified, or an item different from the one actually described was identified.

The present invention has been made in view of these problems, and its object is to accurately identify the information that is the subject of description in text data.

To solve the above-described problems of the prior art, the present invention provides an information processing apparatus comprising: a search unit that searches a database and acquires a plurality of pieces of information corresponding to a search condition; a similarity calculation unit that calculates a score based on the similarity between the plurality of pieces of information; and a validity determination unit that determines the validity of the search condition based on the score.
The present invention also provides an information processing apparatus comprising: a search unit that searches a database using a plurality of search conditions and acquires a plurality of search result sets, each consisting of a plurality of pieces of information and corresponding to a respective one of the plurality of search conditions; a similarity calculation unit that calculates a plurality of scores, one for each of the plurality of search result sets, based on the similarity between the plurality of pieces of information included in each search result set; and a validity determination unit that preferentially outputs information included in a search result set having a high score.
The present invention also provides an information processing method executed by one or more computers, comprising: a search step of searching a database and acquiring a plurality of pieces of information corresponding to a search condition; a similarity calculation step of calculating a score based on the similarity between the plurality of pieces of information acquired in the search step; and a validity determination step of determining the validity of the search condition based on the score calculated in the similarity calculation step.
The present invention also provides an information processing method executed by one or more computers, comprising: a search step of searching a database using a plurality of search conditions and acquiring a plurality of search result sets, each consisting of a plurality of pieces of information and corresponding to a respective one of the plurality of search conditions; a similarity calculation step of calculating a plurality of scores, one for each of the plurality of search result sets, based on the similarity between the plurality of pieces of information included in each search result set acquired in the search step; and a validity determination step of preferentially outputting information included in a search result set for which the score calculated in the similarity calculation step is high.
The present invention also provides an information processing program that causes one or more computers to function as: a search unit that searches a database and acquires a plurality of pieces of information corresponding to a search condition; a similarity calculation unit that calculates a score based on the similarity between the plurality of pieces of information acquired by the search unit; and a validity determination unit that determines the validity of the search condition based on the score calculated by the similarity calculation unit.
The present invention also provides an information processing program that causes one or more computers to function as: a search unit that searches a database using a plurality of search conditions and acquires search result sets, each consisting of a plurality of pieces of information and corresponding to a respective one of the plurality of search conditions; a similarity calculation unit that calculates a plurality of scores, one for each of the plurality of search result sets, based on the similarity between the plurality of pieces of information included in each search result set acquired by the search unit; and a validity determination unit that preferentially outputs information included in a search result set for which the score calculated by the similarity calculation unit is high.

According to the present invention, the information that is the subject of description in text data can be identified accurately.

A block diagram showing the overall configuration in each embodiment.
A diagram showing an example of article text (text data).
A flowchart showing the operation of the text information processing apparatus 1 of the first embodiment.
A diagram for explaining an example of a method of extracting keywords from article text.
A diagram showing an example of the data stored in the text data storage unit 6.
A diagram showing an example of the data stored in the item database 3.
A diagram showing an example of the data stored in the keyword group storage unit 5.
A diagram showing an example of the data stored in the score storage unit 7.
A diagram showing an example of the data stored in the item calculation result storage unit 8.
A diagram showing an example of the data stored in the item ranking information storage unit 9.
A flowchart showing the operation of the text information processing apparatus 1 of the second embodiment.
A flowchart showing the operation of the text information processing apparatus 1 of the second embodiment.

A text information processing apparatus, a text information processing method, and a text information processing program according to the present invention are described below with reference to the drawings. In the drawings, identical components are denoted by the same reference numerals.
Items in the following description may be content such as audio, music, video, or web pages, may be various goods, or may be information on financial products, real estate, persons, and the like. Items may be tangible or intangible, and paid or free.

<First Embodiment>
FIG. 1 is a block diagram showing a configuration example of the entire system including the text information processing apparatus 1 of the first embodiment.
The system includes the text information processing apparatus 1, a text data server (blog server) 2, an item database (item data server) 3, user terminal devices 4, and the like, each of which can communicate via a network 20. The text information processing apparatus 1 is, for example, a server.
The text data server 2 stores text data, and the item database 3 stores information about items.
In the following description, blog data is used as an example of the text data processed by the text information processing apparatus 1. Blog data includes text data created by users, for example text data (blog articles) that a user created using a social network service. Examples of social network services include Twitter (registered trademark), Facebook (registered trademark), and mixi (registered trademark).

Although the text data server 2 and the item database 3 are each described as separate entities, some or all of them may be configured as the same entity as the text information processing apparatus 1.

The text information processing apparatus 1 has four processing units: a text data collection unit 10, a keyword set generation unit 11, an item specifying unit 12, and a ranking information creation unit 13. These four processing units may be integrated or separate. They may be implemented with a single CPU or DSP, or with a plurality of CPUs, DSPs, and the like.
The text information processing apparatus 1 also has a keyword group storage unit 5, a text data storage unit 6, a score storage unit 7, an item calculation result storage unit 8, and an item ranking information storage unit 9. These five storage units may be integrated or separate. They may be implemented with a single hard disk drive (HDD), flash memory, or the like, or with a plurality of HDDs, flash memories, and the like.

The text data collection unit 10 acquires, from the text data server 2 that stores text data, article text (text data) such as blog posts together with attribute information such as a user identifier indicating the author and the article creation/update date, and saves them in the text data storage unit 6. The user identifier identifies a user, or a terminal device, involved in creating the text data. The text data storage unit 6 is not strictly necessary; the text data server 2 may also serve as the text data storage unit 6.

The keyword set generation unit 11 has an unnecessary character string processing unit 14, a keyword extraction unit 15, and a grouping processing unit 16. Its role is to extract keywords for identifying items from the text data acquired by the text data collection unit 10 and to generate keyword groups (search keys). As described in detail later, searches are performed using these keyword groups.
The unnecessary character string processing unit 14 of the keyword set generation unit 11 generates text data from which unnecessary information unrelated to item information has been removed. Such unnecessary information includes, for example, document link information and meta tags. The processing in the unnecessary character string processing unit 14 is described in detail later.

The keyword extraction unit 15 of the keyword set generation unit 11 extracts keywords from the text data processed by the unnecessary character string processing unit 14.
The grouping processing unit 16 of the keyword set generation unit 11 groups the one or more keywords extracted by the keyword extraction unit 15 and saves the resulting keyword group, which is a set of one or more keywords, in the keyword group storage unit 5. The set is referred to as a keyword group even when it contains only one keyword.

The item specifying unit 12 has a search unit 17, a similarity calculation unit 18, and a validity determination unit 19. Its role is to search the item database 3 for item information using the keyword groups generated by the keyword set generation unit 11, and to determine the validity of the keywords from the similarity between the plurality of pieces of item information obtained as the search result.

The search unit 17 of the item specifying unit 12 searches the item database 3 using a keyword group generated by the keyword set generation unit 11. When a search result set consisting of a plurality of pieces of item information is obtained, the similarity calculation unit 18 of the item specifying unit 12 calculates the similarity between those pieces of item information. Using these similarities, the similarity calculation unit 18 then obtains, for each keyword group, a score for the search result set by a calculation formula described later, and records the score in the score storage unit 7.

The validity determination unit 19 of the item specifying unit 12 compares the score calculated by the similarity calculation unit 18 with a threshold θ to determine the validity of the keyword group used to search the item database 3. It then identifies the item related to the article text (text data) using the search result set corresponding to a keyword group determined to be valid. The validity determination unit 19 associates the identified item (item identifier) with the blog identifier of the text data from which the keyword group was extracted, and records them in the item calculation result storage unit 8. When a plurality of keyword groups are determined to be valid, the item may be identified using the search result set corresponding to the keyword group with the highest score, or using all of the search result sets corresponding to those keyword groups.
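The flow from search result sets through similarity-based scores to the threshold comparison can be sketched as follows. This is only an illustration under stated assumptions, not the calculation formula of the embodiment, which is described later: the Jaccard similarity over attribute sets, the mean of pairwise similarities as the score, the use of "greater than or equal to" for the threshold comparison, and the names `jaccard`, `result_set_score`, and `valid_groups` are all hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Assumed similarity between two pieces of item information (attribute sets)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def result_set_score(items: list) -> float:
    """Assumed score: mean pairwise similarity within one search result set."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0  # fewer than two items: nothing to compare
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def valid_groups(result_sets: dict, theta: float) -> list:
    """Return (keyword group, score) pairs whose score reaches theta, best first."""
    scored = [(group, result_set_score(items)) for group, items in result_sets.items()]
    passed = [(group, score) for group, score in scored if score >= theta]
    return sorted(passed, key=lambda pair: pair[1], reverse=True)


# Hypothetical search results: keyword group -> item information retrieved for it.
result_sets = {
    "title artist": [{"artist:A", "song:X"}, {"artist:A", "song:X", "live"}],
    "today": [{"artist:B", "song:Y"}, {"artist:C", "song:Z"}],
}
ranked = valid_groups(result_sets, theta=0.5)
print(ranked)
```

In this sketch a keyword group whose results resemble one another passes the threshold, while a generic keyword that retrieves unrelated items does not. Taking `ranked[0]` corresponds to using only the highest-scoring keyword group, and iterating over all of `ranked` corresponds to using every valid group.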

The ranking information creation unit 13 ranks items based on their number of appearances, calculated from the data in the item calculation result storage unit 8, and records the ranking in the item ranking information storage unit 9. The information that is the subject of description in text data can be identified accurately even without the ranking information creation unit 13, but providing it allows the analysis results of the text information processing apparatus 1 to be output in a more useful form.
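Ranking by number of appearances can be sketched with a simple counter. The row layout (blog identifier, item identifier), the "ItemID_n" identifier format, and the function name `rank_items` are assumptions made for illustration; the actual table layouts are those shown in the drawings.

```python
from collections import Counter

# Hypothetical rows of the item calculation result storage unit 8:
# (blog identifier, item identifier identified for that article).
item_results = [
    ("BlogID_1", "ItemID_7"),
    ("BlogID_2", "ItemID_3"),
    ("BlogID_3", "ItemID_7"),
    ("BlogID_4", "ItemID_7"),
    ("BlogID_5", "ItemID_3"),
    ("BlogID_6", "ItemID_9"),
]


def rank_items(rows):
    """Count how often each item appears and return (rank, item, count) tuples."""
    counts = Counter(item_id for _, item_id in rows)
    # most_common() orders items by count, largest first.
    return [
        (rank, item_id, count)
        for rank, (item_id, count) in enumerate(counts.most_common(), start=1)
    ]


ranking = rank_items(item_results)
for rank, item_id, count in ranking:
    print(rank, item_id, count)
```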

The text information processing apparatus 1 may be configured using a general-purpose computer having a CPU (Central Processing Unit), RAM (Random Access Memory), ROM (Read Only Memory), an HDD (Hard Disk Drive), a network interface, and the like. That is, a computer may be made to function as the text information processing apparatus 1 by executing a program that performs the processing described later.

The text information processing apparatus 1 may also be configured using a plurality of computers. For example, to distribute load, distributed processing may be performed using a plurality of computers that each correspond to a given processing block of the text information processing apparatus 1, that is, a plurality of computers having the same processing block. Distributed processing may also be performed by implementing some processing blocks of the text information processing apparatus 1 on one computer and other processing blocks on another computer.

The specific processing of the text information processing apparatus 1 is described in detail below using the example of text data shown in FIG. 2, the flowchart shown in FIG. 3, and the data configuration diagrams shown in FIGS. 4 to 9.

The following describes an example in which item information indicating a piece of music is identified based on article text (text data) about that music, and ranking information is created based on the identified item information. As noted above, items are not limited to music and may be various kinds of content, goods, and services.

FIG. 3 shows the processing flow of the text information processing apparatus 1.
In step S1, the text data collection unit 10 acquires text data from the text data server 2 and stores the acquired text data in the text data storage unit 6.

Specifically, the text data collection unit 10 sends a predetermined request command to the text data server 2 and thereby receives (acquires) blog data including a user identifier, article text (text data), an article creation/update date, and the like. The received data is stored in the text data table of the text data storage unit 6.
At this time, the text data collection unit 10 assigns one piece of identification information (a blog identifier) to each article text. FIG. 5 shows an example of the storage format of the text data table. The blog identifier, user identifier, article text, and article creation/update date (for example, the upload date and time) are stored in association with one another. For example, a blog identifier is assigned to each piece of text data that a user sent to the text data server 2 in a single upload.

In this embodiment, a blog identifier is written as the character string "BlogID", followed by an underscore (_) and a serial number assigned in order of article creation/update date. Alternatively, it may be the user ID followed by a serial number, or the article acquisition date and time followed by a serial number. It suffices that each piece of blog data can be uniquely identified. When the text data server 2 holds blog identifiers (or data equivalent to blog identifiers) and the text data collection unit 10 receives (acquires) that data, the text data collection unit 10 may skip assigning blog identifiers and use the received ones.
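The identifier scheme and the columns of the text data table can be sketched as follows. The record type `BlogRecord`, the function `assign_blog_ids`, the "UserID_n" values, and the choice to start numbering at 1 are illustrative assumptions; only the "BlogID" + underscore + serial number format and the ordering by creation/update date come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class BlogRecord:
    """One row of the text data table (columns as described for FIG. 5)."""
    blog_id: str
    user_id: str
    article_text: str
    updated_at: datetime


def assign_blog_ids(rows):
    """rows: iterable of (user_id, article_text, updated_at) tuples.

    Serial numbers are assigned in order of article creation/update date.
    """
    ordered = sorted(rows, key=lambda row: row[2])
    return [
        BlogRecord(f"BlogID_{n}", user_id, text, updated_at)
        for n, (user_id, text, updated_at) in enumerate(ordered, start=1)
    ]


# Hypothetical received blog data, deliberately out of date order.
rows = [
    ("UserID_2", "second article", datetime(2016, 12, 2, 9, 0)),
    ("UserID_1", "first article", datetime(2016, 12, 1, 8, 30)),
]
records = assign_blog_ids(rows)
for record in records:
    print(record.blog_id, record.user_id, record.article_text)
```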

When reading blog data, the required range (period) of article creation/update dates may be specified in the request command so that only the corresponding data is acquired. Similarly, the required user identifier may be specified in the request command so that only that user's article data is acquired. The request command may also include a search expression for character strings so that only blog data whose article text contains a specific character string pattern is acquired.
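The three kinds of conditions (date range, user identifier, character string pattern) can be illustrated with a local filter. The format of the actual request command is not specified here, so this sketch only models the selection logic; the function `select_articles`, its parameters, and the dictionary keys are hypothetical, and a regular expression stands in for the search expression.

```python
import re
from datetime import date


def select_articles(articles, start=None, end=None, user_id=None, pattern=None):
    """Keep only the articles that satisfy every condition that was given."""
    selected = []
    for article in articles:
        if start is not None and article["updated"] < start:
            continue
        if end is not None and article["updated"] > end:
            continue
        if user_id is not None and article["user_id"] != user_id:
            continue
        if pattern is not None and re.search(pattern, article["text"]) is None:
            continue
        selected.append(article)
    return selected


# Hypothetical blog data held by the text data server.
articles = [
    {"user_id": "UserID_1", "text": "Song X #NowPlaying", "updated": date(2016, 12, 1)},
    {"user_id": "UserID_2", "text": "Lunch was great", "updated": date(2016, 12, 2)},
    {"user_id": "UserID_1", "text": "Song Y #NowPlaying", "updated": date(2016, 12, 5)},
]
hits = select_articles(articles, start=date(2016, 12, 2), pattern=r"#NowPlaying")
print([a["text"] for a in hits])
```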

(Operation of the keyword set generation unit 11)
Returning to FIG. 3, in steps S2 to S5, the keyword set generation unit 11 executes keyword set generation processing.

First, in step S2, the keyword set generation unit 11 reads (acquires) the text data for each blog identifier from the text data table of the text data storage unit 6. The subsequent processing is performed on each piece of text data.

In step S3, the unnecessary character string processing unit 14 replaces, among the characters from the beginning to the end of the text data, any character string that does not help identify an item (referred to as an unnecessary character string FW) with a predetermined delimiter K. The delimiter K should be a symbol (or a combination of symbols) that is unlikely to appear in article text, for example "\\". Unnecessary character strings may instead be deleted without replacement, or replaced with whitespace characters (for example, spaces or tabs), but replacing them with the delimiter K is preferable because it helps in cutting out the character strings used to identify items. The delimiter K need not always be the same symbol and may be changed as appropriate according to the text data. For example, the delimiter may be changed according to the language or character type of the text data.

The processing performed by the unnecessary character string processing unit 14 in step S3 is now described in detail with reference to FIGS. 2 and 5.
FIG. 2 shows an example of article text (text data) used to identify item information. In the example of FIG. 2, the text data contains, between its beginning S and its end E, one or more normal character strings W, a specific symbol TK, and an unnecessary character string FW. However, a specific symbol TK and an unnecessary character string FW are not necessarily present, and there may be more than one of each. The keyword extraction unit 15 extracts the normal character strings W, that is, the strings other than the specific symbols TK and the unnecessary character strings FW; the extraction method is described later. A single character is also referred to as a character string. A normal character string W is a character string that may help identify an item, for example a character string in the text data other than a specific symbol TK or an unnecessary character string FW.

FIG. 5 shows an example of the data (text data table) stored in the text data storage unit 6. As shown in FIG. 5, the text data table stores, in association with one another, the article text, the blog identifier assigned to the article text, the user identifier indicating the user who uploaded the article text, and the article creation/update date indicating when the article text was uploaded. As the article texts in FIG. 5 show, the words and forms of expression used in user-created text such as blogs are extremely diverse.

In general, character strings that help identify an item are mixed with unnecessary character strings. In the example shown in FIG. 5, "#NowPlaying" is a character string customarily used to indicate that an article relates to the playback of music or video content. Because it is the same character string in articles about any item, it does not help identify an item and is therefore an unnecessary character string FW.

Text from services to which relatively short article texts are commonly uploaded (microblog services), such as Twitter, frequently contains URLs (Uniform Resource Locators) that link to other sites. Because such URL strings often contain no item name or the like and therefore do not help identify an item, URL strings beginning with "http://" and the like are treated as unnecessary character strings. Since shortened URLs in particular often contain no item name, only shortened URLs may be treated as unnecessary character strings FW.

Meta tags (character strings enclosed in "<" and ">") and marks such as musical notes (♪), which often contain no item name, are also treated as unnecessary character strings FW. These may be either half-width or full-width characters.
The unnecessary character string processing unit 14 determines whether the text data contains any of the above unnecessary character strings FW by referring to a database that stores a list of unnecessary character strings FW, conditions for character strings to be treated as unnecessary character strings FW, and the like. When an unnecessary character string FW is found, the unnecessary character string processing unit 14 replaces it with the predetermined delimiter K.

Rather than replacing unnecessary character strings with whitespace (for example, a space or a tab), the unnecessary character string processing unit 14 replaces them with a predetermined symbol that is unlikely to appear in blog posts and the like. This allows keywords useful for identifying the item to be extracted accurately.
For example, consider an article text with the pattern "M1: title, M2: blank, M3: URL, M4: blank, M5: artist (family name), M6: blank, M7: artist (given name), M8: #NowPlaying", as shown in FIG. 4(A). If the unnecessary character strings FW (M3: URL and M8: #NowPlaying) are replaced with whitespace, as shown in FIG. 4(B), then when a blank lies between the strings M5 and M7, which are useful for identifying the item, it becomes difficult to decide whether M5 and M7 should be treated as a single keyword.

In other words, extracting string M5 (artist family name) and string M7 (artist given name) as a single keyword is better for identifying the item, but such merging is difficult once unnecessary character strings FW have been replaced with blanks.
In contrast, if the unnecessary character strings FW (M3: URL and M8: #NowPlaying) are replaced with the delimiter K ("\\" in this example), as shown in FIG. 4(C), the text data can simply be split at the delimiter K while blanks are ignored. Strings M5 and M7 can then be merged and handled as one keyword, and the item can be identified more accurately. In this example each unnecessary character string FW is replaced with a single "\\" regardless of its length, but each character of the unnecessary character string FW may instead be replaced with "\\".
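The replacement described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the regular expressions are hypothetical stand-ins for the list and conditions that the unnecessary character string processing unit 14 would read from its database, and the delimiter K is assumed to be the two-character string "\\" shown in the example.

```python
import re

# Delimiter K: assumed here to be two backslashes, as in the FIG. 4(C) example.
DELIMITER = "\\\\"

# Hypothetical patterns for unnecessary character strings FW.
UNNECESSARY_PATTERNS = [
    r"https?://\S+",          # URLs
    r"[#＃]NowPlaying",        # conventional hashtag (half- or full-width '#')
    r"[<＜][^<>＜＞]*[>＞]",     # meta tags
    r"♪",                     # note mark
]
FW_RE = re.compile("|".join(UNNECESSARY_PATTERNS), re.IGNORECASE)


def replace_unnecessary(text, delimiter=DELIMITER):
    """Replace every unnecessary string FW with one delimiter K."""
    # A function is used as the replacement so that the backslashes in the
    # delimiter are inserted literally rather than parsed as escapes.
    return FW_RE.sub(lambda match: delimiter, text)


def split_regions(text, delimiter=DELIMITER):
    """Split at delimiter K; blanks inside a region are kept."""
    return [region.strip() for region in text.split(delimiter) if region.strip()]
```

With this sketch, a post such as "Title http://example.com/x Yamada Taro #NowPlaying" is split at the delimiters only, so the blank between the family name and the given name stays inside one region.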

The excluded characters, punctuation marks, and the like described in Patent Document 1 can also be treated as unnecessary character strings FW. The excluded characters described in Patent Document 1 are, for example, 「の」, 「が」, 「い」, and 「く」.

Next, the specific symbol is described. In the text data targeted by this embodiment, which relates to music being played, there is no fixed rule for the order or format in which the song title and artist name are written. However, as the text data in FIG. 2 and FIG. 5 show, a hyphen "-" or a slash "/" is often used to separate the title from the artist. In this embodiment, such a symbol is called a specific symbol TK. A given piece of text data may or may not contain a specific symbol TK.

When the unnecessary character string processing unit 14 replaces unnecessary character strings FW with the predetermined delimiter K, the specific symbol TK may either be left as it is or be treated as an unnecessary character string FW and replaced with the delimiter K. Strings useful for identifying the item, such as the song title and artist name, are relatively likely to appear immediately before and after a specific symbol, so keeping the specific symbol TK and making use of it can improve the accuracy of keyword extraction. Replacing the specific symbol TK with the delimiter K, on the other hand, simplifies the keyword extraction process.

When the item information of the item described in the text data is in Japanese, that item information (for music content, for example, a Japanese title or a Japanese artist name) is relatively unlikely to contain whitespace. Taking advantage of this, when the text data is in Japanese, all whitespace characters may be replaced with the delimiter. Alternatively, for Japanese text, whitespace characters may be deleted so that the strings before and after them are joined.
This completes the detailed description of the processing performed by the unnecessary character string processing unit 14.

Returning to FIG. 3, in step S4 the keyword extraction unit 15 extracts keywords. Using the delimiter K as the boundary, it divides the text into the region from the beginning S to the character just before the first delimiter K, the regions between consecutive delimiters K, and the region from the character just after the last delimiter K to the end E, and takes the string in each region as a keyword. There are often several regions lying between delimiters K. When the specific symbol TK is used, the strings in the regions between a specific symbol TK and a delimiter K, between a specific symbol TK and the beginning S, or between a specific symbol TK and the end E may be given priority as keywords. This further improves the accuracy of keyword extraction.
If the unnecessary character string processing unit 14 has instead replaced unnecessary character strings FW with whitespace, keywords are extracted by splitting the text data at the whitespace positions.
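A minimal sketch of step S4 with the specific symbol TK retained might look like the following. The function name, the two-backslash delimiter, and the rule that a hyphen counts as TK only when surrounded by blanks are assumptions made for illustration; the text does not prescribe them.

```python
import re

# Specific symbol TK: a hyphen set off by blanks, or a slash (half- or
# full-width). Requiring blanks around the hyphen is an assumption, made
# so that hyphenated words are not split.
TK_RE = re.compile(r"\s+-\s+|\s*[/／]\s*")


def extract_keywords(text, delimiter="\\\\"):
    """Split at delimiter K, then list regions adjacent to TK first."""
    prioritized, others = [], []
    for region in text.split(delimiter):
        region = region.strip()
        if not region:
            continue
        parts = [p.strip() for p in TK_RE.split(region) if p.strip()]
        if len(parts) > 1:
            # The region contained TK: both sides are likely title/artist.
            prioritized.extend(parts)
        else:
            others.extend(parts)
    return prioritized + others
```

Returning the TK-adjacent strings first is one simple way to express "given priority"; an implementation could equally attach an explicit priority value to each keyword.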

The character types in a text region (kanji, hiragana, katakana, alphabetic characters, digits, and so on) may be examined to decide whether a keyword should include whitespace. For example, when a text region consists mainly of alphabetic characters, the strings before and after a blank are not joined; instead, the strings together with the blank between them are extracted as one keyword. In the example of FIG. 4(C), "M5 artist (family name), M6 blank, M7 artist (given name)" is extracted as one keyword.
When the region consists mainly of kanji, hiragana, and katakana, on the other hand, the strings before and after the blank are joined, and the result is extracted as one keyword. In the example of FIG. 4(C), "M5 artist (family name), M7 artist (given name)" is extracted as one keyword.
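The character-type rule can be illustrated with a short sketch. The threshold ("more than half of the non-blank characters are ASCII letters") and both function names are assumptions; the text only says "mainly".

```python
import re


def is_mainly_alphabetic(region):
    """True when more than half of the non-blank characters are ASCII letters."""
    chars = [c for c in region if not c.isspace()]
    if not chars:
        return False
    latin = sum(1 for c in chars if c.isascii() and c.isalpha())
    return latin / len(chars) > 0.5


def region_to_keyword(region):
    """Keep blanks for alphabetic text; join around blanks otherwise."""
    region = region.strip()
    if is_mainly_alphabetic(region):
        # Collapse runs of blanks to one space but keep the word boundary.
        return re.sub(r"\s+", " ", region)
    # Kanji/hiragana/katakana: drop blanks (including full-width spaces).
    return re.sub(r"\s+", "", region)
```

Full-width alphabetic characters are not counted as alphabetic in this sketch; converting them to half-width first would be one way to handle them.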

It is preferable that a keyword neither begin nor end with a whitespace character. When whitespace is not included in keywords, it is preferable to extract the non-blank string nearest the specific symbol as a keyword.

Only strings of a certain length may be used as keywords. For example, keywords may be extracted under a criterion (condition) such as accepting only strings of 5 to 15 characters. In that case, the length condition may vary with the character type. For example, because a single word written in alphabetic characters tends to contain more characters, a string may be accepted as a keyword when the combined length of its non-blank and blank characters is 7 to 20 characters.

For strings containing many kanji, the accepted length may be set shorter than for other character types, for example 2 to 10 characters. When the specific symbol TK is used, the keyword extraction conditions may also differ between text regions adjacent to the specific symbol and text regions farther from it. For example, the length condition may be relaxed for regions adjacent to the specific symbol (for example, 3 to 20 characters) and tightened for regions farther away (for example, 6 to 12 characters).
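The length conditions by character type can be sketched as a small lookup. The numeric ranges are the examples given in the text; the majority-based classification and the names `char_class` and `passes_length` are assumptions for illustration.

```python
# Example length ranges from the text, keyed by a hypothetical class name.
LENGTH_RULES = {
    "alphabet": (7, 20),  # blanks are counted in the length
    "kanji": (2, 10),
    "other": (5, 15),
}


def char_class(s):
    """Classify a string by the majority type of its non-blank characters."""
    chars = [c for c in s if not c.isspace()]
    if not chars:
        return "other"
    latin = sum(1 for c in chars if c.isascii() and c.isalpha())
    kanji = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    if latin * 2 > len(chars):
        return "alphabet"
    if kanji * 2 > len(chars):
        return "kanji"
    return "other"


def passes_length(s):
    """True when the string length falls inside the range for its class."""
    low, high = LENGTH_RULES[char_class(s)]
    return low <= len(s) <= high
```

A region-dependent rule (looser next to the specific symbol TK, stricter elsewhere) could be added by passing a second table and choosing between the two per region.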

In this way, step S4 extracts, for example, J keywords (J ≥ 1) from one article text.

In the next step S5, the grouping processing unit 16 creates keyword groups for each article text, using the keywords extracted from that article text in step S4.
When there is only one keyword (J = 1), one keyword group is created. When there are several keywords (J ≥ 2), several keyword groups are created as a rule. One keyword group may contain any number of keywords, one or more.
The method of creating keyword groups is described here using the four keywords K1, K2, K3, and K4 extracted from the text data shown in FIG. 2 as an example.

First, the case where each keyword group contains exactly one keyword is described. A group of this kind is still called a keyword group.
The grouping processing unit 16 places each of the keywords K1, K2, K3, and K4 in its own keyword group. It assigns each keyword group a keyword group identifier that distinguishes it from the others, and stores the groups in the keyword group storage unit 5 in the format shown in FIG. 7. FIG. 7 is an example of the search keyword group table based on the text data shown in FIG. 2.

The grouping processing unit 16 assigns the keyword group identifiers Gr001-001, Gr001-002, Gr001-003, and Gr001-004 to the keywords K1, K2, K3, and K4, respectively. In this example, the part of the identifier before the hyphen "-" is a string determined by the blog identifier; "Gr001" corresponds to "BlogID_001". Alternatively, the blog identifier itself may be used as the first half, giving identifiers such as "BlogID_001-001". The part after the hyphen is a serial number; it may be a serial number in order of creation, or the article acquisition date and time followed by a serial number. The grouping processing unit 16 then stores the keyword group identifier, the blog identifier, and the keywords in the keyword group in association with one another.

Next, the case where each keyword group contains two keywords is described.

The grouping processing unit 16 creates six keyword groups, one for each way of choosing two keywords from the four: "K1 and K2", "K1 and K3", "K1 and K4", "K2 and K3", "K2 and K4", and "K3 and K4".
In the example of FIG. 7, the grouping processing unit 16 assigns the keyword group identifiers Gr001-005, Gr001-006, Gr001-007, Gr001-008, Gr001-009, and Gr001-010 to "K1 and K2", "K1 and K3", "K1 and K4", "K2 and K3", "K2 and K4", and "K3 and K4", respectively.
When one keyword group contains several keywords, they may be stored as a single string joined by whitespace, or in a format from which each keyword can be read out separately.
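The grouping and numbering described for FIG. 7 can be reproduced with a few lines of Python. This is a sketch under assumptions: the function name and the integer `blog_no` parameter are hypothetical, and the identifier format simply mirrors the "Gr001-005" pattern in the example.

```python
from itertools import combinations


def build_keyword_groups(blog_no, keywords, sizes=(1, 2)):
    """Create keyword groups of each requested size with serial identifiers."""
    groups = []
    serial = 1
    for size in sizes:
        # combinations() yields each unordered selection exactly once,
        # in the same order as the example (K1&K2, K1&K3, ..., K3&K4).
        for combo in combinations(keywords, size):
            group_id = f"Gr{blog_no:03d}-{serial:03d}"
            groups.append((group_id, combo))
            serial += 1
    return groups
```

Each group is kept as a tuple so that the individual keywords can be read out separately; joining them with a blank would give the single-string storage form.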

When the items are songs, as in this embodiment, two strings, the song title and the artist name, are often useful for identifying the item. A keyword group containing two keywords therefore often allows the described item to be identified accurately. However, more keyword groups are produced than when each group contains one keyword, so the amount of processing increases.

If the grouping processing unit 16 creates both keyword groups containing a first number of keywords (one in the example of FIG. 7) and keyword groups containing a second, larger number of keywords (two in the example of FIG. 7), the described item can be identified accurately while the amount of processing is kept down, as explained later. Although not shown in FIG. 7, a column holding a priority (or priority rank) may be added to the search keyword group table so that each keyword group is given a priority. In step S6, described later, the item database 3 may then be searched in order of priority: first with the keyword group of highest priority, then with the keyword group of second-highest priority, and so on.

The priority can be assigned according to how well each keyword satisfies the keyword criteria (conditions) on string length, character type, and so on. Keywords extracted from strings near the specific symbol TK may also be given a higher priority.

(Operation of the item specifying unit 12)
Returning to FIG. 3, in step S6 the search unit 17 of the item specifying unit 12 reads the keyword groups one by one from the keyword group table stored in the keyword group storage unit 5, creates a search expression for each keyword group, and sends a search request to the item database 3.

The item database 3 in this embodiment stores an item table structured as shown in FIG. 6. On receiving a search request, the item database 3 searches the item table and, when at least one of the title column and the artist column matches the condition (search expression) specified in the request, sends that item's information (title, artist name, and so on) to the text information processing apparatus 1. The information sent to the text information processing apparatus 1 may include an item identifier.
If a retrieval model such as the vector space model is used, item information can be returned as a search result even when it does not contain the search keyword. The search unit 17 of the item specifying unit 12 obtains a list of item information based on the search expression included in the search request.

The data (a list of item information) obtained from the item database 3 for one search expression (one search) is called a search result set (search result list). When items matching the search expression exist, the search result set contains one or more pieces of item information. A piece of item information obtained by a search is also called simply a "search result".
When several keywords are specified in a search expression without an explicit AND or OR condition, the item database 3 of this embodiment interprets them as being joined by AND. When several items match the search expression, the item database 3 may send the search results in priority order. For example, the item with the highest priority may be the first search result, the item with the second-highest priority the second, and so on.

Here, the priority may be calculated from the degree to which the item information matches the search expression, or from the popularity of the item. For example, the number of times each item has been output as a search result may be counted and used as its popularity, with more popular items given higher priority. Popularity may also be calculated from externally available information such as the number of times an item has been used or its sales revenue. Furthermore, the text information processing apparatus 1 may calculate a popularity for each item from the ranking information described later and periodically provide it to the item database 3, which may use it to determine the priority.
Although the search unit 17 and the item database 3 cooperate to perform the search here, either one may perform it alone.

The search unit 17 uses one keyword group per search. When the keyword group contains several keywords, it creates a search expression that combines them with AND. The set of keywords used in one search expression is called a search key. In this embodiment, a keyword group corresponds to a search key. When a search expression contains no AND or OR condition and consists only of one or more keywords, the search expression and the search key can be regarded as equivalent.
For example, a search with a keyword group containing only one keyword returns the item information (here, title and artist name) of items whose title or artist name, or both, contains that keyword.

As shown in FIG. 7, a search using the keyword group with identifier Gr001-001, which contains only the keyword 「歌」 ("song"), returns results in which the title or artist name contains 「歌」. For example, the search result is a list of titles and artist names that includes 「愛の歌／Ｚ山Ｔ朗」, 「卒業の歌／Ｙバンド」, 「歌ソング／Ｃ＆Ａ」, 「春歌／Ａバンド」, and 「夏歌／Ａバンド」.

When a keyword group contains the two keywords (K1, K2), a search expression such as (K1 AND K2) is created. The search then returns information on items for which the title or artist name contains keyword K1 and the title or artist name also contains keyword K2.
For example, as shown in FIG. 7, a search using the keyword group with identifier Gr001-006, which contains 「歌」 and 「Ａバンド」, returns results in which the title or artist name contains 「歌」 and the title or artist name also contains 「Ａバンド」. In this case the list of titles and artist names 「春歌／Ａバンド」 and 「夏歌／Ａバンド」 is output.
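The AND interpretation can be illustrated with an in-memory stand-in for the item table. This is a sketch only: `ITEMS` and `search` are hypothetical, substring matching is assumed, and the real item database 3 is an external component whose query interface is not specified in the text.

```python
# Hypothetical item table rows as (title, artist), taken from the examples.
ITEMS = [
    ("愛の歌", "Ｚ山Ｔ朗"),
    ("卒業の歌", "Ｙバンド"),
    ("春歌", "Ａバンド"),
    ("夏歌", "Ａバンド"),
    ("Ａソング", "Ａバンド"),
]


def search(items, keyword_group):
    """Return items where every keyword occurs in the title or the artist."""
    return [
        (title, artist)
        for title, artist in items
        # all() expresses the AND condition across the keywords; the inner
        # 'or' expresses "title column or artist column".
        if all(k in title or k in artist for k in keyword_group)
    ]
```

A one-keyword group such as `("歌",)` matches four of the rows above, while the two-keyword group `("歌", "Ａバンド")` narrows the result to the two songs by Ａバンド whose titles contain 「歌」.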

In the next step S7, the item specifying unit 12 normalizes each piece of item information in the search result set. Normalization is performed because the item database 3 may return what is substantially the same item as several separate search results. When the items are songs, several notations of the title may be in use for what is substantially the same song.

For example, for the single song "Title A / Artist B", the item database 3 may return several search results such as "Title A (version C) / Artist B", "Title A / Artist B (featuring X)", and "Title A / Artist B with X". This is especially likely when the item data table has been built from song information created and supplied by many users. Normalizing the item information makes it possible to consolidate such notational variants into one. Specifically, for the string of each piece of item information (title and artist name) in the search result set, a normalized string is created by deleting predetermined strings or converting character types. For example, strings enclosed in parentheses (between "(" and ")") may be deleted.

Strings often used to supplement an artist name, such as "featuring" and "with", may also be registered in advance, and everything from such a string onward may be deleted from the artist name in the search result. Character type conversions may also be applied, such as half-width katakana to full-width katakana, full-width alphabetic characters to half-width, and full-width digits to half-width. Normalization is not essential, but normalizing the search results in this way allows the text data and the items to be associated more accurately.
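One way to sketch this normalization in Python is shown below. It relies on Unicode NFKC normalization from the standard library, which happens to perform the width conversions mentioned (half-width katakana to full-width, full-width letters and digits to half-width); using NFKC is an assumption of this sketch, not something the text specifies. The registered word list and function name are likewise hypothetical.

```python
import re
import unicodedata

# Hypothetical pre-registered strings that supplement an artist name.
SUPPLEMENT_WORDS = ("featuring", "with")

_SUPPLEMENT_RE = re.compile(
    r"\s+(?:%s)\s.*$" % "|".join(map(re.escape, SUPPLEMENT_WORDS)),
    re.IGNORECASE,
)


def normalize(text, is_artist=False):
    """Return a normalized form of a title or artist string."""
    # Width conversion; this also turns full-width parentheses into ASCII.
    text = unicodedata.normalize("NFKC", text)
    # Delete parenthesized segments such as "(version C)".
    text = re.sub(r"\([^()]*\)", "", text)
    if is_artist:
        # Delete the supplement word and everything after it.
        text = _SUPPLEMENT_RE.sub("", text)
    # Tidy up blanks left behind by the deletions.
    return re.sub(r"\s+", " ", text).strip()
```

The supplement words are stripped only when `is_artist` is set, because a title can legitimately contain a word such as "with".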

Next, in step S8, the item specifying unit 12 uses the normalized item information created in step S7 to calculate the similarity between the pieces of item information in the search result set, and takes the average of the calculated similarities as the score. It stores the score in the score column of the search result score table shown in FIG. 8, in association with the keyword group identifier. If the search found no matching item (the search result set is empty), no score is stored for that keyword group.

The score calculation method is described next. Suppose, for example, that the search result set contains three pieces of item information: (1) 「春歌／Ａバンド」, (2) 「Ａソング／Ａバンド」, and (3) 「夏歌／Ａバンド」. Three similarities are calculated: between (1) and (2), between (1) and (3), and between (2) and (3). The average of these three similarities may be used as the score. Calculating the similarity for every combination of item information in the search result set in this way gives an accurate score but requires more processing.

Alternatively, one reference item (reference search result) may be chosen from the items in the search result set, the similarity between it and each of the other items calculated, and the average used as the score. For example, with (1) 「春歌／Ａバンド」 as the reference item, two similarities are calculated, between (1) 「春歌／Ａバンド」 and (2) 「Ａソング／Ａバンド」 and between (1) 「春歌／Ａバンド」 and (3) 「夏歌／Ａバンド」, and their average is used as the score. Compared with calculating the similarity for every combination of item information in the search result set, this lowers the accuracy of the score but requires less processing. When the search result set contains many pieces of item information, the reference item method is preferable.
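The two scoring strategies can be compared side by side in a short sketch. The similarity function is passed in as a parameter; `char_jaccard` below is only a toy stand-in so that the example runs, and all function names are hypothetical. Both scoring functions assume at least two items in the search result set.

```python
from itertools import combinations


def char_jaccard(a, b):
    """Toy similarity: overlap of the character sets of two strings."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def score_all_pairs(items, similarity):
    """Average similarity over every pair: N*(N-1)/2 comparisons."""
    pairs = list(combinations(items, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def score_with_reference(items, similarity, ref_index=0):
    """Average similarity between one reference item and the rest: N-1 comparisons."""
    ref = items[ref_index]
    others = [x for i, x in enumerate(items) if i != ref_index]
    return sum(similarity(ref, x) for x in others) / len(others)
```

The comparison counts in the docstrings show the trade-off: the all-pairs score grows quadratically with the size of the search result set, whereas the reference-item score grows linearly.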

When the search result set contains only two pieces of item information, the similarity between them may be used directly as the score. When it contains only one, the similarity and score calculations may be skipped and the item information in that search result set associated with the blog identifier.

Various methods can be used to calculate the similarity. For example, morphological analysis is performed on N normalized search results (N ≥ 2) to extract words. At this stage, only specific parts of speech such as nouns and adjectives may be extracted, or particles and auxiliary verbs may be excluded. If M distinct words (words 1 to M) are extracted in total, an N × M occurrence matrix is created in which the rows correspond to the search results (item information), the columns correspond to the words, and each element is the frequency (count) with which the word appears in the search result. Alternatively, each element may be "1" if the word appears in the search result and "0" if it does not.
In the following, the elements of the occurrence matrix are denoted d_ij (i = 1 to N, j = 1 to M), where i is the row index and j is the column index of the matrix.
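The construction of the occurrence matrix can be sketched as follows. This is a minimal illustration, not the patented implementation: `tokenize` is a hypothetical stand-in for morphological analysis that simply splits on "/" and whitespace, and the item strings are made-up examples in the style of the text.

```python
from collections import Counter

def tokenize(text):
    # Stand-in for morphological analysis (an assumption): split on "/" and whitespace.
    return text.replace("/", " ").split()

def occurrence_matrix(results, binary=False):
    """Build the N x M occurrence matrix for N search results and M distinct words."""
    counts = [Counter(tokenize(r)) for r in results]
    vocab = sorted({word for c in counts for word in c})
    matrix = [
        [min(c[word], 1) if binary else c[word] for word in vocab]
        for c in counts
    ]
    return vocab, matrix

# Hypothetical normalized search results.
results = ["Haruka / A-band", "A-song / A-band", "Natsuka / A-band"]
vocab, d = occurrence_matrix(results)
# Rows follow `results`; columns follow the sorted vocabulary.
```

Passing `binary=True` gives the 0/1 variant of the matrix mentioned in the text.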

The similarity may be calculated for all combinations of the N results, but to simplify processing, one of the N rows of the occurrence matrix is selected as the reference search result (reference item), and the similarity between the reference search result and each of the other search results (other rows) is calculated. The reference search result may be chosen at random, but in the present embodiment the search result in the first row (the item information that the item database 3 output first) is used as the reference search result.

In the present embodiment, the cosine similarity given by Equation 1 is used to calculate the similarity. If the reference search result is the k-th row, the similarity S_ik between the reference search result and the i-th search result (i-th row) is obtained by Equation 1, where i = 1 to N, i ≠ k, and j = 1 to M.
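Equation 1 itself is not reproduced in this text. Assuming it is the standard cosine similarity applied to rows i and k of the occurrence matrix, with d_ij as defined above, it would read:

```latex
S_{ik} = \frac{\sum_{j=1}^{M} d_{ij}\, d_{kj}}
              {\sqrt{\sum_{j=1}^{M} d_{ij}^{2}}\;\sqrt{\sum_{j=1}^{M} d_{kj}^{2}}},
\qquad i = 1, \dots, N,\ i \neq k
```

Because the matrix elements are non-negative counts, S_ik under this form lies between 0 and 1.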

Although the present embodiment uses the cosine similarity, the similarity formula is not limited to this. For example, the similarity may be calculated using a known measure such as the Jaccard coefficient, the Simpson coefficient, or the Pearson product-moment correlation coefficient. Instead of extracting words by morphological analysis, the search results may be compared character by character. For example, for two normalized search results, it may be determined whether the p-th characters from the beginning match, and the similarity may be calculated from the outcome. A measure commonly used for string similarity, such as the Levenshtein distance, may also be used.
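Three of the alternative measures named above can be sketched as follows. These are the textbook definitions rather than anything specified in the patent; the function names are illustrative, and the set-based measures assume each search result has already been reduced to a collection of words.

```python
def jaccard(a, b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B| for two word collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def simpson(a, b):
    """Simpson (overlap) coefficient: |A ∩ B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b)) if (a and b) else 0.0

def levenshtein(s, t):
    """Levenshtein distance: minimum number of single-character edits turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]
```

Note that the Levenshtein distance is a dissimilarity, so a smaller value means a closer match; it would have to be converted (for example, by normalizing by string length) before being averaged as a similarity.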

The average of the similarities obtained for one search result set is then calculated and used as the score. For example, when N search results (N ≥ 3) are obtained, (N − 1) similarities are calculated between the reference search result and the other (N − 1) search results, and the average of these (N − 1) similarities is taken. Although the average is used as the score here, another statistic of the similarities, such as the minimum, median, mode, or a quartile, may be used instead. The larger the score, the more similar the search results are to one another. Alternatively, the number of similarities calculated from one search result set that are equal to or greater than a predetermined value may be counted, and that count divided by the number of items N in the search result set, or by the number of similarities calculated from that set, may be used as the score.
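The score calculation against a reference row can be sketched as follows. This is a minimal illustration under stated assumptions: the matrix `d` is a made-up example, the function names are hypothetical, and the counting variant divides by the number of similarities (one of the two divisors the text allows).

```python
import math
from statistics import mean

def cosine(u, v):
    """Cosine similarity of two equal-length count vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def reference_similarities(matrix, k=0):
    """Similarities between reference row k and every other row (needs N >= 2)."""
    return [cosine(matrix[k], row) for i, row in enumerate(matrix) if i != k]

def mean_score(matrix, k=0):
    """Score as the average of the (N - 1) similarities."""
    return mean(reference_similarities(matrix, k))

def fraction_score(matrix, cutoff, k=0):
    """Score as the share of similarities that reach `cutoff` (the counting variant)."""
    sims = reference_similarities(matrix, k)
    return sum(s >= cutoff for s in sims) / len(sims)

# Hypothetical occurrence matrix: each row shares one of its two words with row 0.
d = [[1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 1]]
score = mean_score(d)  # both similarities are 1/2, so the mean is about 0.5
```

Swapping `mean` for `min` or `statistics.median` gives the other statistics the text mentions.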

Common words used in blog articles often overlap with words used in song titles, and it is difficult to distinguish between them with rules prepared in advance. The extracted keywords may therefore include common words that are unrelated to any item.

If a keyword is a common word, searching the item database 3 with it is likely to return results for many songs rather than one. For example, many songs have a common word such as "love" in their titles, so searching the item database 3 with "love" as the search key is very likely to return results for a number of different songs. In such a case the search results are diverse, the similarity between them is low, and the score is also low.

On the other hand, if a keyword is a phrase specific to one song, or a phrase that is rarely used in general, the search results often relate to essentially one song even when there are several of them. In this case the similarity between the search results is high, and the score is also high. Calculating the score by the method described above therefore makes it possible to determine accurately whether the keyword (keyword group) used for the search has identified a single item.

In the next step S9, the validity determination unit 19 of the item specifying unit 12 determines whether the score is equal to or greater than a predetermined threshold θ. The value of θ may be set using search results collected in advance on a trial basis, or may be changed according to circumstances. If the score is equal to or greater than θ, the validity determination unit 19 judges that the keyword group leads to identification of an item, proceeds to step S10, and returns true. It also selects from the search result set a candidate item, that is, a candidate for the item to be associated with the blog article, and registers the item identifier of the candidate item in the "item identifier of candidate item" column of the search result score table shown in FIG. 8. If the score is less than θ, the validity determination unit 19 judges that the keyword group does not lead to identification of an item, proceeds to step S11, and returns false.
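The threshold test of step S9 can be sketched as follows. The description also allows θ to grow with the number of keywords in the group but gives no formula, so `threshold_for` is a purely hypothetical linear rule, and the default of 0.4 is simply the example value used for FIG. 8.

```python
def threshold_for(num_keywords, base=0.4, step=0.1):
    # Hypothetical rule: theta grows by `step` for each keyword beyond the first.
    return base + step * (num_keywords - 1)

def judge(score, num_keywords=1):
    """Return True (step S10) when score >= theta, otherwise False (step S11)."""
    return score >= threshold_for(num_keywords)
```

Setting `step=0` reduces this to a single fixed threshold for all keyword groups.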

FIG. 8 shows a search result score table containing the scores of the keyword groups with keyword group identifiers Gr001-001 to Gr001-010. The search result score table is stored in the score storage unit 7.
If the threshold θ is 0.4, then in the example of FIG. 8 the three keyword groups Gr001-006, Gr001-008, and Gr001-010 have scores equal to or greater than θ. Each keyword group whose score is equal to or greater than the threshold is associated with one item identifier from its search result set. The threshold θ may be changed according to the number of keywords in the keyword group, in which case a larger threshold (one that is harder to satisfy) is preferably used as the number of keywords increases. Any of the following methods can be used to select one item (candidate item) from the search result set.

The first method selects the first item that the item database 3 outputs as a search result (the item that the search unit 17 acquired first). This method can be used when the item database 3 outputs search results in order of priority. The text information processing apparatus 1 stores the order in which the search results were acquired.

The second method calculates the similarity between the keyword group (search key) and each of the search results obtained with it, and selects the search result (item) with the highest similarity. For example, for the keyword group Gr001-010, which contains "A song" and "A band", the similarity is calculated between the keywords "A song" and "A band" and each of the search results "A song / A band", "A song single ver. / A band", and "A song / A band withT". The similarity here is preferably calculated by a method that compares two character strings character by character. In this example the search result "A song / A band" has the highest similarity, so the validity determination unit 19 selects "A song / A band" as the candidate item, identifies its item identifier A001 by referring to the item table shown in FIG. 6, and registers A001 as the candidate item identifier for the keyword group (Gr001-010) in the search result score table. Instead of the similarity, the difference (degree of dissimilarity) or distance between the keyword group and each search result may be calculated.

The third method selects the item for which the difference between the item information normalized in step S7 and the item information before normalization is smallest. For example, if the item database 3 outputs the three items (1) "A song / A band", (2) "A song single ver. / A band", and (3) "A song / A band withT", and all of them normalize to "A song / A band", then (1) "A song / A band", whose character string is unchanged by normalization, is selected.

The fourth method uses the ranking information described later and selects the item ranked highest in ranking information created in the past. This is because items that have appeared more often in past blog articles are more likely to appear in new blog articles as well.
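The second and third selection methods above can be sketched as follows. The description only asks for a character-by-character comparison without naming one, so the use of `difflib.SequenceMatcher` from the Python standard library is an assumption, as are the function names and the choice to join the keywords with a space.

```python
from difflib import SequenceMatcher

def char_similarity(a, b):
    # difflib's ratio is one character-level measure; the text does not name a specific one.
    return SequenceMatcher(None, a, b).ratio()

def pick_by_query(keywords, results):
    """Second method: the search result most similar to the keyword group."""
    query = " ".join(keywords)
    return max(results, key=lambda r: char_similarity(query, r))

def pick_least_changed(pairs):
    """Third method: from (raw, normalized) pairs, the raw text that normalization changed least."""
    return max(pairs, key=lambda p: char_similarity(p[0], p[1]))[0]

results = ["A song / A band", "A song single ver. / A band", "A song / A band withT"]
best = pick_by_query(["A song", "A band"], results)
unchanged = pick_least_changed([(r, "A song / A band") for r in results])
```

In both calls the plain "A song / A band" entry is selected, which matches the outcome described in the text for this example.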

In the next step S12, the validity determination unit 19 determines whether validity has been determined for all keyword groups. If any keyword group has not yet been evaluated, the process returns to step S9 and the score of the next keyword group is compared with the threshold. If validity has been determined for all keyword groups, the process proceeds to step S13. Alternatively, instead of waiting until all keyword groups have been evaluated, step S12 may proceed to step S13 as soon as one keyword group has been determined to be true, which reduces the computational load.

In the next step S13, the validity determination unit 19 of the item specifying unit 12 registers, on the basis of the search result score table, each item identifier for which the validity determination was true together with the blog identifier in the item calculation result table held by the item calculation result storage unit 8, as shown in FIG. 9. In this way, the validity determination unit 19 identifies the item identifier for which the validity determination was true as the item information corresponding to that blog identifier.

In the example shown in FIG. 8, there are several (three) keyword groups whose scores are equal to or greater than the threshold, each associated with an item identifier (candidate item identifier). The item identifier of the keyword group with the highest score among them may be adopted and registered in the item calculation result table, or the item identifiers of all keyword groups at or above the threshold may be registered, because a single piece of text data may describe more than one item. When the accuracy of item identification is the priority, however, it is better to adopt only the item identifier of the keyword group with the highest score. It is also possible to select several keyword groups in descending order of score and register their item identifiers in the item calculation result table. Furthermore, candidate items may be registered in the search result score table even when the score calculated in step S8 is below the threshold, and the item identifier of the candidate item of the keyword group with the highest score may then be registered in the item calculation result table.
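The registration policies described above (highest score only, all groups at or above the threshold, or the top few by score) can be sketched as follows. The function name, the `policy` strings, and the removal of duplicate item identifiers are illustrative assumptions, and the scores in the usage example are made up rather than taken from FIG. 8.

```python
def select_item_ids(groups, theta, policy="best", top_k=2):
    """groups: (score, item_id) pairs, one per keyword group that has a candidate item."""
    valid = sorted((g for g in groups if g[0] >= theta),
                   key=lambda g: g[0], reverse=True)
    if policy == "best":
        valid = valid[:1]          # highest-scoring keyword group only
    elif policy == "top_k":
        valid = valid[:top_k]      # several groups in descending order of score
    # policy == "all" keeps every group at or above theta
    item_ids = []
    for _, item_id in valid:
        if item_id not in item_ids:  # assumption: list each item once per article
            item_ids.append(item_id)
    return item_ids

# Hypothetical (score, candidate item identifier) pairs for one blog article.
groups = [(0.2, "A003"), (0.5, "A002"), (0.9, "A001"), (0.6, "A001")]
```

With these values, `select_item_ids(groups, 0.4)` yields only `A001`, while `policy="all"` yields `A001` and `A002`.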

When the item identifier of the keyword group with the highest score is adopted, the blog identifier BlogID_001 and the item identifier A001 are output to the item calculation result table in the example shown in FIGS. 7 and 8.
In this way, the item identifier of the item being described can be associated accurately with the blog identifier. In the above description, one candidate item is selected from each search result set and registered in the search result score table, but a plurality of candidate items may be selected from one search result set and registered.

A display control unit 21 may be provided that causes a display unit to display information indicating the item identified as described above together with the corresponding blog article itself or information indicating the blog article (for example, the blog identifier or the blog title). For example, displaying a blog article together with the item name makes it immediately clear that the article is word-of-mouth information about that item. The display unit may be a display unit (not shown) of the text information processing apparatus 1 or a display unit (not shown) of the terminal device 4.
Displaying an item name and the blog articles associated with that item on the same screen is also useful, because several pieces of word-of-mouth information about the item can then be viewed at once.

(Operation of the ranking information creation unit 13)
Returning to FIG. 3, the processing performed by the ranking information creation unit 13 is described.
In step S14, the ranking information creation unit 13 refers to the item calculation result table shown in FIG. 9, the text data table shown in FIG. 5, and the item table shown in FIG. 6, and extracts the item information (title, artist, and so on), the user identifier, and the article creation/update date corresponding to each (blog identifier, item identifier) combination registered in the item calculation result table.

In step S15, the ranking information creation unit 13 counts the number of occurrences of each item identifier in the item calculation result table, creates a list of (item identifier, number of occurrences) pairs sorted in descending order of the number of occurrences (the first list), and stores it in the item ranking information storage unit 9. If a single user has written blog articles about a given item a predetermined number of times or more, processing may be added that reduces the number of occurrences of that item below the original count according to a predetermined rule.

In step S16, the ranking information creation unit 13 uses the data created in step S14 to count, for each item identifier registered in the item calculation result table, the number of distinct user identifiers. That is, it counts how many users' blogs mention a given item. It then creates a list of (item identifier, number of distinct user identifiers) pairs sorted in descending order of that number (the second list), and stores it in the item ranking information storage unit 9.

In step S17, the ranking information creation unit 13 uses the first list created in step S15 and the second list created in step S16 to create a ranking table in the format shown in FIG. 10. The ranking table is stored in the item ranking information storage unit 9. The ranking table associates a rank, an item identifier, and the number of occurrences of the item identifier, and can be created by various methods.

Specifically, the items are first ranked in descending order of the number of occurrences according to the first list. When several items have the same number of occurrences, those items are ranked in descending order of the number of distinct user identifiers according to the second list. In other words, the items are sorted in descending order with the number of occurrences of the item identifier as the first-priority key and the number of distinct user identifiers as the second-priority key, and ranks are assigned accordingly. Alternatively, the number of distinct user identifiers may be the first-priority key and the number of occurrences of the item identifier the second-priority key.
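The two-key ranking of steps S15 to S17 can be sketched as follows. The record layout `(blog_id, item_id, user_id)`, the function name, and the final tie-break on the item identifier are assumptions made for illustration; the description does not say how remaining ties are resolved.

```python
from collections import Counter, defaultdict

def build_ranking(records):
    """records: list of (blog_id, item_id, user_id) tuples from the item calculation results."""
    occurrences = Counter(item_id for _, item_id, _ in records)   # first list
    users = defaultdict(set)
    for _, item_id, user_id in records:
        users[item_id].add(user_id)                               # basis of the second list
    ordered = sorted(
        occurrences,
        # Occurrences first, distinct users second; the item id is an assumed final tie-break.
        key=lambda it: (-occurrences[it], -len(users[it]), it),
    )
    return [(rank, it, occurrences[it], len(users[it]))
            for rank, it in enumerate(ordered, start=1)]

# Hypothetical data: A001 and A002 both appear twice, but A002 has two distinct users.
records = [
    ("B1", "A001", "u1"), ("B2", "A001", "u1"),
    ("B3", "A002", "u1"), ("B4", "A002", "u2"),
    ("B5", "A003", "u3"),
]
ranking = build_ranking(records)
```

Swapping the first two elements of the sort key gives the alternative ordering in which the number of distinct users takes priority.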

The ranking table creation method described above is only an example, and the ranking can be created by various methods. For example, a total score may be calculated from the number of occurrences in the first list and the number of distinct user identifiers in the second list, and ranks may be assigned in descending order of the total score. This total score may be registered in the ranking table. Statistical processing may also be performed on various numerical values relating to the identified items. For example, several aggregation periods may be set, the number of occurrences of each item in the respective periods compared to calculate the rate of increase or decrease, and a label such as "rapidly rising" attached to items with a high rate of increase.
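The comparison of occurrence counts across aggregation periods can be sketched as follows. The function name, the definition of the rate as relative growth, the cutoff `min_rate`, and the decision to skip items with no occurrences in the earlier period are all assumptions; the description leaves these details open.

```python
def rising_items(previous, current, min_rate=1.0):
    """Items whose count grew by at least `min_rate` (1.0 = doubled) between two periods.

    previous, current: dicts mapping item_id to the number of occurrences in each period.
    """
    rising = []
    for item_id, now in current.items():
        before = previous.get(item_id, 0)
        if before == 0:
            continue  # rate undefined for newly appearing items; handling is left open
        rate = (now - before) / before
        if rate >= min_rate:
            rising.append((item_id, rate))
    return sorted(rising, key=lambda pair: pair[1], reverse=True)
```

For instance, an item that goes from 2 to 6 occurrences has a rate of 2.0 and would be labeled as rapidly rising under the default cutoff.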

The display control unit 21 may also cause the display unit to display the ranking and other information created as described above. Together with the ranking, it may display the blog articles associated with the items in the ranking and information on the users who wrote those articles. The display unit may be a display unit (not shown) of the text information processing apparatus 1 or a display unit (not shown) of the terminal device 4.

As described above, the text information processing apparatus of the present embodiment can accurately extract items, which are products or services, from text data such as blogs.

In addition, the text information processing apparatus of the present embodiment can process the extracted item information statistically.

For example, songs being written about on a microblogging service or the like within a predetermined period (for example, one week, one day, or one hour) are extracted, the number of articles and the number of users are counted for each song, and the songs are ranked according to those counts. The result can be used in marketing as statistical data on market trends. Presenting this information to users can also be expected to increase their willingness to purchase.

<Second Embodiment>
Next, another embodiment of the processing in the text information processing apparatus 1 is described with reference to the flowcharts of FIGS. 11 and 12.
In the first embodiment, a search using keyword groups containing a first number of keywords and a search using keyword groups containing a second number of keywords, larger than the first number, were both performed, or only one of them was performed. In the present embodiment, the number of keywords in a keyword group is increased depending on whether an item could be identified, so that the information being described can be identified accurately while the amount of processing is kept down.

The processing is basically the same as in the first embodiment except for steps S5a, S12a, and S12b in the flowchart of FIG. 11 and steps S5b, S12c, and S12d in the flowchart of FIG. 12. Description of the processing that is the same as in the first embodiment is therefore omitted.

In the present embodiment, in step S5a of FIG. 11, the grouping processing unit 16 creates, for each article text, keyword groups containing the first number of keywords. For example, it creates keyword groups each containing one keyword, like the keyword groups with keyword group identifiers Gr001-001 to Gr001-004 in FIG. 7.

In steps S6 to S11, the validity of the search results is determined in the same way as in the first embodiment.
In the next step S12a, the validity determination unit 19 determines whether validity has been determined for all keyword groups containing the first number of keywords. If any keyword group has not yet been evaluated, the process returns to step S9 and the score of the next keyword group is compared with the threshold. If validity has been determined for all keyword groups containing the first number of keywords, the process proceeds to step S12b.

In the next step S12b, it is determined whether any keyword group was determined to be true. If so, the process proceeds to step S13, and the valid keyword group and item are output. If no keyword group was determined to be true in step S12b, the process proceeds to step S5b shown in the flowchart of FIG. 12.

In step S5b of FIG. 12, the grouping processing unit 16 creates, for each article text, keyword groups containing a second number of keywords that is larger than the first number. For example, it creates keyword groups each containing two keywords, like the keyword groups with keyword group identifiers Gr001-005 to Gr001-010 in FIG. 7. Because creating keyword groups places a smaller load on the system than searching does, the keyword groups containing the second number of keywords may be created in advance.

Then, in steps S6 to S11, the validity of the search results is determined in the same way as in the first embodiment.
In the next step S12c, the validity determination unit 19 determines whether validity has been determined for all keyword groups containing the second number of keywords. If any keyword group has not yet been evaluated, the process returns to step S9 and the score of the next keyword group is compared with the threshold. If validity has been determined for all keyword groups containing the second number of keywords, the process proceeds to step S12d.

In the next step S12d, it is determined whether any keyword group was determined to be true. If so, the process proceeds to step S13 of FIG. 11, and the valid keyword group and item are output. If no keyword group was determined to be true in step S12d, the process proceeds to step S18, and the validity determination unit 19 determines that the article text does not describe an item.
Instead of ending the processing at step S18, the grouping processing unit 16 may create, for each article text, keyword groups containing a third number of keywords that is larger than the second number, and the same processing may be continued. How many keywords the largest keyword groups should contain may be decided as appropriate, for example according to the type of item to be identified.
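The escalation from smaller to larger keyword groups can be sketched as follows. The callback `evaluate` is a hypothetical stand-in for steps S6 to S11 (search, normalization, scoring, and candidate selection), and the function name, the default θ of 0.4, and the use of all keyword combinations of each size are assumptions.

```python
from itertools import combinations

def identify_items(keywords, evaluate, theta=0.4, max_size=2):
    """Try keyword groups of size 1, then 2, ... and stop at the first size that yields a hit.

    evaluate(group) -> (score, item_id) is a hypothetical callback wrapping steps S6 to S11.
    Returns a list of (group, item_id) pairs, or [] if no group is valid (step S18).
    """
    for size in range(1, max_size + 1):
        hits = []
        for group in combinations(keywords, size):
            score, item_id = evaluate(group)
            if score >= theta:
                hits.append((group, item_id))
        if hits:
            return hits  # larger groups are never searched once an item is identified
    return []

# Hypothetical evaluator: only the pair ("A song", "A band") pins down a single item.
def fake_evaluate(group):
    return (0.9, "A001") if set(group) == {"A song", "A band"} else (0.1, None)

found = identify_items(["love", "A song", "A band"], fake_evaluate)
```

Raising `max_size` corresponds to continuing with a third number of keywords instead of ending at step S18.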

As described above, by searching with progressively more keywords in each keyword group depending on whether an item could be identified, the information being described can be identified accurately while the amount of processing is kept down.

The embodiments of the present invention described above are examples given for the purpose of explanation, and the present invention is not limited to them. The present invention is also applicable to text other than blogs, for example data such as questionnaire responses. Although examples of processing blog articles about music have been shown, articles in fields other than music can of course be processed in the same way.

The present invention also includes a program for causing a computer to implement the functions of the respective units. The program may be read from a recording medium and loaded into the computer, or may be transmitted over a communication network and loaded into the computer.

The present invention is not limited to the embodiments described above, and various modifications are possible without departing from the gist of the present invention. For example, the embodiments and modifications may be combined. In addition, part of the configuration of the text information processing apparatus 1 may be provided as a separate unit, and the functions of the text information processing apparatus 1 may be realized by communicating with that separate unit over a network or the like.

1 Text information processing apparatus (server)
2 Text data server
3 Item database
4 Terminal device
5 Keyword group storage unit
6 Text data storage unit
7 Score storage unit
8 Item calculation result storage unit
9 Item ranking information storage unit
10 Text data collection unit
11 Keyword set generation unit
12 Item specification unit
13 Ranking information creation unit
14 Unnecessary character string processing unit
15 Keyword extraction unit
16 Grouping processing unit
17 Search unit
18 Similarity calculation unit
19 Validity determination unit
20 Network

Claims

An information processing apparatus comprising:
a search unit that searches a database and acquires a plurality of pieces of information corresponding to a search condition;
a similarity calculation unit that calculates a score based on the similarity among the plurality of pieces of information; and
a validity determination unit that determines the validity of the search condition based on the score.

The information processing apparatus according to claim 1, wherein the validity determination unit determines that the search condition is valid when the score is equal to or greater than a predetermined value.

The information processing apparatus according to claim 1, wherein the search condition includes one or more keywords, and the validity determination unit changes the criterion for determining validity according to the number of keywords.

The information processing apparatus according to claim 1, wherein, when the validity determination unit determines that the search condition is not valid, the search unit searches the database using a search condition that includes a different number of keywords from the number included in that search condition.

An information processing apparatus comprising:
a search unit that searches a database using a plurality of search conditions and acquires a plurality of search result sets, each consisting of a plurality of pieces of information and corresponding to a respective one of the plurality of search conditions;
a similarity calculation unit that calculates a plurality of scores corresponding to the plurality of search result sets, based on the similarity among the pieces of information included in each search result set; and
a validity determination unit that preferentially outputs information included in a search result set having a high score.

The information processing apparatus according to claim 5, wherein the validity determination unit identifies the score having the highest value among the plurality of scores calculated by the similarity calculation unit, and outputs at least one piece of information included in the search result set corresponding to the identified score.

The information processing apparatus according to claim 5, wherein the validity determination unit outputs at least one piece of information included in the search result set when the score is equal to or greater than a predetermined value, and outputs none of the information included in the search result set when the score is less than the predetermined value.

The information processing apparatus according to claim 7, wherein the search condition includes one or more keywords, and the validity determination unit changes the predetermined value according to the number of keywords included in the search condition.
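
For illustration only, the output selection described in the preceding claims might be sketched as below. This is one possible reading under stated assumptions, not the claimed implementation: the function names, the threshold values, and the choice to lower the threshold as the number of keywords grows are all hypothetical. The claims state only that the predetermined value changes with the number of keywords.

```python
from typing import Dict, List, Optional, Tuple

def threshold_for(num_keywords: int) -> float:
    # Hypothetical values: the direction and magnitudes are assumptions.
    return {1: 0.8, 2: 0.6}.get(num_keywords, 0.4)

def select_output(
    scored_sets: Dict[Tuple[str, ...], Tuple[float, List[str]]]
) -> Optional[List[str]]:
    """Visit result sets from highest to lowest score and return the first
    one whose score clears the threshold for its keyword count."""
    ranked = sorted(scored_sets.items(), key=lambda kv: kv[1][0], reverse=True)
    for condition, (score, results) in ranked:
        if score >= threshold_for(len(condition)):
            return results
    return None  # no set is reliable enough, so nothing is output
```

Each key of `scored_sets` is a search condition expressed as a tuple of keywords, and each value pairs the score of the corresponding result set with the results themselves.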

The information processing apparatus according to claim 5, wherein the similarity calculation unit calculates a plurality of similarities, each corresponding to a combination of two pieces of information included in the search result set, and uses the average, the median, the mode, a quartile, or the minimum of the plurality of similarities as the score.
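
As a rough sketch of this score calculation, the following computes a similarity for every pair of results and reduces the similarities to a single score. The token-based Jaccard similarity, the use of the first quartile, and all function names are assumptions made for illustration. The claim does not fix a particular similarity measure or quartile.

```python
from itertools import combinations
from statistics import mean, median, mode, quantiles
from typing import Callable, Dict, List

def jaccard(a: str, b: str) -> float:
    # Assumed similarity measure: overlap of whitespace-separated tokens.
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def first_quartile(xs: List[float]) -> float:
    # Fall back to the single value when there are too few points to interpolate.
    return quantiles(xs, n=4)[0] if len(xs) >= 2 else xs[0]

AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "mean": mean,
    "median": median,
    "mode": mode,
    "q1": first_quartile,
    "min": min,
}

def score(results: List[str], how: str = "mean") -> float:
    """Score a result set from the similarities of all pairs of its members."""
    sims = [jaccard(a, b) for a, b in combinations(results, 2)]
    if not sims:
        return 0.0  # fewer than two results: no pair to compare
    return AGGREGATORS[how](sims)
```

For the results `["blue sky alice", "blue sky bob", "blue moon alice"]` the pairwise similarities are 0.5, 0.5, and 0.2. Choosing `"min"` makes the score sensitive to a single outlier, while `"median"` tolerates it.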

An information processing method executed by one or more computers, the method comprising:
a search step of searching a database and acquiring a plurality of pieces of information corresponding to a search condition;
a similarity calculation step of calculating a score based on the similarity among the plurality of pieces of information acquired in the search step; and
a validity determination step of determining the validity of the search condition based on the score calculated in the similarity calculation step.

An information processing method executed by one or more computers, the method comprising:
a search step of searching a database using a plurality of search conditions and acquiring a plurality of search result sets, each consisting of a plurality of pieces of information and corresponding to a respective one of the plurality of search conditions;
a similarity calculation step of calculating a plurality of scores corresponding to the plurality of search result sets, based on the similarity among the pieces of information included in each search result set acquired in the search step; and
a validity determination step of preferentially outputting information included in a search result set for which the score calculated in the similarity calculation step is high.

An information processing program that causes one or more computers to function as:
a search unit that searches a database and acquires a plurality of pieces of information corresponding to a search condition;
a similarity calculation unit that calculates a score based on the similarity among the plurality of pieces of information acquired by the search unit; and
a validity determination unit that determines the validity of the search condition based on the score calculated by the similarity calculation unit.

An information processing program that causes one or more computers to function as:
a search unit that searches a database using a plurality of search conditions and acquires a plurality of search result sets, each consisting of a plurality of pieces of information and corresponding to a respective one of the plurality of search conditions;
a similarity calculation unit that calculates a plurality of scores corresponding to the plurality of search result sets, based on the similarity among the pieces of information included in each search result set acquired by the search unit; and
a validity determination unit that preferentially outputs information included in a search result set for which the score calculated by the similarity calculation unit is high.