JP2012003603A

JP2012003603A - Information retrieval system

Info

Publication number: JP2012003603A
Application number: JP2010139512A
Authority: JP
Inventors: Hisashi Takatori; Tomohiro Nakagaki
Original assignee: Hitachi Systems and Services Ltd
Current assignee: Hitachi Systems and Services Ltd
Priority date: 2010-06-18
Filing date: 2010-06-18
Publication date: 2012-01-05

Abstract

PROBLEM TO BE SOLVED: To display accurate retrieval results by focusing on the relationships between words. SOLUTION: In a retrieval system consisting of one or more computers that retrieve text information, the computers include: means for obtaining a first correlation between words included in a retrieval condition by analyzing the retrieval condition input by a user; means for obtaining a second correlation between words included in stored text information by analyzing that text information; and means for selecting, on the basis of the similarity between the first correlation and the second correlation, the text information to be output as matching the input retrieval condition.

Description

The present invention relates to an information search system, and more particularly to a technique for improving search accuracy.

A typical search system (search engine) searches content by performing pre-processing and main processing. Here, content is electronically represented information that a computer can process, such as an electronic document.

In the pre-processing, a search index is generated. Specifically, the content to be searched is analyzed, the characteristic character strings (feature strings) included in the content are acquired, the acquired feature strings are mapped to the content, and the mapping is saved in a storage area as a search index. In the main processing, the search query is analyzed, the group of keywords used for the search (search terms) is extracted, the search index generated in the pre-processing is scanned using the extracted search terms, and the set of matching content is acquired. The search results are then sorted by calculating the similarity to each matching content item.
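The two phases described above can be sketched in a few lines of Python. This is only an illustration of a conventional keyword search: the whitespace tokenizer, the term-count score, and the names `build_index` and `search` are assumptions made for the example and do not come from the specification.

```python
from collections import defaultdict


def build_index(contents):
    """Pre-processing: map each feature string to the IDs of the contents that contain it."""
    index = defaultdict(set)
    for content_id, text in contents.items():
        for term in text.lower().split():  # whitespace tokenization is an assumption
            index[term].add(content_id)
    return index


def search(index, contents, query):
    """Main processing: scan the index with the search terms, then sort the hits."""
    terms = query.lower().split()
    hits = set()
    for term in terms:
        hits |= index.get(term, set())

    def score(content_id):
        # Assumed similarity: how many times the search terms occur in the content.
        words = contents[content_id].lower().split()
        return sum(words.count(t) for t in terms)

    return sorted(hits, key=lambda cid: (-score(cid), cid))


contents = {
    "doc1": "apple pie recipe with apple sauce",
    "doc2": "apple releases a new phone",
    "doc3": "banana bread recipe",
}
index = build_index(contents)
print(search(index, contents, "apple recipe"))
```

Because the ranking looks only at which strings a content item contains, the two senses of "apple" in this hypothetical data are not told apart.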

Patent Document 1: Published Japanese translation of PCT application No. 2007-507801
Patent Document 2: JP 2005-165632 A
Patent Document 3: JP 2000-200281 A

However, words often have multiple meanings, and this tendency is especially pronounced for new words and abbreviations. As a result, a search on a given keyword may fill the top of the results with information that carries a meaning the searcher did not expect. Results that differ from the searcher's intention do not serve the searcher's purpose.

To cope with this, the searcher must revise the search query (for example, by adding search keywords) and run the search again, or browse down to the lower-ranked results. Such re-searching demands search skill and working time from the searcher, and consequently lowers work efficiency and work quality.

This problem arises because the search relies only on the relationship between content and feature strings (vocabulary), and it is difficult to avoid with current search engines.

To address this problem, Patent Document 1, for example, describes a search method that holds the user's personal information as profile information and adjusts the order of appearance in the search result list according to how well each result fits the user's preferences and interests. However, the method of Patent Document 1 cannot handle cases where homonyms carry different weights from the user's point of view.

Patent Document 2 proposes an information search method that recommends content according to the similarity between a feature vector of the content and a preference vector of the user. However, the method of Patent Document 2 likewise cannot handle cases where homonyms carry different weights from the user's point of view.

There are also methods that try to solve this problem using a synonym dictionary or an ontology (see, for example, Patent Document 3). The method of Patent Document 3 expands the search terms and can therefore reduce search omissions. However, it has difficulty excluding documents the searcher did not intend (search noise), and can even increase search noise.

Accordingly, an object of the present invention is to provide an information search system that displays accurate search results by focusing on the relationships between words.

In a representative example of the present invention, the search focuses on the relationships between vocabulary items, taking into account not only the relationship between content and vocabulary but also the relationships among vocabulary items themselves.

Specifically, as pre-processing, the content to be searched is analyzed, the vocabulary items it contains are acquired, and the relationship between content and vocabulary is mapped as a content index; the relationships between vocabulary items are also analyzed, and a content vocabulary model representing them is generated. Next, vocabulary related to the searcher (user) is extracted from the user's action history to generate a user index. The relationships between vocabulary items in the files the user has accessed are likewise analyzed, and a user vocabulary model representing them is generated.

Then, as main processing, the user vocabulary model is scanned using the search terms extracted from the search query to acquire the set of vocabulary items related to the search terms (hereinafter, the user-expanded search terms). The degrees of association among the search terms and the related vocabulary items are acquired, and a search query matrix in which those degrees are registered is generated. Next, the content index generated in the pre-processing is scanned using the user-expanded search terms to acquire the set of content containing them. The content vocabulary model of each content item is then scanned to acquire the degrees of association between the vocabulary items in that content, and a content matrix in which those degrees are registered is generated. Finally, the similarity between the search query matrix and each content matrix is calculated, and the search results are sorted by that similarity.
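As a rough illustration of the matrix comparison outlined above, the sketch below builds one association matrix for the query side and one per content item, then ranks the content by similarity. The similarity measure is not given in this part of the specification, so cosine similarity over the flattened matrices is an assumption, as are the dictionary-based vocabulary models, the function names, and all of the data.

```python
import math


def relation_matrix(terms, model):
    """Square matrix of association degrees between every pair of terms."""
    return [
        [0.0 if a == b else float(model.get(frozenset((a, b)), 0.0)) for b in terms]
        for a in terms
    ]


def matrix_similarity(m1, m2):
    """Cosine similarity of two equally sized matrices (an assumed measure)."""
    v1 = [x for row in m1 for x in row]
    v2 = [x for row in m2 for x in row]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0


# Hypothetical data: the search term "java" expanded with the user's related vocabulary.
terms = ["java", "coffee", "programming"]
user_model = {frozenset(("java", "programming")): 5, frozenset(("java", "coffee")): 1}
content_models = {
    "A": {frozenset(("java", "programming")): 4},
    "B": {frozenset(("java", "coffee")): 6},
}

query_matrix = relation_matrix(terms, user_model)
scores = {
    cid: matrix_similarity(query_matrix, relation_matrix(terms, model))
    for cid, model in content_models.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this hypothetical data the user mostly associates "java" with "programming", so content A, whose vocabulary shows the same association, is ranked above content B.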

According to the embodiment of the present invention, considering the relationships among vocabulary items as well as the relationship between content and vocabulary raises the similarity of content related to the vocabulary the searcher uses every day, even when a vocabulary item is ambiguous. Content close to the searcher's preferences can therefore be displayed at the top of the results, which improves search accuracy.

Each drawing listed below relates to the embodiment of the present invention.
Block diagram showing the overall configuration of the information search system.
Block diagram showing the configuration of the content analysis subsystem 101.
Flowchart of the processing executed by the content analysis subsystem 101.
Explanatory diagram of an example of the content index D131.
Explanatory diagram of an example of the content vocabulary model D132.
Explanatory diagram of an example of the user action history D137.
Block diagram showing the configuration of the user analysis subsystem 102.
Flowchart of the processing executed by the user analysis subsystem 102.
Explanatory diagram of an example of the user index D135.
Explanatory diagram of an example of the user vocabulary model D136.
Block diagram showing the configuration of the repository management subsystem 103.
Flowchart of the processing executed by the repository management subsystem 103.
Block diagram showing the configuration of the information search server 104.
Flowchart of the processing executed by the information search server 104.
Flowchart showing the detailed procedure of the search query matrix generation processing (S412).
Flowchart showing the detailed procedure of the search query matrix generation processing (S412).
Explanatory diagram of an example of generating the search query matrix in step S412.
Flowchart showing the detailed procedure of the similar content extraction processing (S413).
Flowchart showing the detailed procedure of the content matrix generation procedure (S4134).
Explanatory diagram of an example of generating the content matrix in step S413.
Flowchart showing the detailed procedure of the similarity calculation procedure (S4135).
Block diagram showing the configuration of the information search client 105.
Flowchart of the processing executed by the information search client 105.
Flowchart of the processing executed by the information search client 105.

<Overall system configuration>
FIG. 1 is a block diagram showing the overall configuration of the information search system according to the embodiment of the present invention.

The information search system of this embodiment includes a content analysis subsystem 101, a user analysis subsystem 102, a repository management subsystem 103, and an information search server 104. These devices are connected to one another by a network (LAN or WAN) 106. An information search client 105 is connected to the information search system via the network 106.

The content analysis subsystem 101 is a computer that analyzes the content to be searched in advance, before a search is executed; its detailed configuration is described later with reference to FIG. 2. Specifically, the content analysis subsystem 101 analyzes and extracts the relationships between vocabulary items in the content. These relationships are needed for processing in the information search server 104 and are stored in the repository management subsystem 103.

The user analysis subsystem 102 is a computer that analyzes the profile information of the searching user in advance, before a search is executed; its detailed configuration is described later with reference to FIG. 7. Specifically, the user analysis subsystem 102 analyzes the user's profile information by expressing user-specific information (preferences, range of interests, and so on) as relationships between vocabulary items. The analysis result is needed for processing in the information search server 104 and is stored in the repository management subsystem 103.

The repository management subsystem 103 is a database system that stores the information analyzed by the content analysis subsystem 101 and the user analysis subsystem 102; its detailed configuration is described later with reference to FIG. 11. Specifically, the repository management subsystem 103 provides that information in response to requests from the information search server 104.

The information search server 104 is a computer that searches for content; its detailed configuration is described later with reference to FIG. 13. Specifically, in response to a request from the information search client 105, the information search server 104 extracts the search targets from the repository management subsystem 103 and searches the extracted analysis data for content similar to the analysis result of the searching user. The search result is presented to the information search client 105.

The information search client 105 is a computer operated by the user who performs the search; its detailed configuration is described later with reference to FIG. 21. Specifically, the information search client 105 accepts a search request from the user, requests the search from the information search server 104, receives the search result from the information search server 104, and presents the result to the user through an output device.

Although FIG. 1 shows only one search client, the system may include a plurality of (two or more) search clients.

The network 106 is a network configured partly or wholly as a LAN or WAN, and connects the content analysis subsystem 101, the user analysis subsystem 102, the repository management subsystem 103, the information search server 104, and the information search client 105.

Although the analysis results are described as being transferred from the content analysis subsystem 101 and the user analysis subsystem 102 to the repository management subsystem 103 via the network 106, they may instead be transferred using a portable storage medium (for example, an optical disc or a non-volatile memory).

Two or more of the content analysis subsystem 101, the user analysis subsystem 102, the repository management subsystem 103, and the information search server 104 may also be implemented on a single physical computer. In that case, data can be transferred between the systems via an internal bus.

The information search client 105 and another subsystem (for example, the information search server 104) may also be implemented on a single physical computer.

<Content analysis subsystem 101>
FIG. 2 is a block diagram showing the configuration of the content analysis subsystem 101 according to the embodiment of the present invention.

The content analysis subsystem 101 is a computer of a general configuration that includes a memory 110, a storage device 120, a CPU 130, an output device 140, an input device 150, and a communication interface 160, all connected by a bus 170.

The memory 110 stores the programs executed by the CPU 130. Specifically, a system control program P10 and a content analysis program P11 are stored in the memory 110. The memory 110 also provides a work area that temporarily holds data while the CPU 130 executes a program.

The system control program P10 is a so-called operating system and controls the content analysis subsystem 101 as a whole.

The content analysis program P11 is a program that analyzes the content to be searched.

The storage device 120 is a non-volatile storage device that retains its contents even when power is cut off, and is configured by, for example, a magnetic disk drive (HDD) or a flash memory (SSD). The storage device 120 stores various programs D100. The various programs D100 include the system control program P10 and the content analysis program P11 described above, which are loaded into the memory 110 when executed by the CPU 130.

The CPU 130 executes the programs stored in the memory 110.

The output device 140, the input device 150, and the interface 160 are connected to one another by the bus 170.

The interface 160 is a network interface (NIC) that exchanges data, according to a predetermined protocol, with devices connected to the network 106.

The output device 140 is, for example, a display that shows processing results on a screen or a printer that outputs them on paper.

The input device 150 is, for example, a keyboard or a mouse with which a user gives instructions to the content analysis subsystem 101.

The content analysis subsystem 101 transfers the content analysis results obtained by executing the content analysis program P11 to the repository management subsystem 103 via the network 106. The details of this processing are described next.

FIG. 3 is a flowchart of the processing executed by the content analysis subsystem 101 according to the embodiment of the present invention.

First, on receiving a designation of the content to be analyzed, the content analysis subsystem 101 generates a list of all content to be analyzed (S101) and acquires the content index D131 (see FIG. 11) for the content in the list from the repository management subsystem 103 (S102). The content index acquired in step S102 contains the results of any earlier analysis of the content; its structure is described later with reference to FIG. 4.

The content analyzed by the content analysis subsystem 101 is the content information D130 stored in the storage device 120 of the repository management subsystem 103. Each piece of content information is text data contained in an information content item or text data attached to that item. In this case, each piece of content information (each text data item) consists of a plurality of words.

Alternatively, the content information may be an aggregate of a plurality of text data items drawn from the text data contained in an information content item and the text data attached to it. In this case, each text data item consists of one or more words, and because the text data items together form the aggregate, each piece of content information consists of a plurality of words.

Next, the loop control parameter n is initialized to 1 (S103).

Then, the n-th content item is read from the content information D130 of the repository management subsystem 103 (S104), and if no content identifier has been assigned to it, a new content identifier is assigned (S105). The assigned content identifier is stored in the content information D130 (see FIG. 11) of the repository management subsystem 103 in step S112.

Next, it is determined whether the read content contains text information (S106). If it does not, there is nothing to analyze, and the process proceeds to step S109.

If the read content does contain text information, the vocabulary items appearing in it are extracted by morphological analysis, and the content index is updated based on their frequency of appearance (S107). The updated content index is stored in the content index D131 (see FIG. 11) of the repository management subsystem 103 in step S112. In this specification, a "vocabulary item" is a character string forming an individual word or compound word of a sentence, and is the unit into which text is decomposed by the morphological analysis described later.

Next, the vocabulary items that appear within a predetermined range before and after each vocabulary item extracted in step S107 are treated as co-occurring vocabulary, and the content vocabulary model is updated based on how often each combination co-occurs (S108). The updated content vocabulary model is stored in the content vocabulary model D132 (see FIG. 11) of the repository management subsystem 103 in step S112.

The method of obtaining the content vocabulary model is described next.

For example, consider the following text:
「私の上司は笠松さんです。笠松さんは部長をしています。」 ("My boss is Mr. Kasamatsu. Mr. Kasamatsu is the general manager.")
Morphological analysis decomposes this text into character strings (vocabulary items) in units of words or compound words, as follows.

「私（私）／の（の）／上司（上司）／は（は）／笠松（笠松）／さん（さん）／です（です）／。（。）／笠松（笠松）／さん（さん）／は（は）／部長（部長）／を（を）／し（する）／て（て）／い（いる）／ます（ます）／。（。）」

The character string in parentheses is the base (dictionary) form of each vocabulary item.

Next, co-occurrence information acquisition is performed on the decomposed character strings (vocabulary items). Specifically, based on the positions at which vocabulary items appear in the text, other vocabulary items considered to be related to a given item are treated as its co-occurring vocabulary, and the co-occurring items and their frequencies are acquired. For example, if another vocabulary item appears within a predetermined range of a given item (from immediately before or after it to a few words away) statistically significantly often, the two items can be considered strongly related. Accordingly, up to n vocabulary items (n > 0) before and after a given item are collected as co-occurring vocabulary, and statistical processing of the co-occurrence frequencies yields the items that co-occur frequently, that is, the highly related items. Co-occurring vocabulary is collected and analyzed using base forms. This prevents inflecting words such as verbs and adjectives from being treated as different vocabulary items when they have the same meaning but appear in different inflected forms. Symbols such as punctuation marks are not collected as co-occurring vocabulary.

Here, consider the case where n = 4.

In this case, the vocabulary items that co-occur with 「笠松」 (Kasamatsu) in the first sentence are 「私」, 「の」, 「上司」, and 「は」 within the four positions before it, and 「さん」 and 「です」 within the four positions after it. Those that co-occur with 「笠松」 in the second sentence are 「さん」 and 「です」 within the four positions before it, and 「さん」, 「は」, 「部長」, and 「を」 within the four positions after it.
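The n = 4 example can be reproduced with a short Python sketch. The base forms are hard-coded from the morphological analysis result given earlier; a real system would obtain them from a morphological analyzer. Skipping the target item itself when it falls inside its own window is an inference from the worked example rather than a stated rule, and the names `window` and `cooccurrence_counts` are illustrative.

```python
from collections import Counter

PUNCTUATION = {"。", "、"}

# Base forms taken from the morphological analysis of the example text.
TOKENS = ["私", "の", "上司", "は", "笠松", "さん", "です", "。",
          "笠松", "さん", "は", "部長", "を", "する", "て", "いる", "ます", "。"]


def window(tokens, i, n):
    """Return (before, after): items co-occurring within n positions of tokens[i]."""
    target = tokens[i]

    def keep(token):
        # Punctuation is never collected; dropping the target itself is an assumption.
        return token not in PUNCTUATION and token != target

    before = [t for t in tokens[max(0, i - n):i] if keep(t)]
    after = [t for t in tokens[i + 1:i + 1 + n] if keep(t)]
    return before, after


def cooccurrence_counts(tokens, target, n):
    """Count how often each item falls inside the window of any occurrence of target."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            before, after = window(tokens, i, n)
            counts.update(before + after)
    return counts


print(window(TOKENS, 4, 4))   # first occurrence of 笠松
print(window(TOKENS, 8, 4))   # second occurrence of 笠松
print(cooccurrence_counts(TOKENS, "笠松", 4))
```

The per-item totals from `cooccurrence_counts` correspond to the simplest form of the degree of association, a raw co-occurrence count.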

The frequency with which each pair of vocabulary items co-occurs is then calculated and used as the degree of association. Besides setting the degree of association to the raw co-occurrence count, the t-test from statistics can also be used, which makes it possible to estimate whether the frequency of a co-occurring item is significantly high. Specifically, the following formula may be used.
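The formula itself does not survive in this text. Purely as a sketch, the function below computes one widely used t-score from collocation statistics, which compares the observed pair probability with the probability expected if the two items were independent. It is an assumption and may differ from the expression the applicant intended; the function name, parameters, and counts are all hypothetical.

```python
import math


def t_score(pair_count, count_a, count_b, total):
    """t-score for the co-occurrence of items a and b in a corpus of `total` positions.

    observed = P(a, b) estimated from the pair count
    expected = P(a) * P(b) under the independence hypothesis
    The variance of the Bernoulli indicator is observed * (1 - observed).
    """
    if pair_count <= 0:
        raise ValueError("pair_count must be positive")
    observed = pair_count / total
    expected = (count_a / total) * (count_b / total)
    variance = observed * (1.0 - observed)
    return (observed - expected) / math.sqrt(variance / total)


# Hypothetical counts: the pair occurs 8 times in about 14.3 million positions.
print(t_score(pair_count=8, count_a=15828, count_b=4675, total=14307668))
```

A larger score means the pair co-occurs more often than chance alone would suggest, so the score can serve as a degree of association in place of the raw count.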

The method described above calculates the degree of association from the positions at which vocabulary items appear, but the degree of association can also be calculated from sentence structure (dependency relations). In that case, a syntactic parser analyzes the dependency structure, and the frequency of each dependency relation is used as the degree of association.

In addition, related vocabulary can be acquired using the method disclosed in JP 2003-167894 A. In that case, however, a step that classifies the content into appropriate categories beforehand is required.

These methods can also be combined to calculate the degree of association. This embodiment is described using the method that sets the degree of association by counting co-occurrences between vocabulary items based on their positions of appearance.

次に、コンテンツの関係者をユーザ行動履歴に登録することによって、ユーザ行動履歴を更新する（Ｓ１０９）。例えば、コンテンツの作成者や、更新者の情報をコンテンツの属性情報（例えば、ファイルのプロパティや、ファイルのメタデータ）から取得し、取得したユーザのユーザＩＤ、行動種別及びコンテンツＩＤを、新規のエントリ（頻度＝１）としてユーザ行動履歴Ｄ１３７に登録する。なお、ユーザＩＤ、行動種別及びコンテンツＩＤが同一のエントリが既に登録されている場合、頻度に１を加算する。更新されたユーザ行動履歴は、ステップＳ１１２において、リポジトリ管理サブシステム１０３のユーザ行動履歴Ｄ１３７（図１１参照）に格納される。 Next, the user action history is updated by registering the persons associated with the content in it (S109). For example, information on the creator and updaters of the content is acquired from the content's attribute information (for example, file properties or file metadata), and the user ID of each acquired user, the action type, and the content ID are registered in the user action history D137 as a new entry (frequency = 1). If an entry with the same user ID, action type, and content ID is already registered, 1 is added to its frequency. The updated user action history is stored in the user action history D137 (see FIG. 11) of the repository management subsystem 103 in step S112.
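The registration rule of step S109 amounts to an insert-or-increment on the key (user ID, action type, content ID). The following is a minimal Python sketch of that rule, assuming an in-memory `Counter` as a stand-in for the user action history D137. The function name `register_action` and the sample values are illustrative assumptions and do not appear in the specification.

```python
from collections import Counter


def register_action(history, user_id, action_type, content_id):
    """Register one action: new entries start at frequency 1,
    existing entries have 1 added to their frequency."""
    key = (user_id, action_type, content_id)
    history[key] += 1  # a Counter yields 0 for a key that is not yet present
    return history[key]


# Hypothetical stand-in for the user action history D137.
history = Counter()
register_action(history, "Takatori", "create", "CONT-1")
register_action(history, "Takatori", "update", "CONT-1")
register_action(history, "Takatori", "update", "CONT-1")
```

Keying on the full triple means that a second "update" by the same user on the same content raises the frequency instead of creating a duplicate entry.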

その後、パラメータｎとステップＳ１０１で生成したリストの要素数とを比較して、次の分析対象のコンテンツが存在するか否かを判定する（Ｓ１１０）。 Thereafter, the parameter n is compared with the number of elements in the list generated in step S101 to determine whether or not the next content to be analyzed exists (S110).

その結果、次（ｎ＋１番目）のコンテンツが存在する場合、パラメータｎに１を加算し（Ｓ１１１）、ステップＳ１０４に戻り、次のコンテンツを分析する。 As a result, when the next (n + 1) th content exists, 1 is added to the parameter n (S111), and the process returns to step S104 to analyze the next content.

一方、次（ｎ＋１番目）のコンテンツが存在しない場合、ステップＳ１０１で生成したリストの中の全てのコンテンツの分析が終了しているので、更新されたコンテンツインデックス、コンテンツ語彙モデル、ユーザ行動履歴をリポジトリ管理サブシステム１０３に転送する（Ｓ１１２）。 On the other hand, if there is no next (n + 1) th content, the analysis of all the content in the list generated in step S101 has been completed, so the updated content index, content vocabulary model, and user behavior history are stored in the repository. The data is transferred to the management subsystem 103 (S112).

リポジトリ管理サブシステム１０３は、これらのデータをコンテンツ分析サブシステム１０１から受信すると、受信したコンテンツ識別子をコンテンツ情報Ｄ１３０に格納し、コンテンツインデックスをコンテンツインデックスＤ１３１に格納し、コンテンツ語彙モデルをコンテンツ語彙モデルＤ１３２に格納し、ユーザ行動履歴をユーザ行動履歴Ｄ１３７に格納する。 Upon receiving these data from the content analysis subsystem 101, the repository management subsystem 103 stores the received content identifier in the content information D130, the content index in the content index D131, the content vocabulary model in the content vocabulary model D132, and the user action history in the user action history D137.

図４は、本発明の実施の形態のコンテンツインデックスの例を説明する図である。 FIG. 4 is a diagram illustrating an example of a content index according to the embodiment of this invention.

コンテンツ分析サブシステム１０１がステップＳ１０７において更新するコンテンツインデックスは、リポジトリ管理サブシステム１０３の記憶装置１２０にコンテンツインデックスＤ１３１として格納されており、語彙、コンテンツＩＤ及び重みを含む。 The content index updated by the content analysis subsystem 101 in step S107 is stored as a content index D131 in the storage device 120 of the repository management subsystem 103, and includes a vocabulary, a content ID, and a weight.

語彙は、コンテンツにテキスト情報として含まれる単語又は複合語単位の文字列である。コンテンツＩＤは、コンテンツを一意に識別するための識別子であり、当該語彙が含まれるコンテンツを示す。重みは、当該語彙がこのコンテンツに出現する度数を示す。 The vocabulary is a character string, in units of words or compound words, included in the content as text information. The content ID is an identifier that uniquely identifies the content and indicates the content that contains the vocabulary. The weight indicates the number of times the vocabulary appears in that content.

なお、重みは、当該語彙がこのコンテンツに出現する回数を用いて設定する方法の他、当該語彙がこのコンテンツに出現する回数をコンテンツ内で出現する語彙総数で除することによって正規化した指標を用いる方法や、このコンテンツが含まれるコンテンツ集合における当該語彙の出現確率を用いる方法や、文中の役割（主語、述語、目的語）などに応じて決定する方法や、これらの方式を組み合わせて重みを設定する方法など様々な公知の方法を用いることができる。本実施の形態では、当該語彙がこのコンテンツに出現する回数を用いる方法を用いて説明する。 Besides using the raw number of times the vocabulary appears in the content, the weight can be set by various known methods: a normalized index obtained by dividing the number of appearances by the total number of vocabulary items appearing in the content; the appearance probability of the vocabulary in the content set that contains the content; a value determined by the role of the vocabulary in the sentence (subject, predicate, object); or a combination of these. In the present embodiment, the description uses the number of times the vocabulary appears in the content.
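Two of the weighting options listed above can be sketched as follows, assuming the content has already been split into a list of vocabulary items (the morphological analysis itself is outside this sketch). The function names are hypothetical, and the "total number of vocabulary items" is interpreted here as the total token count of the content, which is an assumption.

```python
from collections import Counter


def raw_count_weights(tokens):
    """Weight = number of times each vocabulary item appears in the content."""
    return dict(Counter(tokens))


def normalized_weights(tokens):
    """Weight = appearance count divided by the total number of items in the content."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


# Hypothetical tokenized content.
tokens = ["BT", "SOA", "BT", "search"]
raw = raw_count_weights(tokens)
normalized = normalized_weights(tokens)
```

Raw counts favor long content, while the normalized variant makes weights comparable across content of different lengths.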

図５は、本発明の実施の形態のコンテンツ語彙モデルの例を説明する図である。 FIG. 5 is a diagram illustrating an example of the content vocabulary model according to the embodiment of this invention.

コンテンツ分析サブシステム１０１がステップＳ１０８において生成するコンテンツ語彙モデルは、コンテンツに含まれる語彙の共起情報であり、リポジトリ管理サブシステム１０３の記憶装置１２０にコンテンツ語彙モデルＤ１３２として格納されており、語彙１、語彙２、コンテンツＩＤ及び関連度を含む。 The content vocabulary model generated by the content analysis subsystem 101 in step S108 is co-occurrence information on the vocabulary included in the content. It is stored as the content vocabulary model D132 in the storage device 120 of the repository management subsystem 103 and includes vocabulary 1, vocabulary 2, a content ID, and a relevance.

語彙１及び語彙２は、コンテンツ中の所定の範囲内で共起する語彙の組である。所定の範囲内で共起する語彙とは、前述したように、例えば、ある語彙の前後４個以内に存在する語彙を共起する語彙と定めることができる。 Vocabulary 1 and vocabulary 2 are a pair of vocabulary items that co-occur within a predetermined range in the content. As described above, the predetermined range can be defined, for example, so that vocabulary items located within four items before or after a given vocabulary item are treated as co-occurring with it.

コンテンツＩＤは、語彙１及び語彙２が共起しているコンテンツの識別子である。関連度は、語彙１及び語彙２の組が共起する度数であり、例えば、当該コンテンツ内で語彙１及び語彙２の組が共起した回数を用いることができる。すなわち、共起する回数が多い語彙の組は、その語彙の関連性が高いといえる。 The content ID is the identifier of the content in which vocabulary 1 and vocabulary 2 co-occur. The relevance is the frequency with which the pair of vocabulary 1 and vocabulary 2 co-occurs; for example, the number of times the pair co-occurs in the content can be used. That is, a pair of vocabulary items that co-occurs many times can be regarded as highly related.
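The window-based co-occurrence count used as the relevance can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the embodiment: the content is assumed to be already tokenized, pairs are treated as unordered, and each pair of positions is counted once by looking only forward from each item. The function name `cooccurrence_counts` is hypothetical.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=4):
    """Count unordered pairs of distinct vocabulary items whose positions
    are at most `window` items apart."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # Looking only forward covers "before and after" without double counting.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts


# Hypothetical tokenized content.
tokens = ["BT", "SOA", "search", "BT"]
model = cooccurrence_counts(tokens, window=4)
```

Each key of the resulting counter corresponds to a (vocabulary 1, vocabulary 2) pair and each value to the relevance for one content; a row of the content vocabulary model would add the content ID.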

図６は、本発明の実施の形態のユーザ行動履歴の例を説明する図である。 FIG. 6 is a diagram illustrating an example of a user behavior history according to the embodiment of this invention.

コンテンツ分析サブシステム１０１がステップＳ１０９において更新するユーザ行動履歴は、リポジトリ管理サブシステム１０３の記憶装置１２０にユーザ行動履歴Ｄ１３７として格納されており、ユーザＩＤ、行動種別、コンテンツＩＤ、キーワード及び頻度を含む。 The user action history updated by the content analysis subsystem 101 in step S109 is stored as the user action history D137 in the storage device 120 of the repository management subsystem 103 and includes a user ID, an action type, a content ID, a keyword, and a frequency.

ユーザＩＤは、本情報検索システムを使用するユーザを一意に識別する識別子であり、図示した例では名前が用いられている。 The user ID is an identifier for uniquely identifying a user who uses the information search system, and a name is used in the illustrated example.

行動種別は、ユーザがコンテンツにアクセスした行動の種別を一意に識別する識別子である。なお、図６に示すユーザ行動履歴には、コンテンツの「作成」、「更新」の他、「閲覧」、「検索」が記録されているが、一部の種別のアクセス（例えば、「作成」、「更新」のみ）が記録されてもよい。また、その他の行動種別（例えば、「メタタグ付与」など）を設定し、該当する行動を記録してもよい。 The action type is an identifier that uniquely identifies the type of action by which the user accessed the content. The user action history shown in FIG. 6 records “view” and “search” in addition to “create” and “update” of content, but only some types of access (for example, only “create” and “update”) may be recorded. Other action types (for example, “meta tag assignment”) may also be defined and the corresponding actions recorded.

コンテンツＩＤは、ユーザがアクセスしたコンテンツを一意に識別する識別子である。キーワードは、ユーザがコンテンツに対して行動した際に付随するキーワード（例えば、検索キーワード、メタタグなど）である。頻度は、そのユーザが、その行動によって、そのコンテンツにアクセスした回数である。 The content ID is an identifier that uniquely identifies the content accessed by the user. The keyword is a keyword associated with the user's action on the content (for example, a search keyword or a meta tag). The frequency is the number of times the user has accessed the content by that action.

例えば、図６に示すユーザ行動履歴では、ユーザ「高取」が「作成」したコンテンツ「ＣＯＮＴ−１」を、ユーザ「高取」が「２」回「更新」し、ユーザ「野崎」が「３」回閲覧し、ユーザ「野崎」が「１」回検索していることが分かる。 For example, the user action history shown in FIG. 6 shows that the content “CONT-1”, “created” by the user “Takatori”, was “updated” “2” times by the user “Takatori”, viewed “3” times by the user “Nozaki”, and searched for “1” time by the user “Nozaki”.

なお、行動種別が「作成」又は「更新」など、コンテンツ作成に付随する行動である場合、コンテンツ分析サブシステム１０１がユーザ行動履歴を操作する。また、行動種別が「閲覧」又は「検索」など、コンテンツ作成に直接的には付随しない行動である場合、情報検索サーバ１０４がユーザ行動履歴を操作する。 When the action type is an action that accompanies content creation, such as “create” or “update”, the content analysis subsystem 101 manipulates the user action history. When the action type is an action that does not directly accompany content creation, such as “view” or “search”, the information search server 104 manipulates the user action history.

なお、本実施の形態では、コンテンツインデックスＤ１３１及びコンテンツ語彙モデルＤ１３２の構築方法として、対象となるコンテンツ内に存在するテキスト情報を使用する方法について説明したが、コンテンツ内のテキスト情報だけでなく、コンテンツに付随するテキスト情報（コンテンツの属性、メタタグなど）を利用して、コンテンツインデックスＤ１３１及びコンテンツ語彙モデルＤ１３２を構築することもできる。例えば、コンテンツの属性やメタタグに基づいて、ステップＳ１０７を実行することによって、コンテンツインデックスを構築することができる。 In the present embodiment, the content index D131 and the content vocabulary model D132 are constructed from the text information in the target content. However, they can also be constructed using not only the text information in the content but also text information attached to the content (content attributes, meta tags, and the like). For example, a content index can be constructed by executing step S107 on the basis of content attributes and meta tags.

また、同様に、例えば、同じコンテンツに付随するメタタグ同士は関連性があるものと判定して、ステップＳ１０８を実行することによって、コンテンツ語彙モデルを構築することもできる。 Similarly, for example, it is possible to construct a content vocabulary model by determining that meta tags attached to the same content are related and executing step S108.

これらの情報は、リポジトリ管理サブシステム１０３のコンテンツ語彙モデルＤ１３２に格納されているため、必要な情報をここから適宜抽出し、該当処理を実行すればよい。 Since these pieces of information are stored in the content vocabulary model D132 of the repository management subsystem 103, necessary information may be appropriately extracted from here and the corresponding process may be executed.

＜ユーザ分析サブシステム１０２＞ <User analysis subsystem 102>
図７は、本発明の実施の形態のユーザ分析サブシステム１０２の構成を示すブロック図である。 FIG. 7 is a block diagram showing a configuration of the user analysis subsystem 102 according to the embodiment of this invention.

ユーザ分析サブシステム１０２は、前述したコンテンツ分析サブシステム１０１（図２）と、格納されているプログラムが異なる以外は同じ構成を有する。このため、前述したコンテンツ分析サブシステム１０１と同じ構成には同じ符号を付し、その説明は省略する。 The user analysis subsystem 102 has the same configuration as the content analysis subsystem 101 (FIG. 2) described above except that the stored program is different. For this reason, the same components as those of the content analysis subsystem 101 described above are denoted by the same reference numerals, and description thereof is omitted.

すなわち、ユーザ分析サブシステム１０２は、メモリ１１０、記憶装置１２０、ＣＰＵ１３０、出力装置１４０、入力装置１５０及び通信インターフェース１６０を備え、これらの各構成がバス１７０によって接続される計算機である。 That is, the user analysis subsystem 102 is a computer that includes a memory 110, a storage device 120, a CPU 130, an output device 140, an input device 150, and a communication interface 160, all connected by a bus 170.

メモリ１１０は、ＣＰＵ１３０によって実行されるプログラムを格納する。具体的には、システム制御プログラムＰ１０及びユーザ分析プログラムＰ１２がメモリ１１０に格納される。ユーザ分析プログラムＰ１２は、ユーザがアクセスしたコンテンツの内容に基づいてユーザによるコンテンツの嗜好を分析するプログラムである。 The memory 110 stores a program executed by the CPU 130. Specifically, the system control program P10 and the user analysis program P12 are stored in the memory 110. The user analysis program P12 is a program for analyzing the user's preference for content based on the content accessed by the user.

記憶装置１２０に格納される各種プログラムＤ１００には、システム制御プログラムＰ１０及びユーザ分析プログラムＰ１２が含まれており、ＣＰＵ１３０によって実行される際にメモリ１１０にロードされる。 The various programs D100 stored in the storage device 120 include a system control program P10 and a user analysis program P12, which are loaded into the memory 110 when executed by the CPU 130.

ユーザ分析サブシステム１０２は、ユーザ分析プログラムＰ１２を実行することによって得られたユーザの分析結果を、ネットワーク１０６を介して、リポジトリ管理サブシステム１０３に転送する。次に、この処理の詳細を説明する。 The user analysis subsystem 102 transfers the user analysis result obtained by executing the user analysis program P12 to the repository management subsystem 103 via the network 106. Next, details of this processing will be described.

図８は、本発明の実施の形態のユーザ分析サブシステム１０２によって実行される処理のフローチャートである。 FIG. 8 is a flowchart of processing executed by the user analysis subsystem 102 according to the embodiment of this invention.

図８に示す処理は、定期的に全データを分析し、分析結果は上書き更新される。 The process shown in FIG. 8 periodically analyzes all data, and the analysis result is overwritten and updated.

まず、ユーザ分析サブシステム１０２は、分析対象となるユーザを取得し、分析対象となる全てのユーザのリストを生成する（Ｓ２０１）。そして、生成されたリストに含まれるユーザのインデックスＤ１３５（図１１参照）をリポジトリ管理サブシステム１０３から取得する（Ｓ２０２）。ステップＳ２０２において取得されるユーザインデックスファイルには、ユーザの今までの分析結果が含まれており、その構成は図９を用いて後述する。 First, the user analysis subsystem 102 acquires the users to be analyzed and generates a list of all of them (S201). It then acquires the user index D135 (see FIG. 11) of the users included in the generated list from the repository management subsystem 103 (S202). The user index file acquired in step S202 contains the analysis results obtained for the users so far; its structure is described later with reference to FIG. 9.

さらに、ユーザ行動履歴Ｄ１３７（図１１参照）をリポジトリ管理サブシステム１０３から取得する（Ｓ２０３）。 Further, the user action history D137 (see FIG. 11) is acquired from the repository management subsystem 103 (S203).

その後、ループを制御するパラメータｎを１に初期設定する（Ｓ２０４）。 Thereafter, the parameter n for controlling the loop is initialized to 1 (S204).

そして、ｎ番目のユーザ情報をリポジトリ管理サブシステム１０３のユーザ情報Ｄ１３４から読み出す（Ｓ２０５）。そして、該当するユーザ情報が存在するか否か、すなわち、ステップＳ２０１で生成されたリストに含まれるユーザの情報がリポジトリ管理サブシステム１０３のユーザ情報Ｄ１３４から読み出すことができたか否かを判定する（Ｓ２０６）。 Then, the nth user information is read from the user information D134 of the repository management subsystem 103 (S205). Then, it is determined whether or not the corresponding user information exists, that is, whether or not the user information included in the list generated in step S201 has been read from the user information D134 of the repository management subsystem 103 ( S206).

その結果、ユーザ情報が存在する、すなわち、ユーザ情報がリポジトリ管理サブシステム１０３のユーザ情報Ｄ１３４から読み出すことができた場合、ステップＳ２０８に進む。一方、ユーザ情報が存在しない、すなわち、ユーザ情報がリポジトリ管理サブシステム１０３のユーザ情報Ｄ１３４から読み出すことができなかった場合、当該ユーザの情報はリポジトリ管理サブシステム１０３に登録されていないので、新たにユーザ情報を作成する。このため、ユーザ識別子をユーザに割り当て、このユーザのユーザ情報をリポジトリ管理サブシステム１０３のユーザ情報Ｄ１３４に格納する（Ｓ２０７）。 As a result, when the user information exists, that is, when the user information can be read from the user information D134 of the repository management subsystem 103, the process proceeds to step S208. On the other hand, when the user information does not exist, that is, when the user information cannot be read from the user information D134 of the repository management subsystem 103, the user information is not registered in the repository management subsystem 103. Create user information. Therefore, a user identifier is assigned to the user, and the user information of this user is stored in the user information D134 of the repository management subsystem 103 (S207).

次に、ユーザインデックスファイルを更新する（Ｓ２０８）。ユーザインデックスファイルは、図９に示すように、ユーザに関連した（当該ユーザが作成、閲覧等のアクセスをした）コンテンツに含まれる語彙の出現度数を示す。具体的には、ステップＳ２０３で取得したユーザ行動履歴Ｄ１３７を参照して、ユーザがアクセスしたコンテンツを特定する。そして、特定されたコンテンツを形態素解析によって抽出した語彙の出現度数を、全コンテンツにわたって集計して重みを求める。なお、既に該当コンテンツが分析済みであり、コンテンツインデックスに該当コンテンツの情報が保持されている場合、それを参照しながら重みを算出することができる。 Next, the user index file is updated (S208). As shown in FIG. 9, the user index file indicates the frequency of appearance of the vocabulary included in the content related to the user (content that the user has accessed, for example by creating or viewing it). Specifically, the content accessed by the user is identified by referring to the user action history D137 acquired in step S203. The appearance frequencies of the vocabulary extracted from the identified content by morphological analysis are then totaled over all of that content to obtain the weights. If the content has already been analyzed and its information is held in the content index, the weights can be calculated by referring to that information.
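Step S208 can be pictured as a join between the user action history and the content index. The sketch below assumes that the content has already been analyzed, so its per-content weights are available as a dictionary, and that each content is counted once per user regardless of how many action types refer to it. Both are simplifying assumptions, and the function name and sample values are hypothetical.

```python
from collections import defaultdict


def build_user_index(action_history, content_index):
    """action_history: iterable of (user_id, action_type, content_id, frequency).
    content_index: dict mapping content_id -> {vocabulary: weight}.
    Returns a dict mapping user_id -> {vocabulary: total weight}."""
    accessed = defaultdict(set)
    for user_id, _action_type, content_id, _frequency in action_history:
        accessed[user_id].add(content_id)  # identify the content each user accessed

    user_index = {}
    for user_id, content_ids in accessed.items():
        totals = defaultdict(int)
        for content_id in content_ids:
            for term, weight in content_index.get(content_id, {}).items():
                totals[term] += weight  # total over all accessed content
        user_index[user_id] = dict(totals)
    return user_index


# Hypothetical sample data.
action_history = [
    ("Takatori", "create", "CONT-1", 1),
    ("Takatori", "update", "CONT-1", 2),
    ("Takatori", "view", "CONT-2", 1),
    ("Nozaki", "view", "CONT-2", 3),
]
content_index = {"CONT-1": {"BT": 6, "SOA": 2}, "CONT-2": {"BT": 4}}
user_index = build_user_index(action_history, content_index)
```

Reusing the content index avoids running morphological analysis again for content that was already analyzed.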

その後、ユーザ語彙モデル情報（図１０参照）を更新する（Ｓ２０９）。ユーザ語彙モデルは、前述したコンテンツ語彙モデルの更新と同様の方法によって更新することができる。すなわち、ステップＳ２０８において抽出された語彙の前後の所定の範囲内で登場する語彙を共起語彙として、共起語彙の組み合わせの出現頻度を算出する。そして、ユーザ行動履歴Ｄ１３７を参照することによって、ユーザ毎にアクセスしたコンテンツを特定し、当該特定されたコンテンツの共起語彙の組み合わせの出現頻度をユーザ毎に集計する。更新されたユーザ語彙モデルは、ステップＳ２１２において、リポジトリ管理サブシステム１０３のユーザ語彙モデルＤ１３６（図１１参照）に格納される。なお、既に該当コンテンツが分析済みであり、コンテンツ語彙モデルに該当コンテンツの情報が保持されている場合、それを参照しながら関連度を算出することができる。 Thereafter, the user vocabulary model information (see FIG. 10) is updated (S209). The user vocabulary model can be updated in the same way as the content vocabulary model described above. That is, vocabulary appearing within a predetermined range before and after each vocabulary item extracted in step S208 is treated as co-occurring vocabulary, and the appearance frequency of each combination of co-occurring vocabulary is calculated for each content. Then, by referring to the user action history D137, the content accessed by each user is identified, and the appearance frequencies of the co-occurring vocabulary combinations in the identified content are totaled for each user. The updated user vocabulary model is stored in the user vocabulary model D136 (see FIG. 11) of the repository management subsystem 103 in step S212. If the content has already been analyzed and its information is held in the content vocabulary model, the relevance can be calculated by referring to that information.
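The per-user aggregation of step S209 can be sketched in the same spirit, this time summing pair counts instead of single-vocabulary weights. The sketch assumes that per-content pair counts are already available from the content vocabulary model and that the set of content accessed by each user has already been derived from the user action history. The names and sample values are hypothetical.

```python
from collections import Counter


def build_user_vocab_model(accessed_contents, content_vocab_model):
    """accessed_contents: dict mapping user_id -> set of content IDs.
    content_vocab_model: dict mapping content_id -> {(vocab1, vocab2): relevance}.
    Returns a dict mapping user_id -> Counter of pair totals."""
    model = {}
    for user_id, content_ids in accessed_contents.items():
        totals = Counter()
        for content_id in content_ids:
            # Counter.update adds the counts of a mapping to the running totals.
            totals.update(content_vocab_model.get(content_id, {}))
        model[user_id] = totals
    return model


# Hypothetical sample data.
content_vocab_model = {
    "CONT-1": {("BT", "SOA"): 3},
    "CONT-2": {("BT", "SOA"): 2, ("BT", "search"): 1},
}
accessed_contents = {"Takatori": {"CONT-1", "CONT-2"}, "Nozaki": {"CONT-2"}}
user_vocab_model = build_user_vocab_model(accessed_contents, content_vocab_model)
```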

その後、パラメータｎとステップＳ２０１で生成したリストの要素数とを比較して、次の分析対象のユーザが存在するか否かを判定する（Ｓ２１０）。 Thereafter, the parameter n is compared with the number of elements in the list generated in step S201 to determine whether or not there is a next user to be analyzed (S210).

その結果、次（ｎ＋１番目）のユーザが存在する場合、パラメータｎに１を加算し（Ｓ２１１）、ステップＳ２０５に戻り、次のユーザを分析する。 As a result, when the next (n + 1) th user exists, 1 is added to the parameter n (S211), and the process returns to step S205 to analyze the next user.

一方、次（ｎ＋１番目）のユーザが存在しない場合、ステップＳ２０１で取得したリストの中の全てのユーザの分析が終了しているので、更新されたユーザインデックス、ユーザ語彙モデルをリポジトリ管理サブシステム１０３に転送する（Ｓ２１２）。 On the other hand, if the next (n + 1) th user does not exist, the analysis of all the users in the list acquired in step S201 is completed, and thus the updated user index and user vocabulary model are stored in the repository management subsystem 103. (S212).

リポジトリ管理サブシステム１０３は、これらのデータをユーザ分析サブシステム１０２から受信すると、受信したユーザインデックスをユーザインデックスＤ１３５に格納し、ユーザ語彙モデルをユーザ語彙モデルＤ１３６に格納する。 Upon receiving these data from the user analysis subsystem 102, the repository management subsystem 103 stores the received user index in the user index D135 and stores the user vocabulary model in the user vocabulary model D136.

図９は、本発明の実施の形態のユーザインデックスの例を説明する図である。 FIG. 9 is a diagram illustrating an example of a user index according to the embodiment of this invention.

ユーザ分析サブシステム１０２がステップＳ２０８において更新するユーザインデックスは、リポジトリ管理サブシステム１０３の記憶装置１２０にユーザインデックスＤ１３５として格納されており、語彙、ユーザＩＤ及び重みを含む。 The user index updated by the user analysis subsystem 102 in step S208 is stored as a user index D135 in the storage device 120 of the repository management subsystem 103, and includes a vocabulary, a user ID, and a weight.

語彙は、コンテンツにテキスト情報として含まれる単語又は複合語単位の文字列である。ユーザＩＤは、ユーザを一意に識別するための識別子であり、当該語彙が含まれるコンテンツにアクセスしたユーザを示す。 The vocabulary is a character string, in units of words or compound words, included in the content as text information. The user ID is an identifier that uniquely identifies a user and indicates the user who accessed the content that contains the vocabulary.

重みは、当該語彙がこのユーザがアクセスしたコンテンツに出現する度数を示す。なお、この重みを算出する際には、アクセスの種別によって重み付け係数を変えて出現回数を加算してもよい。例えば、「作成」には係数１を乗じ、「更新」には係数０．５を乗じることができる。 The weight indicates the frequency that the vocabulary appears in the content accessed by the user. When calculating this weight, the number of appearances may be added by changing the weighting coefficient depending on the type of access. For example, “creation” can be multiplied by a factor of 1 and “update” can be multiplied by a factor of 0.5.
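The action-dependent weighting can be illustrated with a small sketch. Only the coefficients 1 for "create" and 0.5 for "update" come from the text; the default coefficient for other action types, the function name, and the sample counts are assumptions made for illustration.

```python
# Coefficients per action type; "create" and "update" follow the example in the text.
ACTION_COEFFICIENTS = {"create": 1.0, "update": 0.5}


def weighted_term_count(occurrences, default_coefficient=1.0):
    """occurrences: iterable of (action_type, appearance_count) for one
    vocabulary item and one user. Returns the coefficient-weighted total."""
    return sum(
        ACTION_COEFFICIENTS.get(action_type, default_coefficient) * count
        for action_type, count in occurrences
    )


# Hypothetical case: a vocabulary item appears 4 times in content the user
# created and 6 times in content the user updated.
weight = weighted_term_count([("create", 4), ("update", 6)])
```

With these coefficients, vocabulary in content the user authored contributes twice as much to the user's weight as vocabulary in content the user merely updated.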

図９に示すユーザインデックスによると、ユーザ「高取」がアクセスした全てのコンテンツには語彙「ＢＴ」が１０回出現していることが分かる。 According to the user index shown in FIG. 9, the vocabulary “BT” appears a total of 10 times across all the content accessed by the user “Takatori”.

図１０は、本発明の実施の形態のユーザ語彙モデルの例を説明する図である。 FIG. 10 is a diagram illustrating an example of a user vocabulary model according to the embodiment of this invention.

ユーザ分析サブシステム１０２がステップＳ２０９において生成するユーザ語彙モデルは、ユーザがアクセスしたコンテンツに含まれる語彙の共起情報であり、リポジトリ管理サブシステム１０３の記憶装置１２０にユーザ語彙モデルＤ１３６として格納されており、語彙１、語彙２、ユーザＩＤ及び関連度を含む。 The user vocabulary model generated in step S209 by the user analysis subsystem 102 is vocabulary co-occurrence information included in the content accessed by the user, and is stored as the user vocabulary model D136 in the storage device 120 of the repository management subsystem 103. Vocabulary 1, vocabulary 2, user ID and relevance.

語彙１及び語彙２は、コンテンツ中の所定の範囲内で共起する語彙の組である。所定の範囲とは、前述したように、例えば、語彙１の前後４個以内に存在する語彙を共起する語彙と定めることができる。 Vocabulary 1 and vocabulary 2 are a pair of vocabulary items that co-occur within a predetermined range in the content. As described above, the predetermined range can be defined, for example, so that vocabulary items located within four items before or after vocabulary 1 are treated as co-occurring with it.

ユーザＩＤは、語彙１及び語彙２が共起しているコンテンツにアクセスしたユーザの識別子である。関連度は、語彙１及び語彙２の組が共起する度数であり、例えば、当該ユーザがアクセスしたコンテンツ内で語彙１及び語彙２の組が共起した回数を用いることができる。すなわち、関連度が大きい語彙の組は、その語彙の関連性が高く、そのユーザによる嗜好性が高い語彙の組であるといえる。 The user ID is the identifier of a user who has accessed content in which vocabulary 1 and vocabulary 2 co-occur. The relevance is the frequency with which the pair of vocabulary 1 and vocabulary 2 co-occurs; for example, the number of times the pair co-occurs in the content accessed by the user can be used. That is, a pair of vocabulary items with a high relevance value can be regarded as closely related and as strongly preferred by the user.

図１０に示すユーザ語彙モデルによると、ユーザ「高取」がアクセスした全てのコンテンツには、「ＢＴ」「ＳＯＡ」の組が共起語彙として５回出現していることが分かる。 According to the user vocabulary model shown in FIG. 10, the pair “BT” and “SOA” appears as co-occurring vocabulary a total of 5 times across all the content accessed by the user “Takatori”.

なお、本実施の形態では、ユーザインデックスＤ１３５及びユーザ語彙モデルＤ１３６の構築方法として、当該ユーザが作成、閲覧等のアクセスを行なったコンテンツ情報に基づいて構築する方法について説明したが、アクセスしたコンテンツの情報だけでなく、検索やメタタグ付与などの別の行動の対象となったコンテンツの情報を利用して構築することもできる。 In the present embodiment, the user index D135 and the user vocabulary model D136 are constructed from information on the content that the user has accessed, for example by creating or viewing it. However, they can also be constructed using not only information on the accessed content but also information on content that was the target of other actions, such as searching or meta tag assignment.

例えば、検索を実行した際に入力される検索キーワードや、ユーザがコンテンツに付与したメタタグは、ユーザの嗜好を表していると考えられる。このため、これらの情報に基づいて、ステップＳ２０８を実行することによって、ユーザインデックスを構築することができる。 For example, it is considered that a search keyword input when a search is executed and a meta tag given to the content by the user represent the user's preference. Therefore, a user index can be constructed by executing step S208 based on these pieces of information.

また、同様に、例えば、ユーザが一回の検索時に入力された語彙同士、又は、同じユーザがアクセスしたコンテンツに付随するメタタグ同士は関連性があると判定して、ステップＳ２０９を実行することによって、ユーザ語彙モデルを構築することもできる。 Similarly, for example, vocabulary items entered by the user in a single search, or meta tags attached to content accessed by the same user, can be judged to be related to each other, and a user vocabulary model can be constructed by executing step S209 on that basis.

これらの情報は、リポジトリ管理サブシステム１０３のユーザ行動履歴Ｄ１３７に格納されているため、必要な情報をここから適宜抽出し、該当処理を実行すればよい。 Since these pieces of information are stored in the user action history D137 of the repository management subsystem 103, necessary information may be appropriately extracted from here and the corresponding process may be executed.

＜リポジトリ管理サブシステム１０３＞ <Repository management subsystem 103>
図１１は、本発明の実施の形態のリポジトリ管理サブシステム１０３の構成を示すブロック図である。 FIG. 11 is a block diagram showing a configuration of the repository management subsystem 103 according to the embodiment of this invention.

リポジトリ管理サブシステム１０３は、前述したコンテンツ分析サブシステム１０１（図２）と、格納されているプログラムが異なる以外は同じ構成を有する。このため、前述したコンテンツ分析サブシステム１０１と同じ構成には同じ符号を付し、その説明は省略する。 The repository management subsystem 103 has the same configuration as the content analysis subsystem 101 (FIG. 2) described above except that the stored program is different. For this reason, the same components as those of the content analysis subsystem 101 described above are denoted by the same reference numerals, and description thereof is omitted.

すなわち、リポジトリ管理サブシステム１０３は、メモリ１１０、記憶装置１２０、ＣＰＵ１３０、出力装置１４０、入力装置１５０及び通信インターフェース１６０を備え、これらの各構成がバス１７０によって接続される計算機である。 That is, the repository management subsystem 103 is a computer that includes a memory 110, a storage device 120, a CPU 130, an output device 140, an input device 150, and a communication interface 160, all connected by a bus 170.

メモリ１１０は、ＣＰＵ１３０によって実行されるプログラムを格納する。具体的には、システム制御プログラムＰ１０及びリポジトリ管理プログラムＰ１３がメモリ１１０に格納される。リポジトリ管理プログラムＰ１３は、リポジトリ管理サブシステム１０３に格納される情報を管理するプログラムである。 The memory 110 stores a program executed by the CPU 130. Specifically, the system control program P10 and the repository management program P13 are stored in the memory 110. The repository management program P13 is a program for managing information stored in the repository management subsystem 103.

記憶装置１２０には、各種プログラムＤ１００が格納される。この各種プログラムＤ１００には、システム制御プログラムＰ１０及びリポジトリ管理プログラムＰ１３が含まれており、ＣＰＵ１３０によって実行される際にメモリ１１０にロードされる。 The storage device 120 stores various programs D100. The various programs D100 include a system control program P10 and a repository management program P13, which are loaded into the memory 110 when executed by the CPU 130.

また、記憶装置１２０には、コンテンツ情報Ｄ１３０、コンテンツインデックスＤ１３１、コンテンツ語彙モデルＤ１３２、コンテンツ付加情報Ｄ１３３、ユーザ情報Ｄ１３４、ユーザインデックスＤ１３５、ユーザ語彙モデルＤ１３６及びユーザ行動履歴Ｄ１３７が格納される。 The storage device 120 also stores content information D130, content index D131, content vocabulary model D132, content additional information D133, user information D134, user index D135, user vocabulary model D136, and user behavior history D137.

コンテンツ情報Ｄ１３０は、コンテンツの識別子と、コンテンツの実体を含む。コンテンツインデックスＤ１３１は、図４に示すように、コンテンツに出現する語彙の度数を示す。コンテンツ語彙モデルＤ１３２は、図５に示すように、コンテンツに出現する共起語彙の度数を示す。コンテンツ付加情報Ｄ１３３は、コンテンツに付加される情報であり、コンテンツの属性や、メタタグを含む。 The content information D130 includes a content identifier and a content entity. The content index D131 indicates the frequency of the vocabulary that appears in the content, as shown in FIG. The content vocabulary model D132 indicates the frequency of co-occurrence vocabulary appearing in the content, as shown in FIG. The content additional information D133 is information added to the content, and includes content attributes and meta tags.

ユーザ情報Ｄ１３４は、ユーザの識別子、氏名、所属、権限レベル等のユーザに関する情報を含む。ユーザインデックスＤ１３５は、図９に示すように、ユーザがアクセスしたコンテンツに含まれる語彙の出現度数を示す。ユーザ語彙モデルＤ１３６は、図１０に示すように、ユーザがアクセスしたコンテンツに含まれる共起語彙及びその度数を示す。ユーザ行動履歴Ｄ１３７は、図６に示すように、ユーザがコンテンツにアクセスした態様、頻度を示す。 The user information D134 includes information about the user such as a user identifier, name, affiliation, and authority level. As shown in FIG. 9, the user index D135 indicates the frequency of appearance of the vocabulary included in the content accessed by the user. As shown in FIG. 10, the user vocabulary model D136 indicates the co-occurrence vocabulary included in the content accessed by the user and its frequency. As shown in FIG. 6, the user action history D137 indicates the mode and frequency with which the user has accessed the content.

リポジトリ管理サブシステム１０３は、リポジトリ管理プログラムＰ１３を実行することによって、ネットワーク１０６を介して転送された分析情報を記憶装置１２０に格納し、ネットワーク１０６を介して送信された情報問合せを受信し、受信した問合せ内容に応じた情報を記憶装置１２０から取得し、ネットワーク１０６を介して要求元の情報検索クライアント１０５に返信する。次に、この処理の詳細を説明する。 By executing the repository management program P13, the repository management subsystem 103 stores analysis information transferred via the network 106 in the storage device 120, receives information queries sent via the network 106, acquires the information corresponding to the received query from the storage device 120, and returns it to the requesting information search client 105 via the network 106. Next, details of this processing will be described.

図１２は、本発明の実施の形態のリポジトリ管理サブシステム１０３によって実行される処理のフローチャートである。 FIG. 12 is a flowchart of processing executed by the repository management subsystem 103 according to the embodiment of this invention.

まず、リポジトリ管理サブシステム１０３は、情報検索クライアント１０５からリクエストを受信すると（Ｓ３０１）、受信したリクエストを解析する（Ｓ３０２）。 First, when the repository management subsystem 103 receives a request from the information search client 105 (S301), it analyzes the received request (S302).

受信したリクエストが情報参照要求である場合、要求された問合せ内容に応じた情報を記憶装置１２０から読み出して（Ｓ３０３）、読み出した情報を要求元の情報検索クライアント１０５に送信する（Ｓ３０４）。その後、ステップＳ３０１に戻り、他のリクエストの受信を待つ。 If the received request is an information reference request, information corresponding to the requested query is read from the storage device 120 (S303), and the read information is transmitted to the requesting information search client 105 (S304). The process then returns to step S301 and waits for another request.

一方、受信したリクエストがコンテンツ情報登録要求である場合、要求された登録内容に応じて、情報を記憶装置１２０に格納する。 On the other hand, when the received request is a content information registration request, the information is stored in the storage device 120 according to the requested registration content.

具体的には、コンテンツ情報の登録要求であれば、受信したコンテンツ情報をコンテンツ情報Ｄ１３０に新たに登録する（Ｓ３０５）。コンテンツインデックスの更新要求であれば、受信したコンテンツインデックスによってコンテンツインデックスＤ１３１を更新する（Ｓ３０６）。コンテンツ語彙モデルの更新要求であれば、受信したコンテンツ語彙モデルによってコンテンツ語彙モデルＤ１３２を更新する（Ｓ３０７）。コンテンツ付加情報の登録要求であれば、受信したコンテンツ付加情報をコンテンツ付加情報Ｄ１３３に追加登録する（Ｓ３０８）。ユーザ行動履歴の登録要求であれば、受信したユーザ行動履歴をユーザ行動履歴Ｄ１３７に追加登録する（Ｓ３０９）。ユーザ情報の登録要求であれば、受信した新規登録ユーザの情報をユーザ情報Ｄ１３４に追加するように登録する（Ｓ３１０）。ユーザインデックスの登録要求であれば、受信したユーザインデックスによってユーザインデックスＤ１３５を更新する（Ｓ３１１）。ユーザ語彙モデルの更新要求であれば、受信したユーザ語彙モデルによってユーザ語彙モデルＤ１３６を更新する（Ｓ３１２）。ステップＳ３０５からステップＳ３１２の処理の終了後、ステップＳ３０１に戻り、他のリクエストの受信を待つ。 Specifically, if it is a content information registration request, the received content information is newly registered in the content information D130 (S305). If it is a content index update request, the content index D131 is updated with the received content index (S306). If the request is for updating the content vocabulary model, the content vocabulary model D132 is updated with the received content vocabulary model (S307). If it is a content additional information registration request, the received content additional information is additionally registered in the content additional information D133 (S308). If it is a user action history registration request, the received user action history is additionally registered in the user action history D137 (S309). If it is a user information registration request, registration is performed such that the received information of the newly registered user is added to the user information D134 (S310). If it is a user index registration request, the user index D135 is updated with the received user index (S311). If it is a user vocabulary model update request, the user vocabulary model D136 is updated with the received user vocabulary model (S312). After the processing from step S305 to step S312 ends, the process returns to step S301 and waits for reception of another request.

一方、受信したリクエストがサブシステム停止コマンドである場合、リポジトリ管理サブシステム１０３の動作を終了する。 On the other hand, when the received request is a subsystem stop command, the operation of the repository management subsystem 103 is terminated.
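The request handling of FIG. 12 (S301 to S312) is essentially a dispatch loop over request types. The sketch below is a simplified illustration under several assumptions: the stores D130 to D137 are modeled as in-memory dictionaries, requests are plain dictionaries with hypothetical `type`, `store`, `key`, and `value` fields, and every registration or update is reduced to a keyed overwrite, which glosses over the difference between additional registration and update in the text.

```python
STORES = ("D130", "D131", "D132", "D133", "D134", "D135", "D136", "D137")


def serve(requests):
    """Process requests in order until a stop command arrives.
    Returns the final stores and the list of responses to reference requests."""
    stores = {name: {} for name in STORES}  # stand-in for the storage device 120
    responses = []
    for request in requests:                # S301: receive a request
        kind = request["type"]              # S302: analyze the request
        if kind == "reference":             # S303-S304: read and send back
            responses.append(stores[request["store"]].get(request["key"]))
        elif kind == "register":            # S305-S312: store by target
            stores[request["store"]][request["key"]] = request["value"]
        elif kind == "stop":                # subsystem stop command
            break
        else:
            raise ValueError(f"unknown request type: {kind!r}")
    return stores, responses


# Hypothetical request sequence.
stores, responses = serve([
    {"type": "register", "store": "D131", "key": "CONT-1", "value": {"BT": 2}},
    {"type": "reference", "store": "D131", "key": "CONT-1"},
    {"type": "stop"},
    {"type": "register", "store": "D134", "key": "u1", "value": "not processed"},
])
```

A real subsystem would receive requests over the network 106 rather than from a list; the list keeps the sketch self-contained.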

＜情報検索サーバ１０４＞ <Information search server 104>
図１３は、本発明の実施の形態の情報検索サーバ１０４の構成を示すブロック図である。 FIG. 13 is a block diagram showing a configuration of the information search server 104 according to the embodiment of this invention.

情報検索サーバ１０４は、前述したコンテンツ分析サブシステム１０１（図２）と、格納されているプログラムが異なる以外は同じ構成を有する。このため、前述したコンテンツ分析サブシステム１０１と同じ構成には同じ符号を付し、その説明は省略する。 The information search server 104 has the same configuration as the content analysis subsystem 101 (FIG. 2) described above except that the stored program is different. For this reason, the same components as those of the content analysis subsystem 101 described above are denoted by the same reference numerals, and description thereof is omitted.

すなわち、情報検索サーバ１０４は、メモリ１１０、記憶装置１２０、ＣＰＵ１３０、出力装置１４０、入力装置１５０及び通信インターフェース１６０を備え、これらの各構成がバス１７０によって接続される計算機である。 That is, the information search server 104 is a computer that includes a memory 110, a storage device 120, a CPU 130, an output device 140, an input device 150, and a communication interface 160, all of which are connected by a bus 170.

メモリ１１０は、ＣＰＵ１３０によって実行されるプログラムを格納する。具体的には、システム制御プログラムＰ１０及び情報検索制御プログラムＰ１４がメモリ１１０に格納される。 The memory 110 stores a program executed by the CPU 130. Specifically, the system control program P10 and the information search control program P14 are stored in the memory 110.

情報検索制御プログラムＰ１４は、情報検索クライアント１０５から送信された検索要求に基づいて、コンテンツを検索するプログラムであり、サブプログラムとして、コンテンツ検索プログラムＰ１４１、検索結果取得プログラムＰ１４２、コンテンツ取得プログラムＰ１４３、ユーザ行動履歴登録プログラムＰ１４４、及びユーザ認証プログラムＰ１４５を含む。 The information search control program P14 is a program for searching for content based on a search request transmitted from the information search client 105. As subprograms, a content search program P141, a search result acquisition program P142, a content acquisition program P143, a user An action history registration program P144 and a user authentication program P145 are included.

コンテンツ検索プログラムＰ１４１は、コンテンツの検索を実行するためのプログラムで、検索条件式解析プログラムＰ１４１１、検索クエリ行列生成プログラムＰ１４１２、コンテンツ行列生成プログラムＰ１４１３、類似度算出プログラムＰ１４１４、及びコンテンツ抽出プログラムＰ１４１５を含む。 The content search program P141 is a program for executing a content search, and includes a search condition expression analysis program P1411, a search query matrix generation program P1412, a content matrix generation program P1413, a similarity calculation program P1414, and a content extraction program P1415.

検索条件式解析プログラムＰ１４１１は、ユーザによって入力された検索条件式を解析する（図１４のステップＳ４１１）。検索クエリ行列生成プログラムＰ１４１２は、検索条件式の解析結果に従って、検索クエリ行列を生成する（図１４のステップＳ４１２）。 The search condition expression analysis program P1411 analyzes the search condition expression input by the user (step S411 in FIG. 14). The search query matrix generation program P1412 generates a search query matrix according to the analysis result of the search condition formula (step S412 in FIG. 14).

コンテンツ行列生成プログラムＰ１４１３は、検索クエリ行列と類似度が比較されるコンテンツ行列を生成する（図１７のステップＳ４１３４）。類似度算出プログラムＰ１４１４は、検索クエリ行列とコンテンツ行列との類似度を算出する（図１７のステップＳ４１３５）。 The content matrix generation program P1413 generates the content matrix whose similarity to the search query matrix is evaluated (step S4134 in FIG. 17). The similarity calculation program P1414 calculates the similarity between the search query matrix and the content matrix (step S4135 in FIG. 17).

コンテンツ抽出プログラムＰ１４１５は、コンテンツインデックスＤ１３１を検索し、検索クエリ行列のラベルの語彙が含まれているコンテンツのリストを抽出する（図１７のステップＳ４１３２）。 The content extraction program P1415 searches the content index D131, and extracts a list of content that includes the vocabulary of the label of the search query matrix (step S4132 in FIG. 17).

検索結果取得プログラムＰ１４２は、情報検索クライアント１０５からの問合せ内容に従って検索結果データＤ１４０を要求元に転送する。 The search result acquisition program P142 transfers the search result data D140 to the request source according to the inquiry content from the information search client 105.

コンテンツ取得プログラムＰ１４３は、情報検索クライアント１０５からの要求に従って、リポジトリ管理サブシステム１０３からコンテンツを取得する。 The content acquisition program P143 acquires content from the repository management subsystem 103 in accordance with a request from the information search client 105.

ユーザ行動履歴登録プログラムＰ１４４は、情報検索クライアント１０５からの要求に従ったコンテンツへのアクセス履歴をユーザ行動履歴Ｄ１３７に登録する。 The user action history registration program P144 registers the access history to the content according to the request from the information search client 105 in the user action history D137.

ユーザ認証プログラムＰ１４５は、情報検索クライアント１０５を操作するユーザを認証する。 The user authentication program P145 authenticates a user who operates the information search client 105.

記憶装置１２０には、各種プログラムＤ１００が格納される。この各種プログラムＤ１００には、システム制御プログラムＰ１０及び情報検索制御プログラムＰ１４が含まれており、ＣＰＵ１３０によって実行される際にメモリ１１０にロードされる。 The storage device 120 stores various programs D100. The various programs D100 include a system control program P10 and an information retrieval control program P14, and are loaded into the memory 110 when executed by the CPU 130.

また、記憶装置１２０には、検索結果データＤ１４０が格納される。検索結果データＤ１４０は、コンテンツ検索プログラムＰ１４１によって検索された検索結果であり、検索結果取得プログラムＰ１４２によって、情報検索クライアント１０５に転送される。 The storage device 120 stores search result data D140. The search result data D140 is a search result searched by the content search program P141, and is transferred to the information search client 105 by the search result acquisition program P142.

情報検索サーバ１０４は、情報検索制御プログラムＰ１４を実行することによって、情報検索クライアント１０５から送信された検索要求に基づいて、コンテンツを検索し、検索結果を要求元の情報検索クライアント１０５に返信する。次に、この処理の詳細を説明する。 The information search server 104 searches the content based on the search request transmitted from the information search client 105 by executing the information search control program P14, and returns the search result to the information search client 105 as the request source. Next, details of this processing will be described.

図１４は、本発明の実施の形態の情報検索サーバ１０４によって実行される処理のフローチャートである。 FIG. 14 is a flowchart of processing executed by the information search server 104 according to the embodiment of this invention.

まず、情報検索サーバ１０４は、情報検索クライアント１０５からリクエストを受信すると（Ｓ４０１）、受信したリクエストを解析する（Ｓ４０２）。 First, upon receiving a request from the information search client 105 (S401), the information search server 104 analyzes the received request (S402).

受信したリクエストがユーザ認証要求である場合、情報検索クライアント１０５に入力されたユーザＩＤに基づいてユーザを認証し、認証結果を要求元の情報検索クライアント１０５に返信する（Ｓ４０３）。 If the received request is a user authentication request, the user is authenticated based on the user ID input to the information search client 105, and the authentication result is returned to the requesting information search client 105 (S403).

一方、受信したリクエストが検索結果問合せである場合、検索結果取得プログラムＰ１４２を実行し、情報検索クライアント１０５からの問合せ内容に従って検索結果データＤ１４０を記憶装置１２０から読み出して、検索結果データＤ１４０を要求元情報検索クライアント１０５に転送する（Ｓ４２１）。検索結果データＤ１４０には、検索されたコンテンツのリストが含まれている。 On the other hand, if the received request is a search result query, the search result acquisition program P142 is executed, the search result data D140 is read from the storage device 120 according to the query content from the information search client 105, and the search result data D140 is transferred to the requesting information search client 105 (S421). The search result data D140 includes a list of the retrieved contents.

一方、受信したリクエストがコンテンツ転送要求である場合、コンテンツ取得プログラムＰ１４３を実行し、情報検索クライアント１０５からの要求をリポジトリ管理サブシステム１０３に転送し、要求されたコンテンツをリポジトリ管理サブシステム１０３から取得する（Ｓ４３１）。そして、ユーザ行動履歴登録プログラムＰ１４４を実行し、コンテンツ転送要求の原因（例えば、閲覧）をリポジトリ管理サブシステム１０３のユーザ行動履歴Ｄ１３７に追加登録する（Ｓ４３２）。その後、要求されたコンテンツを要求元の情報検索クライアント１０５に転送する（Ｓ４３３）。 On the other hand, if the received request is a content transfer request, the content acquisition program P143 is executed, the request from the information search client 105 is forwarded to the repository management subsystem 103, and the requested content is acquired from the repository management subsystem 103 (S431). Then, the user action history registration program P144 is executed, and the cause of the content transfer request (for example, browsing) is additionally registered in the user action history D137 of the repository management subsystem 103 (S432). Thereafter, the requested content is transferred to the requesting information search client 105 (S433).

一方、受信したリクエストがコンテンツ検索要求である場合、検索条件式解析プログラムＰ１４１１を実行し、検索条件式を解析する（Ｓ４１１）。具体的には、検索者のユーザＩＤを取得し、検索者が入力した文章を、形態素解析を用いて語彙に分割する。そして、語彙の出現頻度を算出し、検索条件式をベクトル化する。ベクトル化された検索条件式は、情報検索サーバ１０４のメモリ１１０のワークエリアに格納される。次に、検索クエリ行列生成プログラムＰ１４１２を実行し、検索クエリ行列を生成する（Ｓ４１２）。検索クエリ行列生成処理の詳細は、図１５を用いて後述する。本発明の実施の形態では、ステップＳ４１１によって検索条件式を解析し、検索クエリ行列を生成して、検索を実行するので、検索条件式が自然文で入力されても、適切な検索をすることができる。 On the other hand, if the received request is a content search request, the search condition expression analysis program P1411 is executed to analyze the search condition expression (S411). Specifically, the user ID of the searcher is acquired, and the text input by the searcher is divided into vocabulary terms using morphological analysis. Then, the appearance frequency of each term is calculated, and the search condition expression is vectorized. The vectorized search condition expression is stored in the work area of the memory 110 of the information search server 104. Next, the search query matrix generation program P1412 is executed to generate a search query matrix (S412). Details of the search query matrix generation processing will be described later with reference to FIG. 15. In the embodiment of the present invention, the search condition expression is analyzed in step S411, a search query matrix is generated, and the search is executed. Therefore, an appropriate search can be performed even if the search condition expression is input as a natural-language sentence.
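
The vectorization in step S411 can be illustrated with a minimal Python sketch. The function name `vectorize_query` and the whitespace tokenizer are assumptions made for illustration only; the embodiment uses morphological analysis, for which any tokenizer callable could be substituted.

```python
from collections import Counter

def vectorize_query(text, tokenize=str.split):
    """Return {term: frequency} for the search text (hypothetical helper).

    `tokenize` stands in for morphological analysis; whitespace splitting
    is only a placeholder that works for space-delimited text.
    """
    terms = tokenize(text)
    return dict(Counter(terms))

# Each distinct term becomes one component of the query vector.
query_vector = vectorize_query("system construction method for system integration")
```

The resulting term frequencies play the role of the weight values that step S41201 later reads from the work area.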

その後、検索クエリ行列に基づいて、コンテンツ情報Ｄ１３０中の各コンテンツの類似度を算出し（Ｓ４１３）、コンテンツ情報Ｄ１３０中の各コンテンツを類似度の降順にソートする（Ｓ４１４）。類似コンテンツ抽出処理の詳細は、図１７を用いて後述する。 Then, based on the search query matrix, the similarity of each content in the content information D130 is calculated (S413), and each content in the content information D130 is sorted in descending order of similarity (S414). Details of the similar content extraction processing will be described later with reference to FIG.

その後、ユーザー行動履歴登録プログラムＰ１４４を実行し、ユーザ行動履歴を追加登録する（Ｓ４１５）。そして、ソートした検索結果データ集合を要求元の情報検索クライアント１０５に返信する（Ｓ４１６）。 Thereafter, the user behavior history registration program P144 is executed to additionally register the user behavior history (S415). The sorted search result data set is returned to the requesting information search client 105 (S416).

一方、受信したリクエストがサーバー停止コマンドである場合、情報検索サーバ１０４の動作を終了する。 On the other hand, when the received request is a server stop command, the operation of the information search server 104 is terminated.

図１５Ａ及び図１５Ｂは、本発明の実施の形態の情報検索クエリ行列生成処理（Ｓ４１２）の詳細な手順を示すフローチャートである。 15A and 15B are flowcharts illustrating a detailed procedure of the information search query matrix generation process (S412) according to the embodiment of this invention.

まず、ステップＳ４１１において検索条件式を解析して得られた検索キーワード及びその重み値をワークエリアから読み込む（Ｓ４１２０１）。 First, the search keyword and its weight value obtained by analyzing the search condition formula in step S411 are read from the work area (S41201).

そして、検索キーワード数に１を加算した値の次数の正方零行列を作成し、作成した行列を検索クエリ行列に初期設定する（Ｓ４１２０２）。 Then, a square zero matrix of the order of a value obtained by adding 1 to the number of search keywords is created, and the created matrix is initially set as a search query matrix (S41202).

そして、作成した検索クエリ行列の行のラベルに、抽象ノード及び検索キーワードを設定し、検索クエリ行列の列のラベルにも、抽象ノード及び検索キーワードを設定する（Ｓ４１２０３）。すなわち、検索クエリ行列の行のラベルと列のラベルとには、同じ抽象ノード及び検索キーワードが設定される。なお、抽象ノードは第１行目及び第１列目のラベルに設定される。 Then, an abstract node and a search keyword are set in the row label of the created search query matrix, and an abstract node and a search keyword are also set in the column label of the search query matrix (S41203). That is, the same abstract node and search keyword are set for the row label and the column label of the search query matrix. The abstract node is set to the label in the first row and the first column.

その後、処理される検索キーワードを制御するパラメータｎを１に初期設定する（Ｓ４１２０４）。なお、ｎの最大値は、ステップＳ４１２０１において取得した検索キーワードの数である。 Thereafter, a parameter n for controlling the search keyword to be processed is initialized to 1 (S41204). Note that the maximum value of n is the number of search keywords acquired in step S41201.

そして、ｎ番目の検索キーワードをワークエリアから読み込む（Ｓ４１２０５）。 Then, the nth search keyword is read from the work area (S41205).

抽象ノード及びｎ番目の検索キーワードに対応する検索クエリ行列の値＜１，ｎ＞に、ステップＳ４１２０１において取得した重み値を設定する。同様に、ｎ番目の検索キーワード及び抽象ノードに対応する検索クエリ行列の値＜ｎ，１＞にも同じ重み値を設定する（Ｓ４１２０６）。 The weight value acquired in step S41201 is set to the search query matrix value <1, n> corresponding to the abstract node and the nth search keyword. Similarly, the same weight value is set for the search query matrix value <n, 1> corresponding to the nth search keyword and the abstract node (S41206).

その後、ｎ番目の検索キーワードに関連する語彙のリストをユーザ語彙モデルＤ１３６から抽出する（Ｓ４１２０７）。具体的には、ｎ番目の検索キーワードとユーザＩＤとに基づいてユーザ語彙モデルＤ１３６の語彙１を検索し、対応する語彙２の語彙集合を取得し、関連語彙集合に設定する。同様に、ｎ番目の検索キーワードとユーザＩＤに基づいてユーザ語彙モデルＤ１３６の語彙２を検索し、対応する語彙１の語彙集合を取得し、関連語彙集合に設定する。取得した関連語彙はメモリ１１０のワークエリアに格納される。 Thereafter, a list of vocabulary related to the nth search keyword is extracted from the user vocabulary model D136 (S41207). Specifically, the vocabulary 1 of the user vocabulary model D136 is searched based on the nth search keyword and the user ID, the vocabulary set of the corresponding vocabulary 2 is acquired, and set to the related vocabulary set. Similarly, the vocabulary 2 of the user vocabulary model D136 is searched based on the nth search keyword and the user ID, the vocabulary set of the corresponding vocabulary 1 is acquired, and set to the related vocabulary set. The acquired related vocabulary is stored in the work area of the memory 110.

その後、処理される関連語彙を制御するパラメータｋを１に初期設定する（Ｓ４１２０８）。なお、パラメータｋの最大値は、ステップＳ４１２０７において取得した関連語彙集合の要素数である。 Thereafter, the parameter k for controlling the related vocabulary to be processed is initialized to 1 (S41208). Note that the maximum value of the parameter k is the number of elements of the related vocabulary set acquired in step S41207.

その後、ｋ番目の関連語彙をワークエリアから読み込む（Ｓ４１２０９）。そして、検索クエリ行列の行及び列のラベルを参照して、読み込んだｋ番目の関連語彙が検索クエリ行列に存在するか否かを判定する（Ｓ４１２１０）。その結果、ｋ番目の関連語彙が検索クエリ行列のラベルに存在する場合、ｋ番目の関連語彙を検索クエリ行列に追加する必要がないので、ステップＳ４１２１３に進む。 Thereafter, the kth related vocabulary is read from the work area (S41209). Then, with reference to the row and column labels of the search query matrix, it is determined whether or not the read k-th related vocabulary exists in the search query matrix (S41210). As a result, if the k-th related vocabulary exists in the label of the search query matrix, it is not necessary to add the k-th related vocabulary to the search query matrix, so the process proceeds to step S41213.

一方、ｋ番目の関連語彙が検索クエリ行列のラベルに存在しない場合、検索クエリ行列を１行及び１列拡張し、拡張された行及び列の要素を０に設定し（Ｓ４１２１１）、拡張された行及び列のラベルにｋ番目の関連語彙を設定する（Ｓ４１２１２）。 On the other hand, if the k-th related vocabulary does not exist in the search query matrix label, the search query matrix is expanded by 1 row and 1 column, and the expanded row and column elements are set to 0 (S41211). The k-th related vocabulary is set in the row and column labels (S41212).

その後、ｋ＋１番目の関連語彙が存在するか否かを判定する（Ｓ４１２１３）。その結果、ｋ＋１番目の関連語彙が存在する場合、パラメータｋに１を加算し（Ｓ４１２１５）、ステップＳ４１２０９に戻り、次の関連語彙を処理する。 Thereafter, it is determined whether or not the (k + 1) th related vocabulary exists (S41213). As a result, when the k + 1-th related vocabulary exists, 1 is added to the parameter k (S41215), and the process returns to step S41209 to process the next related vocabulary.

一方、ｋ＋１番目の関連語彙が存在しない場合、この検索キーワードに関する関連語彙の処理は終了したので、ｎ＋１番目の検索キーワードが存在するか否かを判定する（Ｓ４１２１４）。その結果、ｎ＋１番目の検索キーワードが存在する場合、パラメータｎに１を加算し（Ｓ４１２１６）、ステップＳ４１２０５に戻り、次の検索キーワードを処理する。 On the other hand, if the k + 1-th related vocabulary does not exist, the processing of the related vocabulary related to this search keyword is completed, so it is determined whether or not the n + 1-th search keyword exists (S41214). As a result, when the (n + 1) th search keyword exists, 1 is added to the parameter n (S41216), and the process returns to step S41205 to process the next search keyword.

一方、ｎ＋１番目の検索キーワードが存在しない場合、検索に必要な全ての関連語彙が含まれる検索クエリ行列の生成が完了したので、ステップＳ４１２１７に進む。なお、生成された検索クエリ行列は、メモリ１１０のワークエリアに格納されている。 On the other hand, if the (n + 1) th search keyword does not exist, the generation of the search query matrix including all the related vocabulary necessary for the search is completed, and the process proceeds to step S41217. The generated search query matrix is stored in the work area of the memory 110.

ステップＳ４１２１７では、検索クエリ行列の行を示すパラメータｉを２に初期設定した後（Ｓ４１２１７）、検索クエリ行列のｉ行のラベル（語彙）をワークエリアから読み込む（Ｓ４１２１８）。その後、検索クエリ行列の列を示すパラメータｊを２に初期設定した後（Ｓ４１２１９）、検索クエリ行列のｊ列のラベル（語彙）を読み込む（Ｓ４１２２０）。 In step S41217, after the parameter i indicating the row of the search query matrix is initialized to 2 (S41217), the label (vocabulary) of the i row of the search query matrix is read from the work area (S41218). Thereafter, after initializing the parameter j indicating the column of the search query matrix to 2 (S41219), the label (vocabulary) of the j column of the search query matrix is read (S41220).

その後、検索クエリ行列のｉ行のラベルと、ｊ列のラベルとの関連度をユーザ語彙モデルＤ１３６から取得し（Ｓ４１２２１）、取得した関連度を検索クエリ行列の＜ｉ，ｊ＞に設定する。ただし、ステップＳ４１２２１において関連度を取得できなかった場合は何も実行せず、次のステップに移る（Ｓ４１２２２）。 Thereafter, the degree of association between the label of row i and the label of column j of the search query matrix is obtained from the user vocabulary model D136 (S41221), and the obtained degree of association is set in element <i, j> of the search query matrix. However, if no degree of association could be obtained in step S41221, nothing is executed and the process proceeds to the next step (S41222).

その後、ｊ＋１番目のラベル（語彙）が存在するか否かを判定する（Ｓ４１２２３）。その結果、ｊ＋１番目の語彙が存在する場合、ｊに１を加算し（Ｓ４１２２６）、ステップＳ４１２２０に戻り、次の列を処理する。一方、ｊ＋１番目の語彙が存在しない場合、ｉ行の処理は終了したので、次の行に移るため、ｉ＋１番目のラベル（語彙）が存在するか否かを判定する（Ｓ４１２２４）。 Thereafter, it is determined whether or not the (j+1)th label (vocabulary) exists (S41223). If the (j+1)th vocabulary exists, 1 is added to j (S41226), and the process returns to step S41220 to process the next column. On the other hand, if the (j+1)th vocabulary does not exist, the processing for row i is complete, so in order to move to the next row, it is determined whether or not the (i+1)th label (vocabulary) exists (S41224).

その結果、ｉ＋１番目の語彙が存在する場合、ｉに１を加算し（Ｓ４１２２７）、ステップＳ４１２１８に戻り、次のｉ＋１行を処理する。一方、ｉ＋１番目の語彙が存在しない場合、検索クエリ行列への語彙間の関連度の登録が終了したので、生成した検索クエリ行列をワークエリアに格納し（Ｓ４１２２５）、情報検索クエリ行列生成処理（Ｓ４１２）を終了し、ステップＳ４１３に進む。 If the (i+1)th vocabulary exists, 1 is added to i (S41227), and the process returns to step S41218 to process row i+1. On the other hand, if the (i+1)th vocabulary does not exist, registration of the degrees of association between vocabulary terms in the search query matrix is complete, so the generated search query matrix is stored in the work area (S41225), the information search query matrix generation process (S412) ends, and the process proceeds to step S413.
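
The search query matrix generation of FIGS. 15A and 15B can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the label `"<abstract>"`, the function name `build_query_matrix`, and the representation of the user vocabulary model D136 as a dictionary keyed by (vocabulary 1, vocabulary 2) pairs for a single user are all hypothetical. Indices are zero-based, so the abstract node occupies row 0 and column 0.

```python
ABSTRACT = "<abstract>"  # hypothetical label for the abstract node

def build_query_matrix(keyword_weights, vocab_model):
    """keyword_weights: {keyword: weight}; vocab_model: {(term1, term2): relevance}."""
    labels = [ABSTRACT] + list(keyword_weights)
    size = len(labels)
    matrix = [[0.0] * size for _ in range(size)]  # square zero matrix (S41202)

    # The abstract-node row and column carry the keyword weights (S41206).
    for keyword, weight in keyword_weights.items():
        idx = labels.index(keyword)
        matrix[0][idx] = weight
        matrix[idx][0] = weight

    # Add one row and one column per related term not yet labelled (S41207-S41212).
    for keyword in keyword_weights:
        for term1, term2 in vocab_model:
            if keyword == term1:
                related = term2
            elif keyword == term2:
                related = term1
            else:
                continue
            if related not in labels:
                labels.append(related)
                for row in matrix:
                    row.append(0.0)
                matrix.append([0.0] * len(labels))

    # Register term-to-term relevance where the model has an entry (S41217-S41224).
    for i in range(1, len(labels)):
        for j in range(1, len(labels)):
            pair = (labels[i], labels[j])
            rel = vocab_model.get(pair, vocab_model.get(pair[::-1]))
            if rel is not None:
                matrix[i][j] = rel
    return labels, matrix

labels, matrix = build_query_matrix(
    {"BT": 2.0, "system": 1.0},
    {("BT", "SOA"): 0.5, ("system", "BT"): 0.3},
)
```

In this example the related term "SOA" is not a search keyword, so the matrix grows from order 3 to order 4 before the relevance values are filled in.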

図１６は、本発明の実施の形態のステップＳ４１１及びＳ４１２において、検索条件式を解析し検索クエリ行列を生成する例を説明する図である。 FIG. 16 is a diagram illustrating an example of generating a search query matrix by analyzing a search condition formula in steps S411 and S412 according to the embodiment of the present invention.

ステップＳ４１１では、検索者が文章「ＢＴによる効果的なシステム構築手法について・・・」１７０１を入力した場合を考える。この文章１７０１を、形態素解析を用いて語彙に分割する。そして、語彙の文章１７０１中の出現頻度を算出し（１７０２）、検索条件式をベクトル化する。この生成されたベクトル１７０３は検索クエリ行列１７０４の１列目に含まれている。 For step S411, consider a case in which the searcher inputs the sentence "About an effective system construction method using BT ..." 1701. This sentence 1701 is divided into vocabulary terms using morphological analysis. Then, the appearance frequency of each term in the sentence 1701 is calculated (1702), and the search condition expression is vectorized. The generated vector 1703 is included in the first column of the search query matrix 1704.

次に、ステップＳ４１２では、ベクトル１７０３が１行目及び１列目に設定された検索クエリ行列１７０４を生成する。検索クエリ行列１７０４の２行目及び２列目以後の要素は、ユーザ語彙モデルＤ１３６に記録された語彙間が共起する頻度が登録される。 Next, in step S412, a search query matrix 1704 in which the vector 1703 is set in the first row and the first column is generated. The elements from the second row and second column onward of the search query matrix 1704 hold the co-occurrence frequencies between vocabulary terms recorded in the user vocabulary model D136.

なお、本実施の形態では、検索者が、検索語として文章１７０１を入力した場合の検索方法について説明したが、検索者に検索語とそれに対する重みの入力を求め、形態素解析をすることなく、入力された検索語とそれに対する重みによって検索クエリを作成することもできる。なお、検索者が検索語に対する重みの入力を省略した場合、デフォルト値（例えば、１）を設定すればよい。 In this embodiment, the search method when the searcher inputs the sentence 1701 as the search term has been described. However, the searcher is requested to input the search term and its weight, and without performing morphological analysis, A search query can be created based on the input search term and the weight for it. When the searcher omits the input of the weight for the search word, a default value (for example, 1) may be set.

また、検索したい内容に関連したコンテンツのコンテンツＩＤの入力を検索者に求めることによって、形態素解析をすることなく、入力されたコンテンツＩＤに基づいて検索クエリ行列を作成することもできる。具体的には、まず、コンテンツＩＤに基づいてコンテンツインデックスＤ１３１から語彙集合を取得し、その重みの上位ｎ件を取得すれば、ユーザが検索したい内容を特徴付ける語彙集合を取得することができる。そして、得られた語彙集合に基づいて同様の処理を実行することによって検索クエリ行列を作成し、検索者が望むコンテンツを提示することができる。 Further, by requesting the searcher to input the content ID of the content related to the content to be searched, it is possible to create a search query matrix based on the input content ID without performing morphological analysis. Specifically, first, if a vocabulary set is acquired from the content index D131 based on the content ID and the top n items of the weight are acquired, the vocabulary set that characterizes the content that the user wants to search can be acquired. A search query matrix can be created by executing the same processing based on the obtained vocabulary set, and the content desired by the searcher can be presented.

さらに、ユーザの嗜好に合うコンテンツを推薦する処理に適用することも可能である。この場合、ユーザＩＤに基づいてユーザインデックスＤ１３５から語彙集合を取得し、その重みの上位ｎ件を取得すれば、そのユーザを特徴付ける語彙集合を取得することができる。そして、このユーザを特徴付ける語彙集合に基づいて、同様の処理を用いて検索クエリ行列を作成することによって、ユーザの嗜好に合わせたコンテンツを推薦することができる。 Furthermore, the present invention can be applied to a process for recommending content that matches the user's preference. In this case, if a vocabulary set is obtained from the user index D135 based on the user ID and the top n weights are obtained, a vocabulary set that characterizes the user can be obtained. Then, based on the vocabulary set that characterizes the user, a search query matrix is created using a similar process, so that content that matches the user's preference can be recommended.
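
Both variations above start from the top n weighted vocabulary terms of an index entry, keyed either by content ID (content index D131) or by user ID (user index D135). A minimal sketch, assuming each index is a dictionary of per-ID term weights; the function name `top_terms` and the sample data are hypothetical:

```python
def top_terms(index, key, n):
    """index: {id: {term: weight}}; return the n heaviest terms for `key`."""
    weights = index.get(key, {})
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])

# Hypothetical user index entry; the same helper applies to a content index.
user_index = {"u1": {"search": 0.9, "matrix": 0.7, "bus": 0.1}}
seed = top_terms(user_index, "u1", 2)
```

The returned term weights can then be used in place of the keywords and weights obtained from the search condition expression when building the search query matrix.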

図１７は、本発明の実施の形態の類似コンテンツ抽出処理（Ｓ４１３）の詳細な手順を示すフローチャートである。 FIG. 17 is a flowchart illustrating a detailed procedure of the similar content extraction process (S413) according to the embodiment of this invention.

まず、ステップＳ４１２で生成された検索クエリ行列をワークエリアから読み出す（Ｓ４１３１）。その後、コンテンツ抽出プログラムＰ１４１５を実行し、検索クエリ行列のラベルを用いて、コンテンツインデックスＤ１３１を検索し、検索クエリ行列のラベルの語彙が含まれているコンテンツのリストを抽出する（Ｓ４１３２）。 First, the search query matrix generated in step S412 is read from the work area (S4131). After that, the content extraction program P1415 is executed, the content index D131 is searched using the labels of the search query matrix, and the list of contents including the vocabulary of the labels of the search query matrix is extracted (S4132).

その後、ループを制御するためのパラメータｎを１に初期設定する（Ｓ４１３３）。そして、コンテンツ行列生成プログラムＰ１４１３を実行し、抽出されたｎ番目のコンテンツのコンテンツ行列を生成する（Ｓ４１３４）。このコンテンツ行列生成処理の詳細は、図１８を用いて後述する。 Thereafter, the parameter n for controlling the loop is initialized to 1 (S4133). Then, the content matrix generation program P1413 is executed to generate a content matrix of the extracted nth content (S4134). Details of the content matrix generation processing will be described later with reference to FIG.

そして、類似度算出プログラムＰ１４１４を実行し、検索クエリ行列と生成したｎ番目のコンテンツ行列との類似度を算出する（Ｓ４１３５）。この類似度算出処理の詳細は、図２０を用いて後述する。 Then, the similarity calculation program P1414 is executed to calculate the similarity between the search query matrix and the generated nth content matrix (S4135). Details of this similarity calculation processing will be described later with reference to FIG. 20.

その後、算出した類似度をコンテンツと関連付けて、ワークエリアに格納する（Ｓ４１３６）。 Thereafter, the calculated similarity is associated with the content and stored in the work area (S4136).

その後、ｎ＋１番目のコンテンツが存在するか否かを判定する（Ｓ４１３７）。その結果、ｎ＋１番目のコンテンツが存在する場合、ｎに１を加算し（Ｓ４１３８）、ステップＳ４１３４に戻り、次のコンテンツを処理する。一方、ｎ＋１番目のコンテンツが存在しない場合、抽出した全てのコンテンツの処理が終了しているので、類似コンテンツ抽出処理を終了し、ステップＳ４１４に進む。 Thereafter, it is determined whether or not the (n + 1) th content exists (S4137). As a result, if the (n + 1) th content exists, 1 is added to n (S4138), and the process returns to step S4134 to process the next content. On the other hand, if the (n + 1) th content does not exist, the processing of all the extracted contents is complete, so the similar content extraction process is terminated and the process proceeds to step S414.
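
The filtering and scoring loop of FIG. 17, followed by the sort of step S414, can be sketched as below. This is an illustration under assumptions: the content index D131 is modelled as a dictionary of per-content term weights, a content item is treated as a candidate if it contains at least one label term, and `score` is a hypothetical callable standing in for steps S4134 and S4135.

```python
def rank_contents(query_labels, content_index, score):
    """content_index: {content_id: {term: weight}}; score: callable(content_id) -> float."""
    terms = set(query_labels[1:])  # skip the abstract node label
    # S4132: keep only content whose index entry contains a label term.
    candidates = [
        cid for cid, weights in content_index.items()
        if any(term in weights for term in terms)
    ]
    # S4133-S4138: score every candidate; S414: sort by descending similarity.
    scored = [(cid, score(cid)) for cid in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

index = {"c1": {"BT": 0.8}, "c2": {"cooking": 0.9}, "c3": {"system": 0.4}}
fake_scores = {"c1": 0.2, "c3": 0.7}  # placeholder similarities for the example
ranking = rank_contents(["<abstract>", "BT", "system"], index, fake_scores.get)
```

Content "c2" shares no term with the query labels, so it is filtered out before any matrix is built for it.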

図１８は、本発明の実施の形態のコンテンツ行列生成手順（Ｓ４１３４）の詳細な手順を示すフローチャートである。 FIG. 18 is a flowchart illustrating a detailed procedure of the content matrix generation procedure (S4134) according to the embodiment of this invention.

まず、検索クエリ行列の次数と同じ次数の正方零行列を作成し、作成した行列をコンテンツ行列に初期設定する（Ｓ４１３４０１）。 First, a square zero matrix having the same order as the order of the search query matrix is created, and the created matrix is initialized as a content matrix (S413401).

そして、検索クエリ行列の行及び列のラベルを、作成したコンテンツ行列の行及び列のラベルに設定する（Ｓ４１３４０２）。すなわち、コンテンツ行列の行と列のラベルには、検索クエリ行列の行及び列のラベルと同じ抽象ノード及び検索キーワードが同じ順で設定される。なお、抽象ノードは第１行目及び第１列目のラベルに設定される。 Then, the row and column labels of the search query matrix are set as the row and column labels of the created content matrix (S413402). That is, the same abstract node and search keywords as in the row and column labels of the search query matrix are set, in the same order, in the row and column labels of the content matrix. The abstract node is set as the label of the first row and the first column.

その後、コンテンツ行列のラベルに設定された語彙を示すパラメータｎを２に初期設定した後（Ｓ４１３４０３）、コンテンツ行列のｎ行のラベル（語彙）をワークエリアから読み込む（Ｓ４１３４０４）。 Thereafter, the parameter n indicating the vocabulary set in the label of the content matrix is initialized to 2 (S413403), and the label (vocabulary) of n rows in the content matrix is read from the work area (S413404).

その後、コンテンツ行列のｎ行のラベル（語彙）の重みを、ｎ行のラベルとコンテンツＩＤに基づいて、コンテンツインデックスＤ１３１から取得し、取得した重みをコンテンツ行列の＜１，ｎ＞、＜ｎ，１＞に設定する（Ｓ４１３４０５）。 Thereafter, the weights of the labels (vocabulary) of n rows of the content matrix are obtained from the content index D131 based on the labels and content IDs of the n rows, and the obtained weights are <1, n>, <n, 1> is set (S413405).

その後、ｎ＋１番目のラベル（語彙）が存在するか否かを判定する（Ｓ４１３４０６）。その結果、ｎ＋１番目の語彙が存在する場合、ｎに１を加算し（Ｓ４１３４０７）、ステップＳ４１３４０４に戻り、次の語彙を処理する。一方、ｎ＋１番目の語彙が存在しない場合、抽象ノードの処理は終了したので、次に検索キーワードの処理に移るため、検索クエリ行列の行を示すパラメータｉを２に初期設定する（Ｓ４１３４０８）。 Thereafter, it is determined whether or not the (n + 1) th label (vocabulary) exists (S413406). As a result, if the (n + 1) th vocabulary exists, 1 is added to n (S413407), and the process returns to step S413404 to process the next vocabulary. On the other hand, if the (n + 1) th vocabulary does not exist, the abstract node processing is completed, so that the parameter i indicating the row of the search query matrix is initialized to 2 in order to move to the search keyword processing next (S413408).

その後、コンテンツ行列のｉ行のラベル（語彙）をワークエリアから読み込む（Ｓ４１３４０９）。その後、コンテンツ行列の列を示すパラメータｊを２に初期設定した後（Ｓ４１３４１０）、コンテンツ行列のｊ列のラベル（語彙）を読み込む（Ｓ４１３４１１）。 Thereafter, the label (vocabulary) of i rows of the content matrix is read from the work area (S413409). Thereafter, after initializing the parameter j indicating the column of the content matrix to 2 (S413410), the label (vocabulary) of the j column of the content matrix is read (S413411).

その後、コンテンツ行列のｉ行のラベルと、ｊ列のラベルとの関連度をｉ行のラベル、ｊ列のラベル及びコンテンツＩＤに基づいて、コンテンツ語彙モデルＤ１３２から取得し（Ｓ４１３４１２）、取得した関連度をコンテンツ行列の＜ｉ，ｊ＞に設定する。ただし、関連度が取得できなかった場合、何もせず、次のステップに進む（Ｓ４１３４１３）。 Thereafter, the degree of association between the label of row i and the label of column j of the content matrix is obtained from the content vocabulary model D132 based on the row i label, the column j label, and the content ID (S413412), and the obtained degree of association is set in element <i, j> of the content matrix. However, if the degree of association cannot be obtained, nothing is done and the process proceeds to the next step (S413413).

その後、ｊ＋１番目のラベル（語彙）が存在するか否かを判定する（Ｓ４１３４１４）。その結果、ｊ＋１番目の語彙が存在する場合、ｊに１を加算し（Ｓ４１３４１７）、ステップＳ４１３４１１に戻り、次の列を処理する。一方、ｊ＋１番目の語彙が存在しない場合、ｉ行の処理は終了したので、次に行に移るため、ｉ＋１番目のラベル（語彙）が存在するか否かを判定する（Ｓ４１３４１５）。 Thereafter, it is determined whether or not the j + 1-th label (vocabulary) exists (S413414). As a result, when the j + 1-th vocabulary exists, 1 is added to j (S413417), and the process returns to step S413411 to process the next column. On the other hand, if the j + 1-th vocabulary does not exist, the processing for the i-th row is completed, so that it moves to the next row, and it is determined whether or not the i + 1-th label (vocabulary) exists (S413415).

その結果、ｉ＋１番目の語彙が存在する場合、ｉに１を加算し（Ｓ４１３４１８）、ステップＳ４１３４０９に戻り、次のｉ＋１行を処理する。一方、ｉ＋１番目の語彙が存在しない場合、コンテンツ行列への語彙間の関連度の登録が終了したので、生成したコンテンツ行列をワークエリアに格納し（Ｓ４１３４１６）、コンテンツ行列生成処理（Ｓ４１３４）を終了し、ステップＳ４１３５に進む。 If the (i+1)th vocabulary exists, 1 is added to i (S413418), and the process returns to step S413409 to process row i+1. On the other hand, if the (i+1)th vocabulary does not exist, registration of the degrees of association between vocabulary terms in the content matrix is complete, so the generated content matrix is stored in the work area (S413416), the content matrix generation process (S4134) ends, and the process proceeds to step S4135.
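
The content matrix generation of FIG. 18 can be sketched in the same style. The data layouts are assumptions for illustration: the content index D131 is a dictionary of per-content term weights, and the content vocabulary model D132 is a dictionary keyed by (content ID, term 1, term 2). The function name `build_content_matrix` is hypothetical, and indices are zero-based.

```python
def build_content_matrix(labels, content_id, content_index, content_vocab_model):
    """labels: query-matrix labels with the abstract node first.

    content_index: {content_id: {term: weight}}
    content_vocab_model: {(content_id, term1, term2): relevance}
    """
    size = len(labels)
    matrix = [[0.0] * size for _ in range(size)]  # same order as the query matrix

    # Abstract-node row and column: term weights of this content (S413405).
    weights = content_index.get(content_id, {})
    for n in range(1, size):
        w = weights.get(labels[n], 0.0)
        matrix[0][n] = w
        matrix[n][0] = w

    # Remaining elements: term-to-term relevance within this content (S413412).
    for i in range(1, size):
        for j in range(1, size):
            rel = content_vocab_model.get((content_id, labels[i], labels[j]))
            if rel is None:
                rel = content_vocab_model.get((content_id, labels[j], labels[i]))
            if rel is not None:
                matrix[i][j] = rel
    return matrix

content_matrix = build_content_matrix(
    ["<abstract>", "BT", "system"],
    "c1",
    {"c1": {"BT": 0.8}},
    {("c1", "BT", "system"): 0.4},
)
```

Because the labels are copied from the search query matrix in the same order, the two matrices can be compared column by column in the subsequent similarity calculation.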

図１９は、本発明の実施の形態のステップＳ４１３におけるコンテンツ行列の生成の例を説明する図である。 FIG. 19 is a diagram illustrating an example of generating a content matrix in step S413 according to the embodiment of this invention.

まず、ステップＳ４１３２では、検索クエリ行列のラベルを用いて該当するコンテンツをフィルタリングする。このステップは、図１７のステップＳ４１３２と同じである。具体的には、検索クエリ行列１７０４のラベルを用いて、コンテンツインデックスＤ１３１を検索し、検索クエリ行列のラベルの語彙が含まれているコンテンツのリスト２０００を抽出する。 First, in step S4132, the corresponding content is filtered using the label of the search query matrix. This step is the same as step S4132 of FIG. Specifically, the content index D131 is searched using the labels of the search query matrix 1704, and the content list 2000 including the vocabulary of the labels of the search query matrix is extracted.

次に、ステップＳ４１３４では、コンテンツ行列２００１を生成する。ここで生成されるコンテンツ行列２００１の行及び列のラベルには、検索クエリ行列の行及び列のラベルが設定される。コンテンツ行列２００１の１行目及び１列目には、コンテンツインデックスＤ１３１に記録されたコンテンツと語彙の重みが登録される。コンテンツ行列２００１の２行目及び２列目以後の要素には、コンテンツ語彙モデルＤ１３２に記録された語彙間の関連度が登録される。 Next, in step S4134, a content matrix 2001 is generated. The row and column labels of the search query matrix are set as the row and column labels of the content matrix 2001 generated here. In the first row and the first column of the content matrix 2001, the weights of the vocabulary terms for the content, as recorded in the content index D131, are registered. In the elements from the second row and second column onward of the content matrix 2001, the degrees of association between vocabulary terms recorded in the content vocabulary model D132 are registered.

その後、ステップＳ４１３５では、検索クエリ行列とコンテンツ行列との類似度を算出する（Ｓ４１３５）。 Thereafter, in step S4135, the similarity between the search query matrix and the content matrix is calculated (S4135).

図２０は、本発明の実施の形態の類似度算出手順（Ｓ４１３５）の詳細な手順を示すフローチャートである。 FIG. 20 is a flowchart illustrating a detailed procedure of the similarity calculation procedure (S4135) according to the embodiment of this invention.

まず、ステップＳ４１２で生成した検索クエリ行列、及びステップＳ４１３４で生成したコンテンツ行列をワークエリアから読み出す（Ｓ４１３５０１）。 First, the search query matrix generated in step S412 and the content matrix generated in step S4134 are read out from the work area (S413501).

その後、類似度を計算するためのパラメータＳＵＭを０に初期設定し（Ｓ４１３５０２）、検索クエリ行列及びコンテンツ行列の列を制御するパラメータｎを１に初期設定する（Ｓ４１３５０３）。 Thereafter, the parameter SUM used for calculating the similarity is initialized to 0 (S413502), and the parameter n for controlling the column of the search query matrix and the content matrix is initialized to 1 (S413503).

そして、検索クエリ行列のｎ列目をベクトル化（ＱＶｎ）し、ベクトルＱＶｎをワークエリアに格納する（Ｓ４１３５０４）。さらに、コンテンツ行列のｎ列目をベクトル化（ＣＶｎ）し、ベクトルＣＶｎをワークエリアに格納する（Ｓ４１３５０５）。 Then, the nth column of the search query matrix is vectorized (QVn), and the vector QVn is stored in the work area (S413504). Furthermore, the nth column of the content matrix is vectorized (CVn), and the vector CVn is stored in the work area (S413505).

その後、ワークエリアに格納されているベクトルＱＶｎとベクトルＣＶｎとの類似度を算出する。この類似度の算出には、コサイン類似度などの手法を利用することができる（Ｓ４１３５０６）。なお、コサイン類似度は、ベクトルＱＶｎとＣＶｎとの間の内積を、両ベクトルのノルム（ベクトルの幾何学的な長さ）の積で除したものを用いることができる。そして、算出したベクトル間の類似度をＳＵＭに加算する（Ｓ４１３５０７）。 Thereafter, the similarity between the vector QVn and the vector CVn stored in the work area is calculated. For the calculation of the similarity, a technique such as cosine similarity can be used (S413506). The cosine similarity can be obtained by dividing the inner product of the vectors QVn and CVn by the product of the norms (the geometric lengths) of the two vectors. Then, the calculated similarity between the vectors is added to SUM (S413507).

その後、ｎ＋１列が存在するか否かを判定する（Ｓ４１３５０８）。その結果、ｎ＋１列目が存在する場合、ｎに１を加算し（Ｓ４１３５１０）、ステップＳ４１３５０４に戻り、次の列を処理する。一方、ｎ＋１列目が存在しない場合、ＳＵＭを検索クエリ行列の次元数で除算することによって、類似度を算出する（Ｓ４１３５０９）。その後、類似度算出手順（Ｓ４１３５）を終了し、ステップＳ４１３６に進み、算出された類似度をワークエリアに格納する。 Thereafter, it is determined whether or not n + 1 columns exist (S413508). As a result, when the (n + 1) th column exists, 1 is added to n (S413510), and the process returns to step S413504 to process the next column. On the other hand, if the n + 1-th column does not exist, the similarity is calculated by dividing SUM by the number of dimensions of the search query matrix (S413509). Thereafter, the similarity calculation procedure (S4135) is terminated, the process proceeds to step S4136, and the calculated similarity is stored in the work area.
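
The column-wise similarity calculation of FIG. 20 can be sketched as follows. The function names are hypothetical, and returning 0 for an all-zero column is an assumption made here to avoid division by zero; the embodiment does not specify that case.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def matrix_similarity(query, content):
    """Average the per-column cosine similarities of two same-order square matrices."""
    size = len(query)
    total = 0.0                                # SUM (S413502)
    for n in range(size):
        qv = [row[n] for row in query]         # QVn (S413504)
        cv = [row[n] for row in content]       # CVn (S413505)
        total += cosine(qv, cv)                # S413506-S413507
    return total / size                        # S413509

q = [[0.0, 1.0], [1.0, 0.0]]
c = [[0.0, 1.0], [0.0, 0.0]]
same = matrix_similarity(q, q)
partial = matrix_similarity(q, c)
```

In this example the two matrices agree in one of their two columns, so the averaged similarity is half that of identical matrices.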

The above describes a method of calculating the similarity between the search query matrix and the content matrix using cosine similarity (the normalized inner product between vectors).

In this embodiment, the search query matrix and the content matrix are created by using the analysis results stored in the various storage devices as they are. However, the search query matrix and the content matrix can also be created by other methods.

Viewed as a whole, the search query matrix or the content matrix can be regarded as a graph structure in which the labels are nodes and the element values of the matrix are edges. In this view, the search query matrix or the content matrix has the form known in graph theory as an adjacency matrix.
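As a small illustration of this graph view, the sketch below lists the weighted edges encoded by a labeled matrix. The function name and the sample labels are hypothetical and serve only to show the correspondence between matrix elements and edges.

```python
def matrix_to_edges(labels, matrix):
    """Return (source, target, weight) for every non-zero matrix element."""
    size = len(labels)
    return [
        (labels[i], labels[j], matrix[i][j])
        for i in range(size)
        for j in range(size)
        if matrix[i][j] != 0
    ]


# Hypothetical 3-node example: one query node and two vocabulary nodes.
labels = ["query", "apple", "fruit"]
matrix = [
    [0, 1, 0],
    [1, 0, 2],
    [0, 2, 0],
]
edges = matrix_to_edges(labels, matrix)
```

A symmetric matrix yields each edge twice, once per direction, which matches the usual adjacency-matrix encoding of an undirected graph.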

By calculating the similarity between the nodes of this graph structure, the relationship between the search query or content and the vocabulary, and the relevance between vocabulary items, can be estimated more accurately. This in turn improves the accuracy of the calculated similarity.

Several methods are known in graph theory for calculating the similarity between nodes in a graph, such as a method based on the von Neumann kernel and a method based on the Laplacian heat (diffusion) kernel, and any of these methods can be used.

For example, the similarity (RL) between nodes in the search query matrix can be calculated using the regularized Laplacian kernel. Specifically, the following equation may be used.

The value (RL) obtained by the following equation is a matrix whose element values indicate the similarity between nodes, that is, the degree of relationship between the search query and the vocabulary, and the relevance between vocabulary items.
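The equation referred to here is not reproduced in this text. As a point of reference only, the regularized Laplacian kernel is commonly defined as shown below. The symbols A (the search query matrix viewed as an adjacency matrix), D (the diagonal degree matrix), L (the graph Laplacian), I (the identity matrix) and β (a positive diffusion parameter) are notation assumed here and are not taken from the original.

```latex
L = D - A, \qquad D_{ii} = \sum_{j} A_{ij}, \qquad
RL = \sum_{k=0}^{\infty} \beta^{k} (-L)^{k} = (I + \beta L)^{-1}
```

The series form converges when β multiplied by the largest eigenvalue of L is smaller than 1. For a symmetric matrix A with non-negative elements, the closed form on the right exists for any β ≥ 0.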

Similarly, the similarity (RL) between nodes in the content matrix can be calculated. In this case as well, the value (RL) obtained by the above equation is a matrix, whose element values indicate the similarity between nodes, that is, the degree of relationship between the content and the vocabulary, and the relevance between vocabulary items.

Then, using the search query matrix and the content matrix reconstructed by applying the regularized Laplacian kernel, the similarity can be calculated by the method of step S4135 described above.

Note that the regularized Laplacian kernel may be applied to only one of the search query matrix and the content matrix. This is because the kernel is applied to improve the accuracy of the element values (relevance) in the search query matrix or the content matrix, and applying it does not change the meaning of the labels or of the element values in that matrix.
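Applying the kernel to a matrix can be sketched as follows, assuming the commonly used closed form (I + βL)^-1 with L = D - A, where D is the diagonal matrix of row sums. The function names, the default value of beta and the use of Gauss-Jordan elimination are choices made for this illustration only; the original does not specify how the kernel is evaluated.

```python
def invert(m):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    # Augment m with the identity matrix on the right.
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def regularized_laplacian(adjacency, beta=0.1):
    """Return (I + beta * L)^-1, where L = D - A is the graph Laplacian."""
    n = len(adjacency)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        degree = sum(adjacency[i])  # D_ii: sum of row i
        for j in range(n):
            laplacian = (degree if i == j else 0.0) - adjacency[i][j]
            m[i][j] = (1.0 if i == j else 0.0) + beta * laplacian
    return invert(m)
```

The returned matrix has the same labels as the input, so it can be passed to the column-wise similarity calculation of step S4135 in place of the original matrix.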

<Information Search Client 105>
FIG. 21 is a block diagram illustrating the configuration of the information search client 105 according to the embodiment of the present invention.

The information search client 105 has the same configuration as the content analysis subsystem 101 (FIG. 2) described above, except for the programs it stores. Components identical to those of the content analysis subsystem 101 are therefore denoted by the same reference numerals, and their description is omitted.

That is, the information search client 105 is a computer that includes a memory 110, a storage device 120, a CPU 130, an output device 140, an input device 150, and a communication interface 160, which are connected by a bus 170.

The memory 110 stores the programs executed by the CPU 130. Specifically, the system control program P10 and the search client control program P15 are stored in the memory 110.

The search client control program P15 generates the search request to be transmitted to the information search server 104, and includes, as subprograms, a search condition input program P151 and a search result display program P152.

The search condition input program P151 accepts search conditions input by the user and transmits a search request to the information search server 104. The search result display program P152 displays search results according to instructions from the user.

The storage device 120 stores various programs D100. The various programs D100 include the system control program P10 and the search client control program P15, which are loaded into the memory 110 when executed by the CPU 130.

The storage device 120 also stores search result data D150, a cache that temporarily holds the search results transferred from the information search server 104.

By executing the search client control program P15, the information search client 105 generates the search request to be transmitted to the information search server 104 and displays the result of the search performed by the information search server 104. The details of this processing are described next.

FIGS. 22A and 22B are flowcharts of the processing executed by the information search client 105 according to the embodiment of the present invention.

First, a user authentication screen is displayed to prompt the input of user authentication information (S501). The user authentication information input by the user is then acquired (S502) and sent to the information search server 104 with a request for user authentication (S503).

Thereafter, when the result of the user authentication is received from the information search server 104, it is determined from the received result whether the authentication succeeded (S504). If the authentication failed, the process returns to step S501, indicates the failure, and again requests user authentication information. If the authentication succeeded, a search condition input screen is displayed to prompt the input of an instruction (command) (S505).

Thereafter, when a command is input (S506), the input command is analyzed (S507).

If the analyzed command is a client stop command, the information search processing ends. If the analyzed command is a content search command, the search condition input program P151 is executed, a search request is generated from the input data (S508), and the generated search request is transmitted to the information search server 104 (S509).

Thereafter, when a search result is received from the information search server 104, the received search result is stored in the search result data D150 of the storage device 120 (S510). The content search result includes a search result identifier and all or a subset of the content identifiers in the retrieved content set, and is transmitted from the information search server 104 in step S416 of FIG. 14.

Thereafter, the search result display program P152 is started, and a search result display and instruction input screen is displayed to prompt the input of an instruction (S511). The instruction input by the user is then acquired (S512) and analyzed (S513).

If the analyzed instruction is an instruction to display the list of retrieved content identifiers, a content search result identifier inquiry request is created and transmitted to the information search server 104 (S514). Thereafter, when the set of content identifiers is received from the information search server 104, the received list of content identifiers is displayed (S515).

If the analyzed instruction is an instruction to display the content itself, a content transfer request is created and transmitted to the information search server 104 (S516). Thereafter, when the content data is received from the information search server 104, the received content data is stored in the search result data D150 (S517). The received content data is then converted into a predetermined form and displayed (S518).

If the analyzed instruction is to end the search result display, the process returns to step S506 and waits for a further command.
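The client flow of steps S501 to S518 can be sketched as a simple command loop. Everything named here is hypothetical: `FakeServer` is an in-memory stand-in for the information search server 104, the command names are invented, scripted input replaces the screens, and the nested result-display loop (S511 to S513) is flattened into the main command loop for brevity.

```python
class FakeServer:
    """Hypothetical in-memory stand-in for the information search server 104."""

    def __init__(self, password, documents):
        self.password = password
        self.documents = documents  # content identifier -> text

    def authenticate(self, credential):
        return credential == self.password

    def search(self, keyword):
        # A search result identifier plus the matching content identifiers.
        ids = sorted(k for k, text in self.documents.items() if keyword in text)
        return {"result_id": 1, "content_ids": ids}

    def list_ids(self, result):
        return result["content_ids"]

    def fetch(self, content_id):
        return self.documents[content_id]


def run_client(server, credentials, commands):
    """Drive the client flow with scripted input; return what would be displayed."""
    shown = []
    creds = iter(credentials)
    # S501-S504: repeat the authentication prompt until the server accepts.
    while not server.authenticate(next(creds)):
        shown.append("authentication failed")
    cache = {}  # stands in for the search result data D150
    for command, argument in commands:            # S506-S507
        if command == "stop":                     # client stop command
            break
        if command == "search":                   # S508-S510
            cache["result"] = server.search(argument)
            shown.append("results ready")
        elif command == "list" and "result" in cache:   # S514-S515
            shown.append(server.list_ids(cache["result"]))
        elif command == "show":                   # S516-S518
            cache[argument] = server.fetch(argument)
            shown.append("content: " + cache[argument])
    return shown
```

For example, a session with one failed login followed by a search, a list request and a content request could be scripted as `run_client(server, ["bad", "pw"], [("search", "graph"), ("list", None), ("show", "d1"), ("stop", None)])`.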

As described above, the embodiment of the present invention takes into account not only the relationship between the content and the search keywords included in the search query, but also the relationships between the vocabulary items included in the content and between the search keywords included in the search query. As a result, even for ambiguous vocabulary, content related to the vocabulary the searcher uses routinely receives a higher similarity, and content related to vocabulary the searcher does not routinely use receives a lower similarity. In other words, using the user vocabulary model compensates for variation in words that have multiple meanings, so that content close to the user's preferences is displayed at the top and search accuracy improves.

101 Content analysis subsystem
102 User analysis subsystem
103 Repository management subsystem
104 Information search server
105 Information search client
106 Network

Claims

1. An information retrieval system that comprises one or more computers and retrieves text information, wherein
the text information includes at least one of text data included in information content and text data added to the information content, and
the computer comprises:
means for acquiring a first relationship between words included in a search condition by analyzing the search condition input by a user;
means for acquiring a second relationship between words included in the text information by analyzing the stored text information; and
means for selecting text information to be output as matching the input search condition according to the similarity between the acquired first relationship and second relationship.

2. The information retrieval system according to claim 1, wherein
the information retrieval system manages a user vocabulary model in which relevance of words in text information related to the user is recorded, and
the means for acquiring the first relationship:
extracts words included in the input search condition by morphological analysis; and
generates, by referring to the user vocabulary model, a search query matrix in which the relationship between the extracted words is defined, as a representation of the first relationship.

3. The information retrieval system according to claim 2, wherein
the computer includes user vocabulary model generation means for generating the user vocabulary model, and
the user vocabulary model generation means:
identifies text information accessed by the user and text information related to data accessed by the user;
extracts words included in the identified text information by morphological analysis;
collects words that exist within a predetermined co-occurrence range among the extracted words; and
obtains the relevance of the words by counting the number of appearances of the collected words, weighted according to the type of access.

4. The information search system according to any one of the preceding claims, wherein
the information search system manages a content vocabulary model in which relevance of words included in the stored text information is recorded, and
the means for acquiring the second relationship:
extracts words included in the stored text information by morphological analysis; and
generates, by referring to the content vocabulary model, a content matrix in which the relationship between the extracted words is defined, as the first relationship.

5. The information search system according to claim 4, wherein
the computer includes content vocabulary model generation means for generating the content vocabulary model, and
the content vocabulary model generation means:
extracts words included in the stored text information by morphological analysis;
collects words that exist within a predetermined co-occurrence range among the words included in the stored text information; and
obtains the relevance of the words by counting the number of appearances of the collected words.

6. The information search system according to claim 1, further comprising means for calculating the similarity between the first relationship and the second relationship by dividing the inner product between a search query matrix indicating the first relationship and a content matrix indicating the second relationship by the product of the norm of the search query matrix and the norm of the content matrix.

7. An information search method in a search system that comprises one or more computers and searches for text information, wherein
the text information is text data included in information content or text data added to the information content,
the computer includes a processor that executes a program and a memory that stores the program to be executed, and
the method comprises the steps of:
the computer acquiring a first relationship between words included in a search condition by analyzing the search condition input by a user, and writing the first relationship to the memory;
the computer acquiring a second relationship between words included in the text information by analyzing the stored text information, and writing the second relationship to the memory; and
the computer selecting and outputting text information that matches the input search condition according to the similarity between the acquired first relationship and second relationship.

8. The information search method according to claim 7, wherein
the method includes the step of the computer managing a user vocabulary model in which relevance of words in text information related to the user is recorded, and
in the step of acquiring the first relationship, the computer extracts words included in the input search condition by morphological analysis and, by referring to the user vocabulary model, generates a search query matrix in which the relationship between the extracted words is defined, as a representation of the first relationship.

9. The information search method according to claim 8, further comprising the step of the computer generating the user vocabulary model by identifying text information accessed by the user and text information related to data accessed by the user, extracting words included in the identified text information by morphological analysis, collecting, among the extracted words, words that exist within a predetermined co-occurrence range, and obtaining the relevance of the words by counting the number of appearances of the collected words, weighted according to the type of access.

10. The information search method according to any one of claims 7 to 9, wherein
the method includes the step of the computer managing a content vocabulary model in which relevance of words included in the stored text information is recorded, and
in the step of acquiring the second relationship, the computer extracts words included in the stored text information by morphological analysis and, by referring to the content vocabulary model, generates a content matrix in which the relationship between the extracted words is defined, as a representation of the first relationship.

11. The information search method according to claim 10, further comprising the step of the computer generating the content vocabulary model by extracting words included in the stored text information by morphological analysis, collecting, among the words included in the stored text information, words that exist within a predetermined co-occurrence range, and obtaining the relevance of the words by counting the number of appearances of the collected words.

12. The information search method according to claim 7, further comprising the step of the computer calculating the similarity between the first relationship and the second relationship by dividing the inner product between a search query matrix indicating the first relationship and a content matrix indicating the second relationship by the product of the norm of the search query matrix and the norm of the content matrix.

13. An information search server that retrieves text information in response to a request from a client, wherein
the text information is text data included in information content or text data added to the information content, and
the information search server comprises:
means for acquiring a first relationship between words included in a search condition by analyzing the search condition input by a user;
means for acquiring a second relationship between words included in the text information by analyzing the stored text information; and
means for selecting text information to be output as matching the input search condition according to the similarity between the acquired first relationship and second relationship.