JP2002230021A

JP2002230021A - Information retrieval device and method, and storage medium

Info

Publication number: JP2002230021A
Application number: JP2001021796A
Authority: JP
Inventors: Eiichiro Toshima; 英一朗戸島
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2001-01-30
Filing date: 2001-01-30
Publication date: 2002-08-16

Abstract

PROBLEM TO BE SOLVED: To perform the information retrieval intended by a user easily and quickly. SOLUTION: Words are extracted from a retrieval query specified by the user and dependency analysis is performed (S41). Polysemy resolution is then attempted based on the result of the dependency analysis and a co-occurrence database (S42). If the polysemy is resolved, processing moves directly to S45; otherwise, polysemy resolution based on user profile information is performed, and processing then moves to S45. Next, expansion processing such as synonym expansion is performed using a query expansion dictionary (S45), and document vector generation processing for the retrieval query is performed (S46).

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an information retrieval apparatus, an information retrieval method, and a storage medium. More specifically, it relates to an information retrieval apparatus that performs information retrieval according to a search condition such as an input search sentence or search keyword (hereinafter referred to as a "query"), to an information retrieval method using that apparatus, and to a storage medium storing a retrieval procedure and a data structure for performing information retrieval.

[0002]

2. Description of the Related Art

With the recent development of computers and communication networks, large numbers of electronic documents are increasingly being accumulated in databases, and demand is accordingly growing for information retrieval that finds desired document data in such large electronic databases.

[0003] Conventionally, this type of information retrieval has mainly relied on methods that presuppose a literal match with the query, such as keyword search and full-text search. Recently, however, methods have been proposed for retrieving documents that are similar to a particular query or document.

[0004] One known method for retrieving such similar documents is the vector space model, in which documents are mapped to points in an n-dimensional vector space and the similarity between documents, or between a query and a document, is calculated from the distance between those points (for example, Kumamoto, Shimada, Kato: "Application of a Concept Base to Information Retrieval," IEICE Technical Report, Vol. AI98-63, pp. 9-16, 1999).

[0005] However, with such a vector space model, documents the user does not want are often retrieved, particularly when a short sentence is used as the query for a similar-document search.

[0006] That is, because the character strings used in a query are polysemous, the query may not be interpreted in the sense the user intends, or the meaning of the query may be insufficiently supplemented because the word senses of its character strings are interpreted incompletely. As a result, with the vector space model described above, the search results often fail to match the user's intention.

[0007] As a measure for avoiding, as far as possible, the retrieval of documents the user did not intend, a technique has been proposed in which, when a query is expanded with synonyms, synonyms with low co-occurrence between words are excluded from expansion (for example, Japanese Patent Application Laid-Open No. H11-45274; hereinafter referred to as the "first prior art").

[0008] In the first prior art, synonyms with low co-occurrence are removed from the search targets of the document search, so that documents containing synonyms unrelated to the documents the user wants are excluded from the search.

[0009] As another conventional technique, it has been proposed to record each user's access status for each document, create a preference vector based on that access status, and shift the query vector toward the preference vector (Japanese Patent Application Laid-Open No. H11-53394; hereinafter referred to as the "second prior art").

[0010] In the second prior art, shifting the query vector toward the preference vector brings the search results closer to the user's preferences, which makes it possible to retrieve documents that reflect those preferences.

[0011]

PROBLEMS TO BE SOLVED BY THE INVENTION

However, because the first prior art merely excludes synonyms with low co-occurrence when performing synonym expansion to broaden the meaning of a query, the sense of each word in the query is not identified. The polysemy of each word therefore remains unresolved, and documents the user does not want are often displayed as search results.

[0012] That is, in the first prior art, if the user enters as a query the character string "Tell me what kinds of forms there are," many documents are retrieved, including documents about business forms and documents about baseball pitching form, and the results are displayed with the polysemy of the query still intact.

[0013] However, a user whose job is designing financial forms needs documents about types of business forms and normally has no need for documents about baseball pitching posture, swimming strokes, and the like. Conversely, a user whose hobby is sports will often want documents about baseball pitching posture or swimming strokes and will usually not want documents about types of business forms.

[0014] That is, in the first prior art, although synonyms with low co-occurrence are excluded, the search is performed with the polysemy of the query intact. Many documents other than those the user intended are therefore displayed, and the desired search results cannot be obtained easily and quickly.

[0015] The second prior art forcibly shifts the meaning of the entire query in the direction of the user's past preferences, that is, the preference history. Consequently, when a form designer wants to look up a baseball pitcher's pitching form, for example, the designer must take care to enter the character string "types of pitching form" rather than simply "types of form" so that the system does not misinterpret the word sense, which makes the system less convenient to use.

[0016] Moreover, in the second prior art the query is uniformly moved toward the preference vector. Even when the character string "types of pitching form" is entered, documents about business forms may still be retrieved for a form designer, so search results the user does not want may be displayed.

[0017] Thus, in the first and second prior arts, search processing is merely performed based on the notation of the query or the user's preferences. The polysemy of the query is not taken into account and the semantic content of the query is not sufficiently supplemented, so search results the user does not want are often obtained.

[0018] The present invention has been made in view of these problems, and its object is to provide an information retrieval apparatus, an information retrieval method, and a storage medium that can easily and quickly perform information retrieval in line with the user's intention.

[0019]

MEANS FOR SOLVING THE PROBLEMS

To achieve the above object, an information retrieval apparatus according to the present invention comprises: search condition input means for inputting a search condition; morphological analysis means for extracting words from the search condition input by the search condition input means; co-occurrence relation storage means for storing co-occurrence relations among a plurality of words in association with word senses; first polysemy resolution means for extracting the co-occurrence relations of the search condition and resolving polysemy based on the co-occurrence information stored in the co-occurrence relation storage means and the analysis result of the morphological analysis means; and information retrieval means for performing information retrieval based on the word senses whose polysemy has been resolved by the first polysemy resolution means. The apparatus further comprises user profile storage means storing user profile information that expresses the user's preferences, and second polysemy resolution means for resolving the polysemy of the search condition based on the user profile information stored in the user profile storage means and the analysis result of the morphological analysis means, wherein the first polysemy resolution means takes priority over the second polysemy resolution means.

[0020] An information retrieval method according to the present invention comprises: a search condition input step of inputting a search condition; a morphological analysis step of extracting words from the search condition input in the search condition input step; a first polysemy resolution step of extracting the co-occurrence relations of the search condition and resolving polysemy based on co-occurrence information, in which co-occurrence relations among a plurality of words are stored in association with word senses, and on the analysis result of the morphological analysis step; and an information retrieval step of performing information retrieval based on the word senses whose polysemy has been resolved in the first polysemy resolution step. The method further comprises a second polysemy resolution step of resolving the polysemy of the search condition based on user profile information expressing the user's preferences and the analysis result of the morphological analysis step, wherein the first polysemy resolution step takes priority over the second polysemy resolution step.
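The priority ordering between the two polysemy resolution steps can be pictured with a minimal sketch. It assumes that the first step either returns a sense or returns nothing, and that the second step scores each candidate sense against the user profile with a dot product. That scoring rule, the function name `resolve_sense`, and the sparse-dictionary vector layout are illustrative assumptions, not taken from the source.

```python
from typing import Dict, Optional

Vector = Dict[int, float]  # sparse vector: dimension number -> feature value


def resolve_sense(
    cooccurrence_sense: Optional[str],
    candidate_senses: Dict[str, Vector],
    user_profile: Vector,
) -> str:
    """Pick one sense, letting the co-occurrence result take priority."""
    # First polysemy resolution step: if co-occurrence matching produced a
    # sense, use it and never consult the profile.
    if cooccurrence_sense is not None:
        return cooccurrence_sense

    # Second step (fallback): score each candidate sense against the profile.
    # The dot product is an assumed scoring rule for illustration only.
    def score(vec: Vector) -> float:
        return sum(value * user_profile.get(dim, 0.0) for dim, value in vec.items())

    return max(candidate_senses, key=lambda sense: score(candidate_senses[sense]))
```

With this ordering, a query whose wording already fixes the sense is not overridden by the user's history; the profile only decides when the wording alone leaves the sense open.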

[0021] A storage medium according to the present invention stores a second polysemy resolution procedure, given priority over the first polysemy resolution procedure, for resolving the polysemy of the search condition based on user profile information expressing the user's preferences and the analysis result of the dependency analysis step. It further stores a word sense expansion procedure for performing expansion processing on the word senses whose polysemy has been resolved by the second polysemy resolution procedure.

[0022] Other features of the present invention will become apparent from the following description of the embodiments.

[0023]

EMBODIMENTS OF THE INVENTION

Next, an embodiment of the present invention will be described in detail with reference to the drawings.

[0024] FIG. 1 is a block diagram showing an embodiment of a document retrieval apparatus as the information retrieval apparatus according to the present invention. The document retrieval apparatus comprises: an input device 1 such as a keyboard and mouse; a display device 2 such as a CRT or liquid crystal display; a hard disk (HD) 3 storing predetermined data described later; a removable disk drive 4 for accessing external storage media such as a flexible disk (FD), CD (compact disc), or DVD (digital video disc); a communication device 5, such as a modem or LAN controller, for exchanging data with external systems over a communication line; a read-only memory (ROM) 6 storing a predetermined control program described later; a writable random access memory (RAM) 7 that temporarily stores various data and is used as a work area; and a central processing unit (CPU) 9 that is connected to the above components via a bus 8 and controls the entire apparatus.

[0025] The HD 3 stores a word vector dictionary 3a holding the semantic vectors of words on a per-sense basis, a document database 3b to be searched, a user profile 3c recording the user's preferences, a co-occurrence database 3d holding co-occurrence relations between words in a dependency relation, and a query expansion dictionary 3e used for synonym expansion and related-word expansion of queries.

[0026] In this embodiment, the control program executed by the CPU 9 is stored in the ROM 6. Alternatively, it may be stored in the HD 3 and loaded from the HD 3 into the RAM 7 for execution, or stored on an external storage medium such as an FD and loaded into the RAM 7 via the removable disk drive 4 for execution.

[0027] FIG. 2 shows an example of the display screen of the display device 2. The display screen consists of a query display section 2a, which displays the query (search condition), and a search result display section 2b, which displays the search results based on the query.

[0028] Specifically, the query display section 2a displays the search condition entered by the user via the input device 1: for example, a natural-language sentence (such as "types of form," "I want to know about types of form," or "variations of pitching form"), a list of keywords (such as "form, type"), or an existing document designated by the user (such as "document number 267").

[0029] The search result display section 2b displays, as the search results, a document ID identifying each document, the document title corresponding to that document ID, and the similarity of that document to the query.

[0030] FIG. 3 shows the format of the word vector dictionary 3a stored in the HD 3.

[0031] The word vector dictionary 3a is a collection of semantic vectors (lists of feature values, one per semantic category) representing the sense of each word; each dimension (1, 2, 3, ...) represents a semantic category.

[0032] That is, in the word vector dictionary 3a each word entry (1, 2, 3, ...) is assigned one specific word sense, and the degree to which the semantic category of each dimension is implied by each word, that is, the feature values of its semantic vector, is written in matrix form.

[0033] For example, dimension 3 represents the semantic category "space/sky," dimension 4 represents "transactions/trading," and dimension 7 represents "gesture/motion," while word 7 is assigned the specific sense "form (business form)." In the word vector dictionary 3a, the feature value of dimension 3 in the semantic vector of word 7 is "0," which shows that the word "form (business form)" carries no "space/sky" meaning at all.

[0034] For word 7, the feature value of dimension 4 is "21," which is large relative to its other feature values, whereas its feature value for dimension 7 is "1," which is relatively small. This indicates that "transactions/trading" contributes strongly to the meaning of "form (business form)," while "gesture/motion" contributes little.

[0035] Word 8 has the sense "form (posture)." For word 8, the feature value of dimension 4 is "0" and that of dimension 7 is "23," which is relatively large. This indicates that "form (posture)" carries no "transactions/trading" meaning at all, while "gesture/motion" contributes strongly to its meaning.

[0036] In this way, the word vector dictionary 3a makes it possible to determine, for each sense of each word, how strongly each semantic category contributes to its meaning.
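As an illustration of a sense-keyed dictionary of this kind, the following sketch stores one sparse vector per (surface form, sense) pair. Only the feature values quoted in the text for words 7 and 8 are filled in; the key layout, the names `WORD_VECTORS`, `feature`, and `dominant_dimension`, and the treatment of unlisted dimensions as 0 are assumptions made for the example.

```python
# Dimension numbers named in the text.
SPACE_SKY, TRADE, GESTURE = 3, 4, 7

# Sparse, sense-keyed word vectors: (surface form, sense) -> {dimension: feature}.
# Only values quoted in the text are listed; the rest of FIG. 3 is not guessed.
WORD_VECTORS = {
    ("form", "business form"): {SPACE_SKY: 0, TRADE: 21, GESTURE: 1},  # word 7
    ("form", "posture"): {TRADE: 0, GESTURE: 23},                      # word 8
}


def feature(surface, sense, dimension):
    """Return a feature value; unlisted dimensions default to 0 (assumption)."""
    return WORD_VECTORS[(surface, sense)].get(dimension, 0)


def dominant_dimension(surface, sense):
    """Return the dimension with the largest listed feature value."""
    vector = WORD_VECTORS[(surface, sense)]
    return max(vector, key=vector.get)
```

Keying on the sense rather than on the surface form alone is what keeps the two meanings of "form" from sharing a single blended vector.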

[0037] FIG. 4 shows the format of the document database 3b stored in the HD 3; the feature values of the document vectors are written in the document database 3b.

[0038] The meaning of a document is considered to be determined by the words used in it, so the meaning of each document is calculated by summing the semantic vectors of the words that make up the document. The resulting vector therefore has the same dimensions as the semantic vectors in the word vector dictionary 3a, each representing a specific semantic category. The summed vector is then normalized with "1" as the reference, and the normalized vector is stored in the document database 3b as the feature values of the document vector.
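The sum-then-normalize computation can be sketched as follows. The text says only that the vector is normalized with "1" as the reference; this example assumes that means scaling to unit Euclidean length, and the function name `document_vector` is hypothetical.

```python
import math
from typing import List, Sequence


def document_vector(word_vectors: Sequence[Sequence[float]], dims: int) -> List[float]:
    """Sum the semantic vectors of a document's words and normalize the result."""
    total = [0.0] * dims
    for vector in word_vectors:
        for i, value in enumerate(vector):
            total[i] += value
    # Assumption: "normalized to 1" is read here as unit Euclidean length.
    norm = math.sqrt(sum(value * value for value in total))
    if norm == 0.0:
        return total  # an empty or all-zero document stays all-zero
    return [value / norm for value in total]
```

Because every document ends up with the same length, long and short documents can be compared on the direction of their vectors rather than on their word counts.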

[0039] As FIG. 4 shows, for document ID "6949" the feature value of dimension 4 is "0.009" and that of dimension 7 is "0.425," whereas for document ID "6953" the feature value of dimension 4 is "0.362" and that of dimension 7 is "0.008." This shows that the document with ID "6949" contains the "gesture/motion" semantic category to some extent but hardly any of the "transactions/trading" category, while the document with ID "6953" contains the "transactions/trading" category to some extent but hardly any of the "gesture/motion" category.

[0040] FIG. 5 shows the format of the user profile 3c stored in the HD 3.

[0041] The user profile 3c has the same dimensions as the semantic vectors in the word vector dictionary 3a, and the profile is updated each time the user accesses a document file.

[0042] That is, the profile is set to "0" in the initial state. When the user accesses a particular document, the document vector of that document is calculated and added to the cumulative profile. Once the new cumulative profile has been obtained, it is normalized with "1" as the reference, and the normalized profile is updated.
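The update cycle just described (accumulate, then renormalize) might look like the sketch below. The class name `UserProfile`, its attribute names, and the choice of Euclidean normalization are assumptions for illustration; the source specifies only that a cumulative profile and a normalized profile are maintained.

```python
import math
from typing import List


class UserProfile:
    """Keeps a cumulative profile and a normalized copy derived from it."""

    def __init__(self, dims: int) -> None:
        self.cumulative = [0.0] * dims  # initial state: all zeros
        self.normalized = [0.0] * dims

    def record_access(self, document_vector: List[float]) -> None:
        """Add the accessed document's vector, then refresh the normalized profile."""
        if len(document_vector) != len(self.cumulative):
            raise ValueError("document vector has the wrong number of dimensions")
        self.cumulative = [c + d for c, d in zip(self.cumulative, document_vector)]
        # Assumption: normalization means scaling to unit Euclidean length.
        norm = math.sqrt(sum(c * c for c in self.cumulative))
        if norm == 0.0:
            self.normalized = list(self.cumulative)
        else:
            self.normalized = [c / norm for c in self.cumulative]
```

Keeping the unnormalized running total separate means each new access carries the same weight as earlier ones, which would not hold if the normalized profile itself were accumulated.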

[0043] FIG. 5(a) shows, for example, the profile of a user interested in sports; the feature value of dimension 7 (gesture/motion) is relatively large at "0.186." This indicates that the user frequently refers to documents belonging to the "gesture/motion" semantic category.

[0044] FIG. 5(b), on the other hand, shows the profile of a user with a strong interest in counter services. The feature value of dimension 7 (gesture/motion) is "0.000," whereas that of dimension 4 (transactions/trading) is a large "0.329." This indicates that the user frequently refers to documents about "transactions/trading."

[0045] FIG. 6 shows the format of the co-occurrence database 3d stored in the HD 3. Each entry stores three pieces of co-occurrence information: the dependent word (the modifier), the head word (the word being modified), and the particle between them. When there is no particle, "null" is written.

[0046] In this embodiment, the character strings in the input query are morphologically analyzed, and dependency analysis is then performed to extract the dependent word, head word, and particle information. The co-occurrence database 3d is consulted and these are matched against its entries. If the co-occurrence database 3d contains an entry corresponding to the dependent word, head word, and particle information, each word is interpreted as having the sense recorded in the co-occurrence database 3d.

[0047] For example, when the phrase "pitching form" is extracted by dependency analysis as "pitching / form," the sense of this "form" is interpreted as "posture" in accordance with the co-occurrence database 3d. When the query is "enter information in the form," dependency analysis extracts the words and particle "form / ni (に) / enter"; these are matched against the co-occurrence database 3d, and as a result this "form" is interpreted as having the sense "business form."
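The lookup described above can be sketched as a table keyed by the (dependent word, particle, head word) triple. The two entries mirror the examples in the text; how the sense is attached to each entry, the use of `None` for a missing particle, and the name `resolve_by_cooccurrence` are assumptions for the example, since the text describes only the three stored fields.

```python
from typing import Dict, Optional, Tuple

Triple = Tuple[str, Optional[str], str]  # (dependent word, particle or None, head word)

# Each entry maps a dependency triple to the senses it fixes.
# Attaching senses this way is an assumed layout, not the format of FIG. 6.
COOCCURRENCE_DB: Dict[Triple, Dict[str, str]] = {
    ("pitching", None, "form"): {"form": "posture"},
    ("form", "ni", "enter"): {"form": "business form"},
}


def resolve_by_cooccurrence(
    dependent: str, particle: Optional[str], head: str
) -> Optional[Dict[str, str]]:
    """Return the senses fixed by a matching entry, or None if there is no match."""
    return COOCCURRENCE_DB.get((dependent, particle, head))
```

A miss returns `None` rather than a default sense, so a caller can tell that the wording did not settle the sense and that some other means of resolution is needed.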

[0048] FIG. 7 shows the format of the query expansion dictionary 3e stored in the HD 3; FIG. 7(a) shows the synonym dictionary and FIG. 7(b) shows the related-word dictionary.

[0049] The synonym dictionary stores, for each headword, the synonyms into which it should be expanded. For example, the headword "form (posture)" has the synonyms "posture, shape, style, pose," and the headword "form (business form)" has the synonyms "format, slip, business form."

[0050] A synonym normally stands in a synonymous or narrower-concept relation to its headword. The synonym dictionary is used to expand each word in the query and broaden its semantic content to the permissible extent. Conventionally, synonyms were held against the surface form "form" as written in the query, so synonyms for "posture" and "business form" were expanded together. In this embodiment, the synonym dictionary is held on a per-sense basis, so "posture" and "business form" are never mixed during expansion. Sense information is also stored for the synonyms themselves, so processing can continue on a per-sense basis after synonym expansion.

[0051] The related-word dictionary stores, for each origin word, the related words into which it should be expanded. For example, the origin word "form (posture)" has the related words "sports, analysis, improvement," and the origin word "form (business form)" has the related words "purchase, application, bank transfer, remittance."

[0052] Unlike the synonyms described above, related words have no broader/narrower relation to the origin word. The related-word dictionary is used to expand each word in the query and thereby enrich, to some extent, the semantic content of the query as a whole.
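A per-sense expansion dictionary of the kind described for FIG. 7 could be sketched as below. The entries are the examples given in the text; the key layout, the names `SYNONYMS`, `RELATED`, and `expand`, and the omission of sense tags on the expanded words are simplifying assumptions.

```python
from typing import Dict, List, Tuple

SenseKey = Tuple[str, str]  # (surface form, sense)

# Synonym dictionary, FIG. 7(a): entries taken from the examples in the text.
SYNONYMS: Dict[SenseKey, List[str]] = {
    ("form", "posture"): ["posture", "shape", "style", "pose"],
    ("form", "business form"): ["format", "slip", "business form"],
}

# Related-word dictionary, FIG. 7(b).
RELATED: Dict[SenseKey, List[str]] = {
    ("form", "posture"): ["sports", "analysis", "improvement"],
    ("form", "business form"): ["purchase", "application", "bank transfer", "remittance"],
}


def expand(surface: str, sense: str) -> Dict[str, List[str]]:
    """Return the synonyms and related words for one resolved sense only."""
    key = (surface, sense)
    return {
        "synonyms": list(SYNONYMS.get(key, [])),
        "related": list(RELATED.get(key, [])),
    }
```

Because the lookup key includes the sense, expanding "form (posture)" can never pull in "slip" or "remittance", which is the mixing that a surface-form-keyed dictionary would produce.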

[0053] The document retrieval apparatus configured as described above operates in response to various inputs from the input device 1. Input signals from the input device 1 are supplied to the CPU 9, which reads the control program stored in the ROM 6 and performs various kinds of control in accordance with that program.

[0054] FIG. 8 is a flowchart showing an embodiment of the processing procedure of the document retrieval method executed by the document retrieval apparatus; this program is executed by the CPU 9.

[0055] In step S1, initialization processing such as initializing various parameters and displaying the initial screen is performed. In step S2, the apparatus waits for an operation input from the input device 1, and in the subsequent step S3 it determines the content of the input operation.

[0056] The retrieval procedure of this document retrieval method is broadly divided into three processes: registration in the document database, updating of the user profile, and execution of a search in response to a query. The user therefore selects and enters one of these three processes according to the stage of the search.

[0057] When registration in the document database 3b is requested, the process proceeds to step S4 and the document registration processing is executed. When updating of the user profile 3c is requested, the process proceeds to step S5 and the profile update processing is executed. When execution of a search is requested, the process proceeds to step S6 and the search execution processing is executed. The process then proceeds to step S7, where the result of the processing is expanded into a display pattern and output, and returns to step S2.
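The control flow of steps S2 through S7 amounts to a dispatch loop, which the following sketch illustrates. The operation names, the handler table, and the decision to skip unrecognized operations are assumptions for the example; the source describes only the three branches and the return to step S2.

```python
from typing import Callable, Dict, Iterable, Tuple


def main_loop(
    inputs: Iterable[Tuple[str, str]],
    handlers: Dict[str, Callable[[str], str]],
    display: Callable[[str], None],
) -> None:
    """Dispatch each input operation to its handler and display the result."""
    for operation, payload in inputs:      # S2: each item stands for one awaited input
        handler = handlers.get(operation)  # S3: determine the requested operation
        if handler is None:
            continue                       # assumption: unknown operations are ignored
        result = handler(payload)          # S4, S5, or S6
        display(result)                    # S7: output, then back to S2
```

For example, passing `{"register": ..., "update_profile": ..., "search": ...}` as `handlers` would correspond to the three branches S4, S5, and S6.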

[0058] FIG. 9 is a flowchart showing the procedure of the document registration processing executed in step S4 (FIG. 8), which registers document vectors in the document database 3b so that the search processing described later can be performed.

[0059] In step S11, morphological analysis is performed on the input query to extract words, and dependency analysis is then performed. In the subsequent step S12, the dependent words and head words obtained by the dependency analysis are matched against the co-occurrence database 3d, and if a character string pairing the dependent word with the head word is stored in the co-occurrence database 3d, the sense of the word is identified.

[0060] For a word whose sense could not be identified, the word vectors of all senses having that surface form are added together, each weighted according to its frequency.
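The frequency-weighted fallback for unresolved words can be sketched as follows. The text does not say how the weights are derived from the frequencies; this example assumes each sense is weighted by its share of the total frequency, and the function name `blend_senses` and the sparse-dictionary layout are hypothetical.

```python
from typing import Dict

Vector = Dict[int, float]  # sparse vector: dimension number -> feature value


def blend_senses(
    sense_vectors: Dict[str, Vector], sense_frequencies: Dict[str, float]
) -> Vector:
    """Combine all sense vectors of one surface form, weighted by sense frequency."""
    total = sum(sense_frequencies[sense] for sense in sense_vectors)
    if total <= 0:
        raise ValueError("sense frequencies must sum to a positive value")
    blended: Vector = {}
    for sense, vector in sense_vectors.items():
        # Assumption: the weight is the sense's relative frequency.
        weight = sense_frequencies[sense] / total
        for dim, value in vector.items():
            blended[dim] = blended.get(dim, 0.0) + weight * value
    return blended
```

The blended vector keeps some contribution from every sense, so an unresolved "form" leans toward its more common meaning without discarding the rarer one.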

[0061] Next, in step S13, a document vector is generated. That is, the word vector dictionary 3a is searched using the words extracted and the meanings specified in steps S11 and S12, the feature amounts of the meaning vectors are calculated, and the feature amounts of a document vector characterizing the document are generated from their sum. As described above, the document vector represents the meaning expressed by the document, and is generated by adding up, for each word, the feature amounts of the meaning vector written in the word vector dictionary 3a.
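The summation described in paragraphs [0060] and [0061] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the dictionary layout, the three dimensions, the frequencies, and the names `WORD_VECTOR_DICT`, `word_vector`, and `document_vector` are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for the word vector dictionary 3a. Each written form
# maps to its senses as (sense label, relative frequency, meaning vector).
# The three dimensions and all numbers are invented for illustration.
Sense = Tuple[str, float, List[float]]
WORD_VECTOR_DICT: Dict[str, List[Sense]] = {
    "form": [
        ("business form", 0.6, [0.0, 1.0, 0.0]),
        ("posture", 0.4, [0.0, 0.0, 1.0]),
    ],
    "print": [("print", 1.0, [0.5, 0.5, 0.0])],
}
DIMENSIONS = 3


def word_vector(word: str, sense: Optional[str]) -> List[float]:
    """Return the vector of the resolved sense, or a frequency-weighted
    sum over all senses when the meaning could not be specified."""
    senses = WORD_VECTOR_DICT[word]
    if sense is not None:
        for label, _freq, vector in senses:
            if label == sense:
                return list(vector)
        raise KeyError(f"{word!r} has no sense {sense!r}")
    total = [0.0] * DIMENSIONS
    for _label, freq, vector in senses:
        for i, value in enumerate(vector):
            total[i] += freq * value
    return total


def document_vector(words: List[Tuple[str, Optional[str]]]) -> List[float]:
    """Sum the per-word vectors to obtain the document vector (step S13)."""
    total = [0.0] * DIMENSIONS
    for word, sense in words:
        for i, value in enumerate(word_vector(word, sense)):
            total[i] += value
    return total
```

Passing `None` as the sense models a word that steps S11 and S12 could not disambiguate, so every sense contributes in proportion to its frequency.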

[0062] In the subsequent step S14, registration in the document database 3b is performed, and the process returns to the main routine (FIG. 8). That is, the content of the document and the feature amounts of the document vector obtained in step S13 are registered in the document database 3b, and the index of the document database 3b is updated.

[0063] FIG. 10 is a flowchart showing the procedure of the profile update process executed in step S5 (FIG. 8). This process is executed when a specific document file is accessed at the user's instruction, for example when a file that is not registered in the document database 3b is read from or written to an external storage medium such as a personal-use FD, or when a Web page is accessed via the Internet.

[0064] In step S21, document data is obtained. Next, in step S22, words are extracted by morphological analysis and dependency analysis is then performed. In step S23, as described above, the co-occurrence database 3d is consulted, and if the pair of the modifying word and the modified word is stored in the co-occurrence database 3d, the meaning of the word is specified.

[0065] Next, in step S24, as described above, the word vector dictionary 3a is searched using the words extracted in step S22 and the meanings specified in step S23 to generate meaning vectors, and a document vector is then generated. In step S25, the generated document vector is added to the cumulative profile, and in the subsequent step S26 the cumulative profile is normalized with reference to "1", thereby creating a normalized profile.
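Steps S25 and S26 can be illustrated with the short sketch below. It assumes that normalizing "with reference to 1" means scaling the cumulative profile to Euclidean length 1, which is consistent with the later statement that the cosine measure reduces to an inner product; the function names are hypothetical.

```python
import math
from typing import List


def add_to_cumulative_profile(cumulative: List[float],
                              document_vec: List[float]) -> List[float]:
    """Step S25: add the new document vector to the cumulative profile."""
    return [c + d for c, d in zip(cumulative, document_vec)]


def normalize_profile(cumulative: List[float]) -> List[float]:
    """Step S26: scale the cumulative profile to unit length.

    Assumption: the reference value "1" is the Euclidean norm.
    An all-zero profile is returned unchanged to avoid dividing by zero.
    """
    norm = math.sqrt(sum(value * value for value in cumulative))
    if norm == 0.0:
        return list(cumulative)
    return [value / norm for value in cumulative]
```

Keeping the unnormalized cumulative profile and deriving the normalized one from it on each update means that later documents are added on the same scale as earlier ones.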

[0066] After the user profile has been updated in this way, step S27 performs the file processing that was originally requested (for example, reading from or writing to the file), and the process then returns to the main routine.

[0067] FIG. 11 is a flowchart showing the procedure of the search execution process executed in step S6 (FIG. 8).

[0068] In step S31, search query input is executed. The user enters a query as a natural-language sentence, as a plurality of keywords, or by designating an existing document, and a text string of the query corresponding to the input is obtained. For example, when an existing document is designated as the query, the document is accessed and converted into an appropriate format so that its content becomes a text file, and the text string is obtained from it.

[0069] Next, in step S32, a query vector is generated on the basis of the text string.

[0070] FIG. 12 is a flowchart of the query vector generation routine executed in step S32.

[0071] In step S41, words are extracted from the user-specified search query by morphological analysis using a morphological analysis dictionary, and dependency analysis is then performed. In the subsequent step S42, all results of the dependency analysis are checked against the co-occurrence database 3d, and if the pair of an analyzed modifying word and modified word is stored in the co-occurrence database 3d, the meaning of the word is specified.
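The co-occurrence lookup of step S42 can be sketched as a table keyed by dependency pairs. The layout of the co-occurrence database 3d is not given in this section, so the structure below, the sample entries, and the names `CO_OCCURRENCE_DB`, `resolve_by_cooccurrence`, and `unresolved_words` are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the co-occurrence database 3d:
# (modifying word, modified word) -> senses fixed by that pair.
CO_OCCURRENCE_DB: Dict[Tuple[str, str], Dict[str, str]] = {
    ("form", "print"): {"form": "business form"},
    ("pitching", "form"): {"form": "posture"},
}


def resolve_by_cooccurrence(dependencies: List[Tuple[str, str]]) -> Dict[str, str]:
    """Check every dependency pair against the database and collect
    the senses that a stored pair determines."""
    senses: Dict[str, str] = {}
    for pair in dependencies:
        senses.update(CO_OCCURRENCE_DB.get(pair, {}))
    return senses


def unresolved_words(ambiguous_words: List[str],
                     senses: Dict[str, str]) -> List[str]:
    """Return the ambiguous words that still have no sense assigned."""
    return [word for word in ambiguous_words if word not in senses]
```

An empty result from `unresolved_words` corresponds to the case in which every word has been disambiguated and no further disambiguation is needed.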

[0072] Next, in step S43, it is determined whether the ambiguity of every word in the query has been resolved. If it has, the process proceeds immediately to step S45. If it has not, the process proceeds to step S44, where disambiguation based on the user profile is performed, and then proceeds to step S45.

[0073] Specifically, in step S44, for the words written in the query whose ambiguity was not resolved, the cosine measure SD(X, Q) between the word vector X, which represents each of the meanings of such a word, and the normalized user profile vector Q is obtained, and this cosine measure SD(X, Q) is taken as the similarity. The word vector X may correspond to a single selected meaning or to several, and is in general expressed as an n-dimensional vector (x1 to xn) as shown in Equation (1). Similarly, the normalized user profile vector Q is expressed as an n-dimensional vector (q1 to qn) as shown in Equation (2), and the cosine measure SD(X, Q) is the inner product of the two vectors divided by the product of their magnitudes. Because the word vector X and the normalized user profile vector Q are both normalized with reference to "1", the cosine measure SD(X, Q) equals the inner product, and is therefore the sum of the products of the feature amounts of the two vectors in the same dimension, as shown in Equation (3).

【００７４】[0074]

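[Equation 1]

\[ X = (x_1, x_2, \ldots, x_n) \tag{1} \]

\[ Q = (q_1, q_2, \ldots, q_n) \tag{2} \]

\[ SD(X, Q) = \frac{X \cdot Q}{|X|\,|Q|} = \sum_{i=1}^{n} x_i q_i \tag{3} \]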

[0075] The cosine measure SD(X, Q), that is, the similarity, is obtained in this way, and disambiguation by the user profile is performed by selecting the meanings whose similarity is at or above a certain threshold and excluding the meanings regarded as irrelevant.
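The threshold selection of step S44 can be sketched as below. The sketch computes the full cosine measure rather than relying on pre-normalized vectors, and the threshold value, the candidate vectors, and the function names are hypothetical; the text only says that "a certain threshold" is used.

```python
import math
from typing import Dict, List


def cosine_measure(x: List[float], q: List[float]) -> float:
    """Inner product of x and q divided by the product of their magnitudes.
    Returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(x, q))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def select_senses_by_profile(candidates: Dict[str, List[float]],
                             profile: List[float],
                             threshold: float) -> List[str]:
    """Keep the senses whose similarity to the profile is at or above the
    threshold; the remaining senses are treated as irrelevant."""
    return [label for label, vector in candidates.items()
            if cosine_measure(vector, profile) >= threshold]
```

With a profile weighted toward the third dimension, for example `[0.0, 0.6, 0.8]`, and a threshold of 0.7, a "posture" sense at `[0.0, 0.0, 1.0]` is kept while a "business form" sense at `[0.0, 1.0, 0.0]` is excluded.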

[0076] In this way, the user profile is used to disambiguate words that could not be disambiguated by the co-occurrence database in step S42. The point to emphasize is that disambiguation by co-occurrence information takes priority over disambiguation by the user profile.

[0077] In step S45, the query is expanded using the query expansion dictionary 3e. That is, in accordance with the user's instruction, one of several variations is performed, such as "synonym expansion only", "synonym expansion plus related-word expansion", or "related-word expansion only".
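The three expansion variations of step S45 can be sketched as a mode switch over a sense-keyed dictionary. The entries, the mode strings, and the names `QUERY_EXPANSION_DICT` and `expand_word` are invented for the example; the actual format of the query expansion dictionary 3e is shown only in FIG. 7.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the query expansion dictionary 3e,
# keyed by (word, sense) so that only the chosen sense is expanded.
QUERY_EXPANSION_DICT: Dict[Tuple[str, str], Dict[str, List[str]]] = {
    ("form", "business form"): {"synonyms": ["slip", "voucher"],
                                "related": ["invoice"]},
    ("form", "posture"): {"synonyms": ["stance"], "related": ["pitching"]},
}

MODES = ("synonyms", "synonyms+related", "related")


def expand_word(word: str, sense: str, mode: str) -> List[str]:
    """Return the word followed by its expansions for the requested mode."""
    if mode not in MODES:
        raise ValueError(f"unknown expansion mode: {mode!r}")
    entry = QUERY_EXPANSION_DICT.get((word, sense), {})
    expanded = [word]
    if mode in ("synonyms", "synonyms+related"):
        expanded += entry.get("synonyms", [])
    if mode in ("related", "synonyms+related"):
        expanded += entry.get("related", [])
    return expanded
```

Keying the dictionary by sense rather than by written form is what keeps expansions for the unselected sense out of the query.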

[0078] Next, in step S46, the document vector of the search query is generated. That is, the word vector dictionary 3a is searched using the words extracted and the meanings specified in the processing so far, the feature amount in each dimension is calculated for each word, a document vector is generated from their sum, and the process returns to the routine of FIG. 11.

[0079] Next, in step S33 of FIG. 11, the cosine measure SD(X', Q'), that is, the similarity, is calculated from the query vector Q' obtained in step S32 and the document vector X' of each document in the document database 3b to be searched, and is stored in the RAM 7.

[0080] FIG. 13 is a flowchart of the similarity generation routine executed in step S33.

[0081] In step S51, the count value N of a counter that designates the document to be searched in the document database 3b is set to its initial value of 1. In the subsequent step S52, the document database 3b is searched and the document vector X' of the N-th document (N = 1 in the first pass of the loop) is read out. In step S53, the similarity is calculated from the N-th document vector X' and the query vector Q' of the search query. That is, the search query is regarded as a single document to obtain the query vector Q', and the cosine measure SD(X', Q') between the document vector X' of the document in the document database 3b and the query vector Q' is obtained as the similarity (see Equations (4) to (6) below).

【００８２】[0082]

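[Equation 2]

\[ X' = (x'_1, x'_2, \ldots, x'_n) \tag{4} \]

\[ Q' = (q'_1, q'_2, \ldots, q'_n) \tag{5} \]

\[ SD(X', Q') = \frac{X' \cdot Q'}{|X'|\,|Q'|} \tag{6} \]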

[0083] For example, suppose the notation "form" is interpreted according to the two profiles shown in FIGS. 5(a) and 5(b). Under the profile of FIG. 5(a), the feature amount of dimension 7 is large, so of word 7 and word 8 in FIG. 3, the inner product is larger for word 8 ("form (posture)"), whose feature amount of dimension 7 is large, and the meaning "posture" is adopted.

[0084] Under the profile of FIG. 5(b), on the other hand, the feature amount of dimension 4 is large, so of word 7 and word 8 in FIG. 3, the inner product is larger for word 7 ("form (business form)"), whose feature amount of dimension 4 is large, and the meaning "business form" is adopted.

[0085] Next, in step S54, the calculated similarity is stored in the RAM 7. In the subsequent step S55, it is determined whether any documents to be searched remain in the document database 3b. If none remain, the process returns to the routine of FIG. 11. If any remain, the count value N of the counter is incremented by 1 in step S56, the process returns to step S52, and the above processing is repeated.

[0086] In the present embodiment the similarity is stored in the RAM 7, but it may instead be registered in the document database 3b.

[0087] Returning to the routine of FIG. 11, in step S34 the RAM 7 is referenced and the documents are ordered by the similarity obtained for each document in step S33. In step S35, the documents ordered in step S34 are listed as the search result and displayed on the display device 2. At this time, the similarity values registered in step S33 are displayed as well.
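The similarity loop of FIG. 13 and the ordering of steps S34 and S35 can be sketched together. This is an illustrative sketch under assumptions: the document database is modeled as a plain mapping from a document identifier to its vector, and `rank_documents` is a hypothetical name.

```python
import math
from typing import Dict, List, Tuple


def cosine_measure(x: List[float], q: List[float]) -> float:
    """Inner product divided by the product of the magnitudes (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(x, q))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def rank_documents(query_vector: List[float],
                   document_vectors: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Compute one similarity per document, then order by similarity.

    The loop plays the role of steps S51 to S56 (visit each document once
    and store its similarity); the sort plays the role of step S34.
    """
    similarities: List[Tuple[str, float]] = []
    for doc_id, doc_vector in document_vectors.items():
        similarities.append((doc_id, cosine_measure(doc_vector, query_vector)))
    return sorted(similarities, key=lambda item: item[1], reverse=True)
```

Returning the similarity alongside each identifier mirrors the statement that the similarity values are displayed together with the listed documents.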

[0088] As described above, according to the present embodiment, analyzing the query in accordance with the co-occurrence database 3d allows the meanings in the query to be interpreted accurately, so that the ambiguity of the query can be resolved with high accuracy. When disambiguation based on the co-occurrence database 3d is not possible, the ambiguity can be resolved from the user profile information in accordance with the user's past preferences. Search results that are more accurate and closer to the user's intention can therefore be obtained easily and quickly.

[0089] FIG. 14 is a block diagram showing a second embodiment of the document search device as the information search device according to the present invention. The second embodiment illustrates the case in which the query is not expanded, so the query expansion dictionary is omitted.

[0090] FIG. 15 is a flowchart of the query vector generation routine in the second embodiment. In step S61, words are extracted from the user-specified search query by morphological analysis using a morphological analysis dictionary, and dependency analysis is then performed. In the subsequent step S62, all results of the dependency analysis are checked against the co-occurrence database 3d, and the plausibility of each meaning (the meaning likelihood) is calculated according to whether the pair of an analyzed modifying word and modified word is stored in the co-occurrence database 3d.

[0091] Next, in step S63, a similarity based on the cosine measure is calculated as in the first embodiment, and a meaning likelihood is obtained from that similarity. The meaning likelihoods obtained in steps S62 and S63 are then added together, and the meaning finally judged most plausible is selected. By giving the meaning likelihood of step S62 the larger weight in this sum, disambiguation by co-occurrence information is given priority over disambiguation by the profile.
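The weighted combination used in the second embodiment can be sketched as follows. The text states only that the step S62 likelihood receives the larger weight; the specific weights 2.0 and 1.0, the likelihood values, and the name `choose_sense` are assumptions for illustration.

```python
from typing import Dict


def choose_sense(cooccurrence_likelihood: Dict[str, float],
                 profile_likelihood: Dict[str, float],
                 cooccurrence_weight: float = 2.0,
                 profile_weight: float = 1.0) -> str:
    """Add the two likelihoods per sense, weighting the co-occurrence
    likelihood (S62) more heavily than the profile likelihood (S63),
    and return the sense with the highest combined score."""
    senses = sorted(set(cooccurrence_likelihood) | set(profile_likelihood))
    if not senses:
        raise ValueError("no candidate senses")
    combined = {
        sense: cooccurrence_weight * cooccurrence_likelihood.get(sense, 0.0)
        + profile_weight * profile_likelihood.get(sense, 0.0)
        for sense in senses
    }
    # Sorting the senses first makes tie-breaking deterministic.
    return max(senses, key=lambda sense: combined[sense])
```

Unlike the first embodiment, where the profile is consulted only when co-occurrence lookup fails, this scheme always uses both sources and expresses the priority through the weights.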

[0092] In step S64, the document vector of the search query is generated. That is, the word vector dictionary 3a is searched using the words extracted and the meanings specified in the processing so far, the feature amount in each dimension is calculated for each word, and a document vector is generated from their sum.

[0093] The present embodiment thus shows an example in which query expansion is not performed, and also shows another way of giving disambiguation based on the co-occurrence database priority over disambiguation based on user profile information.

[0094] The present invention is not limited to the above embodiments.

[0095] The above embodiments describe similar-document retrieval as the document retrieval method, but the invention can also be applied to other retrieval methods. For example, conventional full-text search systems sometimes expand a query into synonyms. According to the present invention, the query is analyzed on a meaning basis, so a meaning-based synonym dictionary such as that shown in FIG. 7 makes it easy to expand only the synonyms that are needed. Because the expanded words are limited to synonyms of the specific meaning, search noise can be reduced.

[0096] In the above embodiments the user profile is updated on every file access, but it is also preferable to record only the history of accessed files and to update the profile in a batch at predetermined intervals. In that case no extra processing time is incurred on each file access, so files can be accessed comfortably. The batch update can also be run at a time when the computer is not normally in use, such as late at night, which further reduces the effect of profile updating on processing time for ordinary users.

[0097] The user profile is built from the history of accessed files, but the user may instead create the user profile directly.

[0098] Because it can be difficult for a user to create user profile information directly from scratch, some kind of guideline may be prepared so that a profile is created simply by answering questions posed by the system.

[0099] In the above embodiments each individual holds a separate user profile, but in a company or similar organization a profile may be held per group. In that case the profile information is also updated by the file accesses of other members of the group.

[0100] The above embodiments describe two methods of resolving the ambiguity of meanings, one using the co-occurrence database and one using the user profile, but the means of disambiguation is not limited to these two. For example, a list of the meanings of each word may be displayed so that the user selects the desired meaning from the list, or meanings may be interpreted according to context. That is, a context vector is prepared for the many queries issued by the user and is updated each time a query is entered, so that a search matching the user's intention can be performed even if an important word is omitted from a new query. These disambiguation methods may also be used in combination.

【０１０１】[0101]

[Effects of the Invention] As described in detail above, according to the present invention, in similar-document retrieval applying the vector space model, the search query is analyzed and disambiguated in accordance with co-occurrence information, so the query can be interpreted accurately and information retrieval with high accuracy in line with the user's intention can be performed.

[0102] Polysemous words in the search query are disambiguated by referring to user profile information that expresses the user's past preferences, so the query can be interpreted in a way that better matches those preferences, and information retrieval with high accuracy in line with the user's intention can be performed. Furthermore, because disambiguation by co-occurrence information takes priority over disambiguation by user profile information, retrieval that is still more accurate and closer to the user's intention can be performed.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an embodiment (the first embodiment) of a document search device as the information search device according to the present invention.

FIG. 2 is a diagram showing an example of a display screen of the display device.

FIG. 3 is a format diagram of the word vector dictionary.

FIG. 4 is a format diagram of the document database.

FIG. 5 is a format diagram of the user profile.

FIG. 6 is a format diagram of the co-occurrence database.

FIG. 7 is a format diagram of the query expansion dictionary.

FIG. 8 is a flowchart of the main routine showing the search procedure of a document search method as the information search method according to the present invention.

FIG. 9 is a flowchart of the document registration routine.

FIG. 10 is a flowchart of the profile update routine.

FIG. 11 is a flowchart of the search execution routine.

FIG. 12 is a flowchart of the query vector generation routine.

FIG. 13 is a flowchart of the similarity generation routine.

FIG. 14 is a block diagram showing a second embodiment of the document search device as the information search device according to the present invention.

FIG. 15 is a flowchart of the query vector generation routine in the second embodiment.

[Explanation of symbols]

1: input device; 3: HD; 9: CPU

Claims

[Claims]

1. An information retrieval device comprising: search condition input means for inputting a search condition; morphological analysis means for extracting words from the search condition input by the search condition input means; co-occurrence relation storage means for storing co-occurrence information that associates co-occurrence relations between a plurality of words with meanings; first polysemy elimination means for extracting the co-occurrence relations of the search condition on the basis of the co-occurrence information stored in the co-occurrence relation storage means and the analysis result of the morphological analysis means, and eliminating polysemy; and information retrieval means for performing information retrieval on the basis of the meanings whose polysemy has been eliminated by the first polysemy elimination means.

2. The information retrieval apparatus according to claim 1, further comprising a meaning extension means for performing an extension process on a meaning whose meaning has been eliminated by said first ambiguity eliminating means.

3. The meaning expansion means includes at least synonym expansion means for expanding the meanings into synonyms, related word expansion means for expanding related words, and combination expansion means combining these. The information retrieval device according to claim 2.

4. The information retrieval device according to claim 1, further comprising: user profile storage means for storing user profile information expressing user preferences; and second polysemy elimination means for eliminating the polysemy of the search condition on the basis of the user profile information stored in the user profile storage means and the analysis result of the morphological analysis means, wherein the first polysemy elimination means takes priority over the second polysemy elimination means.

5. The information retrieval apparatus according to claim 4, further comprising a meaning extension means for performing an extension process on a meaning whose ambiguity has been eliminated by said second polysemy elimination means.

6. The synonym expansion means includes at least synonym expansion means for expanding synonyms of meanings, related word expansion means for expanding related words, and combination expansion means combining these. The information retrieval device according to claim 5.

7. A document vector generating means for generating a document vector based on the meaning of which the ambiguity has been eliminated by the first ambiguity eliminating means, and the document vector generated by the document vector generating means is registered as a database. 7. The information search device according to claim 1, further comprising a registration unit that performs the registration.

8. A document vector generating means for generating a document vector based on the meaning of which the ambiguity has been eliminated by the first ambiguity eliminating means, and a user profile storing user profile information expressing user preferences. 8. A user profile information updating unit for updating a storage content of a storage unit based on a document vector generated by the document vector generating unit. Information retrieval device.

9. Search condition input means for inputting search conditions,
A morphological analysis unit for extracting a word from the search condition input by the search condition input unit, a user profile storage unit for storing user profile information expressing a user's preference, and the user profile information and the morphological analysis unit. Second ambiguity eliminating means for eliminating the ambiguity of the search condition based on the analysis result, and information retrieval means for performing information retrieval based on the meaning of the ambiguity eliminated by the second ambiguity eliminating means An information retrieval device comprising:

10. A search condition inputting step of inputting a search condition, a morphological analysis step of extracting a word from the search condition input in the search condition inputting step, and associating a co-occurrence relationship between a plurality of words with a meaning. A first polysemy elimination step for extracting a co-occurrence relation of the search condition based on the stored co-occurrence information and the analysis result in the morphological analysis step to eliminate polysemy, and the first polysemy elimination An information search step of performing an information search based on a meaning that has been eliminated in the step.

11. The information retrieval method according to claim 10, further comprising a meaning extension step of performing an extension process on the meaning whose ambiguity has been eliminated in said first ambiguity eliminating step.

12. The semantic expansion step includes at least a synonym expansion step of expanding a meaning into a synonym, a related word expansion step of expanding a related word, and a combination expansion step in which these are combined. The information retrieval method according to claim 11.

13. The method according to claim 1, further comprising a second polysemy resolving step for resolving polysemy of the search condition based on user profile information expressing user preference and an analysis result of the morphological analysis step. 13. The information search method according to claim 10, wherein the disambiguation step has priority over the second disambiguation step.

14. The information retrieval method according to claim 13, further comprising a semantic extension step of performing an extension process on the semantics whose ambiguity has been eliminated in the second ambiguity eliminating step.

15. The semantic expansion step includes at least a synonym expansion step of expanding the meaning into a synonym, a related word expansion step of expanding the meaning into a related word, and a combination expansion step combining these. The information retrieval device according to claim 14.

16. A document vector generating step of generating a document vector based on the meaning of which the ambiguity has been eliminated in the first ambiguity eliminating step, and registration for registering the generated document vector as a database in a registration unit. 16. The information search method according to claim 10, further comprising the steps of:

17. A document vector generating step of generating a document vector based on a meaning removed from the sense of ambiguity in the first ambiguity removing step, and a user profile storing user profile information expressing a user's preference. 17. The information retrieval method according to claim 10, further comprising a user profile information updating step of updating contents stored in a storage unit based on the generated document vector.

18. A search condition input step of inputting search conditions, a morphological analysis step of extracting words from the search conditions input by the search condition input means, and user profile information expressing user preferences are stored. A second ambiguity eliminating step for eliminating the ambiguity of the search condition based on the user profile information and the analysis result of the morphological analysis step, and a word having the ambiguity eliminated by the second ambiguity eliminating step. An information search step of performing an information search based on the information search method.

19. A computer-readable storage medium storing an information search procedure for performing an information search based on a search condition input from an input device, wherein the morphological analysis extracts a word from the input search condition. And extracting a co-occurrence relationship of the search condition based on co-occurrence information stored in association with a co-occurrence relationship between a plurality of words and a meaning and an analysis result of the morphological analysis procedure to eliminate polysemy. A computer-readable storage storing a first ambiguity eliminating procedure and an information retrieval procedure for performing information retrieval based on the meaning of the ambiguity eliminated by the first ambiguity eliminating procedure. Medium.

20. The storage medium according to claim 19, wherein a semantic extension procedure for performing an extension process on the semantics whose ambiguity has been eliminated in the first ambiguity eliminating procedure is stored.

21. A method of eliminating the ambiguity of the search condition based on user profile information expressing user's preference and an analysis result of the dependency analysis step, which is prioritized over the first ambiguity eliminating procedure. 21. The storage medium according to claim 19, wherein the second ambiguity resolution procedure is stored.

22. The storage medium according to claim 21, wherein a semantic extension procedure for performing an extension process on the semantics whose ambiguity has been eliminated in the second ambiguity eliminating procedure is stored.

23. The storage medium according to claim 19, further storing: a document vector generating procedure for generating a document vector based on the meaning whose ambiguity has been eliminated in the first ambiguity eliminating procedure; and a registration procedure for registering the generated document vector as a database in a registration unit.

24. The storage medium according to any one of the above claims, further storing: a document vector generating procedure for generating a document vector based on the meaning whose ambiguity has been eliminated in the first ambiguity eliminating procedure; and a user profile information updating procedure for updating, based on the generated document vector, the storage content of a user profile storage unit that stores user profile information expressing user preferences.

25. A computer-readable storage medium having a data structure including: a word vector dictionary in which word meaning vectors are stored on a per-meaning basis; a document database to be searched; a user profile in which user preferences are recorded; and a co-occurrence database in which co-occurrence relationships between words in a dependency relation are stored.

26. The storage medium according to claim 25, wherein an expansion dictionary for expanding a search condition is stored.

27. A computer-readable storage medium storing an information search procedure for performing an information search based on a search condition input from an input device, the storage medium storing: a morphological analysis procedure for extracting words from the input search condition; a second ambiguity eliminating procedure for eliminating the ambiguity of the search condition based on stored user profile information expressing user preferences and on the analysis result of the morphological analysis procedure; and an information search procedure for performing an information search based on the meaning whose ambiguity has been eliminated by the second ambiguity eliminating procedure.
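
For illustration only, the data structure of claim 25 and the two ambiguity eliminating procedures of claims 19 to 27 might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation disclosed in the specification: the names `SearchData`, `disambiguate`, `search` and `demo`, the dictionary layouts, the use of cosine similarity, and the sample vectors are all hypothetical. The ordering (co-occurrence first, user profile as a fallback) follows the abstract and is likewise an assumption; query extension is omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class SearchData:
    # The four components named in claim 25. Layouts are illustrative assumptions.
    word_vectors: dict   # (word, meaning) -> meaning vector
    documents: dict      # document id -> document vector
    user_profile: list   # preference vector of the user
    cooccurrence: dict   # (word, dependent word) -> meaning


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def disambiguate(data, word, dependents):
    """Pick one meaning of `word`, or None if the word is unknown."""
    meanings = [m for (w, m) in data.word_vectors if w == word]
    if len(meanings) <= 1:
        return meanings[0] if meanings else None
    # First procedure (assumed order): look up the co-occurrence database.
    for dep in dependents:
        meaning = data.cooccurrence.get((word, dep))
        if meaning in meanings:
            return meaning
    # Second procedure: fall back to the meaning closest to the user profile.
    return max(meanings,
               key=lambda m: cosine(data.word_vectors[(word, m)], data.user_profile))


def search(data, query):
    """Rank document ids for a query given as [(word, [dependent words]), ...]."""
    q = [0.0] * len(data.user_profile)
    for word, dependents in query:
        meaning = disambiguate(data, word, dependents)
        if meaning is None:
            continue
        for i, v in enumerate(data.word_vectors[(word, meaning)]):
            q[i] += v
    return sorted(data.documents,
                  key=lambda d: cosine(q, data.documents[d]), reverse=True)


# Hypothetical sample data: "bank" has a financial and a river meaning.
demo = SearchData(
    word_vectors={("bank", "finance"): [1.0, 0.0], ("bank", "river"): [0.0, 1.0]},
    documents={"d1": [1.0, 0.0], "d2": [0.0, 1.0]},
    user_profile=[0.1, 0.9],
    cooccurrence={("bank", "loan"): "finance"},
)
```

With this sample data, `disambiguate(demo, "bank", ["loan"])` is resolved by the co-occurrence entry, while `disambiguate(demo, "bank", [])` falls through to the user profile, which leans toward the river meaning.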