JP2010257155A

JP2010257155A - Information retrieval device, method, program, and computer-readable recording medium

Info

Publication number: JP2010257155A
Application number: JP2009105642A
Authority: JP
Inventors: Shunsuke Konagai; 俊介小長井; Yukio Uematsu; 幸生植松; Yoshihiko Kazuhara; 良彦数原; Ryoji Kataoka; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-04-23
Filing date: 2009-04-23
Publication date: 2010-11-11

Abstract

PROBLEM TO BE SOLVED: To retrieve documents having a specific block that contains all of the keywords, and to output them as information retrieval results in a determined output order.

SOLUTION: When Web pages are input, each document is divided into text blocks, and a linguistic feature (reliability) of each text block is calculated and stored in a storage means. When a plurality of keywords are designated as a retrieval request, documents having a specific text block that contains all of the keywords are retrieved as information retrieval results. The linguistic features (reliability) stored in the storage means are used as a parameter for determining the output order of the retrieved results, and the results are output in that order.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to an information retrieval device, method, program, and computer-readable recording medium, and in particular to those for performing, in an information retrieval system such as an Internet search engine, retrieval that reflects the linguistic characteristics of the text in documents containing a plurality of texts.

In recent years, with the spread of the Internet, systems and services that accurately retrieve the information users need from the enormous body of documents on the Internet have become increasingly important. In general, a search service determines the output order of retrieval results from two factors: the degree of match between the search keywords and a document, based on the number of times the keywords entered by the user appear in the target document and in the anchor text of links to it from other documents; and the importance of the document, such as how often it is referenced by other documents.

Techniques based on word statistics, such as tf-idf and BM25, are generally used for the degree of match between search keywords and a document. These techniques assume that words appearing in a document more frequently than the average over a given document collection characterize that document, and they rank a document higher the more closely the search keywords specified by the user match the document's characteristic words (see, for example, Non-Patent Document 1).
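The word-statistics idea can be illustrated with a minimal tf-idf sketch. This is only an illustration under simple assumptions, not the scoring of any particular search engine: the function name `tf_idf_scores`, the whitespace-free token lists, and the plain `log(N/df)` weighting are all choices made for this example.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs):
    """Return one tf-idf score per document (each document is a list of tokens)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        if not doc:
            scores.append(0.0)
            continue
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if df[term] == 0:
                continue  # term occurs nowhere in the collection
            idf = math.log(n_docs / df[term])
            # Relative term frequency weighted by rarity in the collection.
            score += (tf[term] / len(doc)) * idf
        scores.append(score)
    return scores


docs = [["cat", "cat", "dog"], ["dog", "bird"], ["fish"]]
scores = tf_idf_scores(["cat"], docs)
```

A term that is frequent in one document but rare in the collection, like "cat" in the first document here, contributes the most to that document's score.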

However, in current information retrieval systems for the Internet, the number of target documents is so enormous that a search with a single keyword returns many results that differ from the user's intent. Users therefore generally search by entering two or more keywords.

In such cases, a conventional information retrieval system treats the entire document as a single unit. As a result, a document containing texts on several different subjects is included in the search results even when the keywords entered by the user each appear in a different text.

Typically, in an electronic bulletin board system on the Internet, a single document commonly contains multiple texts posted by multiple authors. Even documents written by a single author, such as online diaries and blogs, often contain multiple texts on different subjects, for example diary entries or articles for several days.

Some conventional information retrieval systems mitigate this problem by recording the position of each word in a document, calculating the distances between the positions of the search keywords, and giving higher scores to shorter distances.

In this case, however, recording the positions of all words in all documents makes the index data used for document retrieval enormous.

Moreover, even if multiple search keywords appear very close together, there is no guarantee that they do not span texts on two different subjects.

To solve this problem, there is a method (the sentence-unit inverted index) that divides a document into sentences and stores word occurrence positions in units of sentences rather than words (see, for example, Non-Patent Document 2).

If this sentence-unit inverted index is used to calculate the distances between the positions of the search keywords and to give higher scores to shorter distances, the amount of data needed to record the positions of all words in all documents can be reduced.

Alternatively, if the sentence-unit inverted index is used not to calculate distances between keyword positions but to return only documents in which the keywords appear within the same sentence, it can be guaranteed that the keywords do not span texts on different subjects.
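A sentence-unit inverted index and the same-sentence search can be sketched as follows. This is a simplified illustration rather than the method of Non-Patent Document 2: the function names are hypothetical, documents are assumed to be already split into sentences, and whitespace tokenization stands in for the morphological analysis that Japanese text would need.

```python
from collections import defaultdict


def build_sentence_index(docs):
    """Map each word to the set of (doc_id, sentence_no) pairs in which it occurs."""
    index = defaultdict(set)
    for doc_id, sentences in docs.items():
        for sent_no, sentence in enumerate(sentences):
            for word in sentence.split():
                index[word].add((doc_id, sent_no))
    return index


def docs_with_cooccurrence(index, keywords):
    """Return IDs of documents in which all keywords occur within one sentence."""
    postings = [index.get(keyword, set()) for keyword in keywords]
    if not postings:
        return set()
    # A (doc_id, sentence_no) pair survives only if every keyword has it.
    common = set.intersection(*postings)
    return {doc_id for doc_id, _ in common}


docs = {
    "a": ["the cat sleeps", "the dog barks"],
    "b": ["the cat and the dog play"],
}
index = build_sentence_index(docs)
hits = docs_with_cooccurrence(index, ["cat", "dog"])
```

Document "a" contains both words but in different sentences, so only "b" is returned. The index stores one sentence number per occurrence instead of one word position, which is where the size saving comes from.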

Non-Patent Document 1: S. Robertson, H. Zaragoza, M. Taylor, "Simple BM25 extension to multiple weighted fields", Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, 2004.
Non-Patent Document 2: Yukio Uematsu, Kengo Fujioka, Takashi Inoue, Ryoji Kataoka, Hayato Owada, "Proximity Search Method Using a Sentence-Unit Inverted Index", DBSJ Letters Vol. 6, No. 4 (2008, 3, 21).

However, when the sentence-unit inverted index described in Non-Patent Document 2 is used not to calculate distances between keyword positions but to return only documents in which the keywords appear within the same sentence, the search condition becomes too restrictive, and documents that actually match the search request are excluded from the results.

The present invention has been made in view of the above points. Its object is to provide an information retrieval device, method, program, and computer-readable recording medium that, when a plurality of keywords are specified as a search request, can output documents having a specific block that contains all of the keywords as information retrieval results.

FIG. 1 is a diagram showing the principle configuration of the present invention.

The present invention (claim 1) is an information retrieval device that takes document information, typified by Web pages, as its search target and has block dividing means 2 that divides each input document into text blocks and stores them in block storage means 11, and search means 6 that, when a plurality of keywords are specified as a search request, outputs documents having a specific text block containing all of the keywords as information retrieval results. The device further has:
linguistic feature calculating means 4 that calculates a linguistic feature of each text block and stores it in linguistic feature storage means 12; and
output order determining means 5 that determines the output order of the information retrieval results found by the search means 6, using the linguistic features stored in the linguistic feature storage means 12 as a parameter for determining the output order.

The output order determining means 5 of the present invention (claim 2) includes means for using the length of each text block produced by the block dividing means 2 as the linguistic feature.

The present invention (claim 3) further has:
linguistic feature storage means that stores linguistic features; and
linguistic feature computing means that computes, for the sentences in a text block, the linguistic features that match those registered in the linguistic feature storage means.
The output order determining means 5 uses the linguistic features computed by the linguistic feature computing means as a parameter for determining the output order of the information retrieval results.

The linguistic feature storage means of the present invention (claim 4) stores sentence-ending expressions as the linguistic features.

The linguistic feature storage means of the present invention (claim 5) stores evaluative expressions of text blocks as the linguistic features.

FIG. 2 is a diagram for explaining the principle of the present invention.

The present invention (claim 6) is an information retrieval method that takes document information, typified by Web pages, as its search target, divides each input document into text blocks (step 1), and, when a plurality of keywords are specified as a search request (step 3), retrieves documents having a specific text block containing all of the keywords (step 4) and outputs them as information retrieval results. The method performs:
a linguistic feature calculating step (step 2) of calculating a linguistic feature of each text block and storing it in linguistic feature storage means; and
an output order determining step (step 5) of determining the output order of the information retrieval results found by the search means, using the linguistic features stored in the linguistic feature storage means as a parameter for determining the output order.

In the present invention (claim 7), the output order determining step uses the length of each divided text block as the linguistic feature and as a parameter for determining the output order of the information retrieval results.

The present invention (claim 8) is an information retrieval program for causing a computer to function as each of the means constituting the information retrieval device according to any one of claims 1 to 5.

The present invention (claim 9) is a computer-readable recording medium storing the information retrieval program according to claim 8.

As described above, the inventions according to claims 1 and 6 make it possible to exclude from the search results documents in which the search keywords are each contained in separate text blocks. Examples are documents containing multiple texts posted by multiple authors, as in an electronic bulletin board system on the Internet, and documents such as online diaries and blogs in which a single author's diary entries or articles for several days cover different subjects.

Further, according to the inventions of claims 2 and 7, the evaluation score of a document for a search request is normalized not by the length of the entire document, as in the conventional BM25 technique, but by the length of the text block containing the search keywords, so search results that better match the user's intuition can be provided. For example, BM25, which uses the length of the entire document, gives the shorter document the higher score when the same number of search keywords appear in a long document and a short document. However, for documents containing texts on multiple subjects, the lengths of the text blocks containing the keywords may be reversed between the long document and the short one, so the conventional technique yields results that do not match the user's intuition. Because the present invention normalizes the evaluation score by the length of the text block containing the search keywords, this problem is solved.
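The reversal described here can be reproduced with a small numeric sketch. It uses the standard BM25 term-frequency component with the IDF factor omitted; the lengths, averages, and parameter values (`k1=1.2`, `b=0.75`) are hypothetical figures chosen for illustration and do not come from the patent.

```python
def bm25_tf_component(tf, length, avg_length, k1=1.2, b=0.75):
    """BM25 term-frequency saturation with length normalization (IDF omitted)."""
    norm = k1 * (1.0 - b + b * length / avg_length)
    return tf * (k1 + 1.0) / (tf + norm)


# Hypothetical figures: the keyword occurs twice in each document.
documents = {
    "long_doc": {"doc_len": 1000, "block_len": 50},   # long page, short matching block
    "short_doc": {"doc_len": 200, "block_len": 150},  # short page, long matching block
}
AVG_DOC_LEN = 600    # assumed collection average of document lengths
AVG_BLOCK_LEN = 100  # assumed collection average of block lengths

# Conventional normalization by the length of the whole document.
by_document_length = {
    name: bm25_tf_component(2, d["doc_len"], AVG_DOC_LEN)
    for name, d in documents.items()
}
# Normalization by the length of the block that contains the keywords.
by_block_length = {
    name: bm25_tf_component(2, d["block_len"], AVG_BLOCK_LEN)
    for name, d in documents.items()
}
```

With whole-document normalization the short document scores higher, whereas with block-length normalization the long document, whose matching block is short and dense in keywords, scores higher.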

According to the invention of claim 3, users can be provided with a search result ranking based on the linguistic features of the text blocks containing the search keywords.

According to the invention of claim 4, users can be provided with a search result ranking based on the sentence-ending expressions of the text blocks containing the search keywords. Sentence-ending expressions make it possible to reflect in the ranking differences in the author's certainty, as between "である。" (an assertive "it is") and "かな？" ("I wonder?"), and differences in how casual the writing is, as between "いたしました。" (a polite "I did") and "じゃん。" (a colloquial "isn't it").

According to the invention of claim 5, users can be provided with a search result ranking based on the evaluative expressions of the text blocks containing the search keywords. Evaluative expressions such as "delicious", "tastes bad", "was good", and "was bad" make it possible to reflect in the ranking whether a text is positive or negative.

As described above, the present invention not only keeps the full-text search index smaller than methods based on inverted lists in units of words or sentences, but also, in searches with multiple keywords, keeps documents that combine texts on multiple subjects from being erroneously included in the results. Furthermore, by using linguistic features of each block into which a document is divided, such as its certainty or evaluative expressions, as parameters for determining the output order, a variety of output methods become possible, such as giving higher priority to texts with higher certainty or to texts containing many positive expressions.

FIG. 1 is a diagram showing the principle configuration of the present invention.
FIG. 2 is a diagram for explaining the principle of the present invention.
FIG. 3 is a configuration diagram of the information retrieval device in one embodiment of the present invention.
FIG. 4 is an example of Web documents in one embodiment of the present invention.
FIG. 5 is a data example of the document index storage unit in one embodiment of the present invention.
FIG. 6 is a data example of the certainty expression storage unit in one embodiment of the present invention.
FIG. 7 is a data example of the block certainty storage unit in one embodiment of the present invention.
FIG. 8 is a flowchart of the information retrieval device in one embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 3 shows the configuration of the information retrieval device in one embodiment of the present invention.

The information retrieval device 10 shown in the figure has a Web document input unit 1, a block dividing unit 2, an index unit 3, a certainty calculation unit 4, an integrated ranking calculation unit 5, a keyword match calculation unit 6, a document index storage unit 11, a block certainty storage unit 12, and a certainty expression storage unit 13.

The document index storage unit 11 is a storage medium that stores the index generated by the index unit 3.

The block certainty storage unit 12 stores the certainty values obtained by the certainty calculation unit 4.

The certainty expression storage unit 13 is a storage medium holding a table that assigns to each sentence-ending expression a certainty value, a numerical measure of how confidently the writer states the sentence.

The Web document input unit 1 receives the Web documents 21 and 22 to be searched and passes them to the block dividing unit 2. Here, a Web document is a document containing multiple texts posted to an electronic bulletin board on the Internet, or a document such as a blog containing texts on different subjects. FIG. 4 shows an example of input Web documents.

The block dividing unit 2 divides each input Web document into a plurality of text blocks and passes each block to the index unit 3 and the certainty calculation unit 4.

The index unit 3 divides each text block into units for full-text search and stores them in the document index storage unit 11. FIG. 5 shows an example of the document index storage unit 11.

For the sentence-ending expressions in each text block obtained from the block dividing unit 2, the certainty calculation unit 4 refers to the certainty expression storage unit 13 shown in FIG. 6, calculates the certainty of the text block as a whole, and stores the block certainty for each text block in the block certainty storage unit 12, as shown in FIG. 7. Although this embodiment uses certainty as the linguistic feature of each text block, the invention is not limited to this example, and various evaluative expressions can also be used.

The keyword match calculation unit 6 is connected to the information retrieval terminal 30. It obtains the search keywords entered at the terminal, refers to the document index storage unit 11 to obtain the text blocks containing the keywords, calculates their degree of match with the keywords, and passes the result to the integrated ranking calculation unit 5.

The integrated ranking calculation unit 5 combines the degree of match obtained by the keyword match calculation unit 6 with the certainty of each text block obtained from the block certainty storage unit 12, determines the output order of the information retrieval results, and outputs the results to the information retrieval terminal 30.

The sequence of operations of the above configuration is described below.

FIG. 8 is a flowchart of the information retrieval device in one embodiment of the present invention.

Step 101) The Web document input unit 1 acquires the Web documents 21 and 22 and passes them to the block dividing unit 2. The block dividing unit 2 divides the Web documents 21 and 22 into text blocks. The division may be based on the structure of the Web document's HTML or on the layout of the document when displayed; the particular method is not essential to the present invention. In this example, Web document 21 is assumed to be divided into block 21-1, from the start of the document to "である。", and the remainder 21-2, and Web document 22 into block 22-1, from the start of the document to "と思う。" ("I think"), and the remainder 22-2.
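One way to divide a page by its HTML structure can be sketched with Python's standard `html.parser` module. This is only one possible approach, since the text leaves the division method open: the class name `BlockSplitter` and the set of tags treated as block boundaries are assumptions made for this example.

```python
from html.parser import HTMLParser


class BlockSplitter(HTMLParser):
    """Collect the text of a page into blocks, cutting at selected tags."""

    # Assumed choice of tags that start or end a block.
    BOUNDARY_TAGS = {"article", "div", "hr", "li", "p", "section"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []

    def _flush(self):
        # Close the current block if it holds any visible text.
        text = "".join(self._buffer).strip()
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BOUNDARY_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.BOUNDARY_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # keep any text after the last boundary tag


def split_into_blocks(html):
    """Return the list of text blocks found in an HTML string."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    return splitter.blocks


page = "<div>The Singapura is a <b>cat</b>.</div><div>I think it is small.</div>"
blocks = split_into_blocks(page)
```

Inline tags such as `<b>` do not cut a block, so each bulletin-board post or blog entry wrapped in its own container becomes one block.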

Step 102) The index unit 3, having obtained the text blocks from the block dividing unit 2, divides each text block into units for full-text search, such as words, n-grams, or a suffix array, and stores them in the document index storage unit 11. The format of the document index is not essential to the present invention and may be any other format. In this example, as one instance of a word-based index, a document index is created recording that the word "猫" ("cat") appears in the documents containing text blocks 21-1 and 22-1. The index may also contain information found in ordinary full-text search indexes, such as tf, idf, and HTML markup information for words, but the details are omitted because they are not essential to the present invention.
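A word-based index over text blocks might look like the following sketch. The block texts are invented English stand-ins for the contents of FIG. 4, the function name `build_block_index` is hypothetical, and whitespace tokenization replaces the word segmentation a real system would need.

```python
from collections import defaultdict


def build_block_index(blocks):
    """Map each word to the set of block IDs that contain it.

    `blocks` maps a block ID such as "21-1" to the text of that block.
    """
    index = defaultdict(set)
    for block_id, text in blocks.items():
        for word in text.lower().split():
            index[word].add(block_id)
    return index


# Invented sample texts; only the block IDs follow the example above.
blocks = {
    "21-1": "the singapura is a cat",
    "21-2": "today it rained all day",
    "22-1": "i think the singapura is a small cat",
    "22-2": "the weather was fine",
}
block_index = build_block_index(blocks)
```

Here `block_index["cat"]` holds the block IDs "21-1" and "22-1", mirroring the example in which the word "cat" is recorded for those two blocks. Each posting is a block ID rather than a word position, which keeps the index compact.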

Step 103) The certainty calculation unit 4 obtains a text block from the block dividing unit 2, checks the sentence-ending expression of each sentence in the block against the certainty expression storage unit 13, calculates the certainty of the block as a whole, and stores it in the block certainty storage unit 12. In the example of FIG. 6, the certainty values 8.0 and 7.5 of the sentence endings "だ。" and "である。" of the two sentences making up text block 21-1, "…Singapura…だ。" and "…cat…である。", are averaged, and the certainty of text block 21-1 is recorded as 7.75 in the block certainty storage unit 12. The certainty of the whole block need not be computed as a mean; the median, the mode, a linear combination with different weights for each sentence, or any other method may be used, and the choice is not essential to the present invention.
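The averaging in this step can be sketched as below. The values for "だ。" and "である。" are the ones given in the example; the entry for "と思う。", the sample sentences, the function names, and the rule that the longest matching ending wins are assumptions made for illustration.

```python
import re
from statistics import mean

# Sentence-ending expression -> certainty. The value for "と思う。" is hypothetical.
CERTAINTY_TABLE = {"だ。": 8.0, "である。": 7.5, "と思う。": 5.0}


def split_sentences(block):
    """Split a block into sentences ending with the full stop "。".

    Trailing text without a full stop is ignored in this sketch.
    """
    return [s for s in re.findall(r"[^。]*。", block) if s.strip()]


def sentence_certainty(sentence, table=CERTAINTY_TABLE):
    """Return the certainty of the longest registered ending, or None."""
    for ending in sorted(table, key=len, reverse=True):
        if sentence.endswith(ending):
            return table[ending]
    return None


def block_certainty(block, table=CERTAINTY_TABLE):
    """Average the certainty of all sentences with a registered ending."""
    values = [sentence_certainty(s, table) for s in split_sentences(block)]
    values = [v for v in values if v is not None]
    return mean(values) if values else None


# Invented sentences with the two endings from the example.
block_21_1 = "シンガプーラは小型の猫だ。猫は動物である。"
certainty_21_1 = block_certainty(block_21_1)
```

For the two endings in the example this yields (8.0 + 7.5) / 2 = 7.75. Swapping `mean` for `statistics.median` or a weighted sum would give the alternative aggregations mentioned in the text.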

Step 104) The information retrieval user enters one or more search keywords at the information retrieval terminal 30 and sends a retrieval request to the information retrieval system.

Step 105) The keyword match calculation unit 6 looks up the entered search keywords in the document index storage unit 11, lists the blocks containing the keywords, and calculates their degree of match with the keywords by a method such as tf-idf or BM25.
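The block-level matching can be sketched as follows. For brevity this illustration scans the blocks directly instead of consulting a stored index, and it leaves out the tf-idf or BM25 scoring; the function names, the sample texts, and document "23" are invented for the example.

```python
def tokenize(text):
    """Return the set of lowercase whitespace-separated words in a block."""
    return set(text.lower().split())


def match_documents(docs, keywords):
    """Return {doc_id: [block_no, ...]} for documents that have at least
    one block containing every keyword."""
    wanted = {keyword.lower() for keyword in keywords}
    if not wanted:
        return {}
    hits = {}
    for doc_id, blocks in docs.items():
        matching = [
            block_no
            for block_no, block in enumerate(blocks)
            if wanted <= tokenize(block)  # all keywords inside this one block
        ]
        if matching:
            hits[doc_id] = matching
    return hits


docs = {
    "21": ["the singapura is a cat", "today it rained all day"],
    "23": ["my cat sleeps all day", "i walked along singapura street"],
}
hits = match_documents(docs, ["cat", "singapura"])
```

Document "23" contains both keywords, but in different blocks, so it is excluded; document "21" is returned with its first block as the matching block.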

Step 106) The integrated ranking calculation unit 5 combines the degree of match obtained in step 105 with the certainty of each text block obtained from the block certainty storage unit 12, and determines the output order of the results to be returned to the information retrieval terminal 30. For example, for a retrieval request specifying the two words "cat" and "Singapura", text blocks 21-1 and 22-1 are both returned as results containing both keywords. If the degree of match calculated by the keyword match calculation unit 6 is exactly the same for text blocks 21-1 and 22-1, text block 21-1, which has the higher certainty, takes priority over text block 22-1.
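The text does not specify how the degree of match and the certainty are combined, so the following sketch simply assumes a weighted sum. The function name `rank_blocks`, the weight of 0.1, the match score of 3.2, and the certainty of 5.0 for block 22-1 are all hypothetical; only the certainty of 7.75 for block 21-1 comes from the example.

```python
def rank_blocks(candidates, certainty_weight=0.1, top_n=10):
    """Order candidate blocks by match score plus weighted certainty.

    `candidates` is a list of (block_id, match_score, certainty) tuples.
    Returns the IDs of the top_n blocks, best first.
    """
    def combined(candidate):
        _, match_score, certainty = candidate
        return match_score + certainty_weight * certainty

    ranked = sorted(candidates, key=combined, reverse=True)
    return [block_id for block_id, _, _ in ranked[:top_n]]


# Equal match scores, so the certainty decides the order.
candidates = [("22-1", 3.2, 5.0), ("21-1", 3.2, 7.75)]
order = rank_blocks(candidates)
```

With identical match scores, block 21-1 is placed ahead of block 22-1 because of its higher certainty, as in the example. The `top_n` cut corresponds to returning only the top N results.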

Step 107) The integrated ranking calculation unit 5 outputs the search results (top N) to the information retrieval terminal 30 in the determined order.

The operations of the components of the information retrieval device shown in FIG. 3 can be built as a program, which can be installed and run on a computer used as the information retrieval device, or distributed over a network.

The program can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and installed on a computer or distributed.

The present invention is not limited to the above embodiment, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to techniques for searching documents that contain multiple texts posted by users, such as electronic bulletin board systems on the Internet, online diaries, and blogs.

DESCRIPTION OF SYMBOLS
1 Web document input unit
2 Block dividing means, block dividing unit
3 Index unit
4 Linguistic feature calculating means, certainty calculation unit
5 Output order determining means, integrated ranking calculation unit
6 Search means, keyword match calculation unit
10 Information retrieval device
11 Block storage means, document index storage unit
12 Linguistic feature storage means, block certainty storage unit
13 Certainty expression storage unit
21, 22 Web documents
30 Information retrieval terminal

Claims

1. An information retrieval device that takes document information, typified by Web pages, as a search target and comprises block dividing means for dividing each input document into text blocks and storing them in block storage means, and search means for outputting, when a plurality of keywords are specified as a search request, a document having a specific text block containing all of the keywords as an information retrieval result, the device being characterized by further comprising:
linguistic feature calculating means for calculating a linguistic feature of each text block and storing it in linguistic feature storage means; and
output order determining means for determining the output order of the information retrieval results found by the search means, using the linguistic features stored in the linguistic feature storage means as a parameter for determining the output order of the information retrieval results.

2. The information retrieval device according to claim 1, wherein the output order determining means includes
means for using the length of each text block divided by the block dividing means as the linguistic feature.

3. The information retrieval device according to claim 1 or 2, further comprising:
linguistic feature storage means for storing linguistic features; and
linguistic feature computing means for computing, for the sentences included in a text block, linguistic features that match the linguistic features registered in the linguistic feature storage means,
wherein the output order determining means
uses the linguistic features computed by the linguistic feature computing means as a parameter for determining the output order of the information retrieval results.

4. The information retrieval device according to claim 3, wherein the linguistic feature storage means
stores sentence-ending expressions as the linguistic features.

5. The information retrieval device according to claim 3, wherein the linguistic feature storage means
stores evaluative expressions of text blocks as the linguistic features.

6. An information retrieval method that takes document information, typified by Web pages, as a search target, divides each input document into text blocks, and, when a plurality of keywords are specified as a search request, outputs a document having a specific text block containing all of the keywords as an information retrieval result, the method being characterized by performing:
a linguistic feature calculating step of calculating a linguistic feature of each text block and storing it in linguistic feature storage means; and
an output order determining step of determining the output order of the information retrieval results found by the search means, using the linguistic features stored in the linguistic feature storage means as a parameter for determining the output order of the information retrieval results.

7. The information retrieval method according to claim 6, wherein the output order determining step
uses the length of each divided text block as the linguistic feature and as a parameter for determining the output order of the information retrieval results.

8. An information retrieval program for causing a computer to function as each of the means constituting the information retrieval device according to any one of claims 1 to 5.

9. A computer-readable recording medium storing the information retrieval program according to claim 8.