JP2529418B2

JP2529418B2 - Document search device

Info

Publication number: JP2529418B2
Application number: JP1288361A
Authority: JP
Inventors: 敦史安藤; 佳宏早川
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1989-11-06
Filing date: 1989-11-06
Publication date: 1996-08-28
Anticipated expiration: 2011-08-28
Also published as: JPH03148765A

Description

[Detailed description of the invention]

Industrial field of application

The present invention relates to a document search device that uses a computer.

Prior art

In recent years, with the spread of word processors and personal computers and the practical use of computer-based character recognition, the number of electronic documents created with them has grown. As a result, interest is increasing in document databases that accumulate large amounts of document information and retrieve it as needed. In conventional document databases, documents were generally retrieved by keyword search, using keywords attached to each document. This approach had several problems: the work of assigning keywords cannot keep up with the growth in stored documents; keywords become obsolete as time passes; and searches using keywords that the database administrator did not anticipate cannot be served, so many relevant documents are missed.

Against this background, document databases called full-text databases have recently attracted attention. A full-text database collates the search condition given by the user against all of the information in the stored documents and outputs the documents that satisfy the condition. The search condition may use character strings such as sentences in addition to words like conventional keywords, and, as before, notations such as regular expressions can be used. The search condition is collated against the document information by using a finite-state automaton or the like to determine whether character strings match.

Such a full-text database has the following advantages over a document database based on keyword search. First, no keyword assignment work is needed, so the document database is easy to build. Second, all of the information in a document is subject to matching, so search conditions can be set with a high degree of freedom and search omissions can be reduced. Third, keywords do not become obsolete over time, so the document database is easy to maintain.

Problem to be solved by the invention

In a full-text database, however, the search condition is set so that matching succeeds wherever the target character string appears in a document. Consequently, the search condition often matches character strings the user did not anticipate, and unnecessary documents are output as search noise. For example, the search condition 「火事」 ("fire") matches 「放火事件」 ("arson case"), and the search condition 「文化」 ("culture") matches 「条文化する」 ("to codify into articles"). This kind of search noise increases as the matching character string gets shorter.
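The substring effect described above can be reproduced in a few lines. The following is a minimal sketch; the function name `substring_hits` and the three sentences are made up for illustration and do not come from the patent.

```python
def substring_hits(query, sentences):
    """Return every sentence that contains `query` anywhere as a substring."""
    return [s for s in sentences if query in s]


# Hypothetical sentences, invented for this illustration.
sentences = [
    "昨夜、近所で火事があった。",          # genuinely about a fire
    "警察は放火事件として捜査している。",  # "arson case": 火事 occurs only by accident
    "この慣行を条文化する。",              # "codify": 文化 occurs only by accident
]

fire_hits = substring_hits("火事", sentences)
culture_hits = substring_hits("文化", sentences)
```

Here `fire_hits` contains the arson sentence as well as the fire sentence, and `culture_hits` contains only the codification sentence, which has nothing to do with culture. Both are search noise in the sense used above.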

In addition, the only relations between multiple matching character strings that a conventional search condition can express are logical OR and logical AND. A search condition that is meant to express a modifier-modified relation between 「白」 ("white") and 「犬」 ("dog"), as in 「白い犬」 ("white dog"), is therefore generally expressed as the logical AND of 「白」 and 「犬」. With such a search condition, matching also succeeds for a character string such as 「白い雪原を犬が」 ("a dog ... across the white snowfield"), and this too becomes search noise.
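A logical AND of terms cannot tell these two cases apart, as the following sketch shows. The helper `and_match` is a hypothetical name, and the two sentences simply complete the fragments quoted above.

```python
def and_match(terms, sentence):
    """True when every term occurs somewhere in the sentence (logical AND)."""
    return all(term in sentence for term in terms)


terms = ["白", "犬"]
wanted = "白い犬が走る。"        # "a white dog runs": 白 modifies 犬
noise = "白い雪原を犬が走る。"   # "a dog runs across the white snowfield": 白 modifies 雪原

results = [and_match(terms, wanted), and_match(terms, noise)]
```

Both entries of `results` are `True`, because the AND condition only checks that the terms occur, not which word 「白」 modifies.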

In view of this increase in search noise in full-text databases, the object of the present invention is to reduce search noise when searching document information, without increasing search omissions.

Means for solving the problem

To achieve the above object, in the present invention the user sets the search condition by inputting a sentence; this condition sentence is analyzed to extract syntax information, and the document information is searched using this syntax information as the search condition.

Operation

The configuration above operates as follows. The condition sentence input by the user to set the search condition is analyzed and its syntax information is extracted. Next, morpheme character strings suited to the search condition are selected from the syntax information of the condition sentence, character strings that express the same concept as those morphemes are taken from a synonym dictionary, and these character strings are used to build a search expression that combines logical OR and logical AND between character strings. A primary search by character string matching is performed on the document database using this search expression. The sentences in the primary search result are then analyzed to extract their syntax information, this information is collated against the information obtained from the condition sentence, and the documents containing sentences that matched are output as the final search result.
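The two-stage flow can be sketched as follows. This is a toy under loud assumptions: it uses space-separated English words, it omits synonym expansion, and `adjacent_pairs` is a crude stand-in for the syntactic analysis (it records neighbouring word pairs, not real dependencies). All names are hypothetical.

```python
def adjacent_pairs(sentence):
    """Toy 'analysis': ordered pairs of neighbouring words (NOT a real parser)."""
    words = sentence.split()
    return set(zip(words, words[1:]))


def two_stage_search(condition, corpus, analyze=adjacent_pairs):
    """corpus is a list of (doc_id, sentence) pairs.

    Returns (primary, final): document ids after stage 1 and after stage 2.
    """
    cond_syntax = analyze(condition)
    terms = set(condition.split())

    # Stage 1: primary search by plain term matching (logical AND of terms).
    candidates = [(doc_id, s) for doc_id, s in corpus if terms <= set(s.split())]

    # Stage 2: keep only candidates whose structure covers the condition's.
    final = []
    for doc_id, sentence in candidates:
        if cond_syntax <= analyze(sentence) and doc_id not in final:
            final.append(doc_id)
    return [doc_id for doc_id, _ in candidates], final


corpus = [
    ("doc1", "a white dog runs"),
    ("doc2", "a dog crosses white snow"),
    ("doc3", "a black cat sleeps"),
]
primary, final = two_stage_search("white dog", corpus)
```

Stage 1 keeps both `doc1` and `doc2`, since each contains "white" and "dog"; stage 2 drops `doc2`, where "white" is not attached to "dog". The cheap string match limits how many sentences the more expensive analysis has to see.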

Embodiment

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram of a document search device according to one embodiment of the present invention. In FIG. 1, 101 is search condition input means, consisting of a keyword or the like, with which the user inputs a search range and a condition sentence for setting the search condition; 102 is analysis means that analyzes a sentence and extracts syntax information; 103 is search expression generation means; 104 is a synonym dictionary; 105 is search expression editing means with which the user edits the search expression generated by the search expression generation means 103; 106 is character string search means that retrieves, by character string matching, sentences satisfying the search condition of the search expression; 107 is a document database in which document information is stored; 108 is syntax matching means that determines whether a candidate sentence obtained from the character string search means 106 has the same syntax information as the condition sentence; and 109 is result display means that displays the final search result and accepts subsequent processing requests.

The syntax information handled by the document search device of this embodiment is expressed as a triple consisting of two pieces of phrase (bunsetsu) information and the relation between them, or as a set of such triples. Only two kinds of relation between phrases are handled, the subject-predicate relation and the modifier-modified relation; no other relations are used. A dependency is expressed by letting the phrase information in the second term of the triple correspond to the dependent and the phrase information in the third term correspond to the head. FIG. 2 shows the syntax information expressed in BNF (Backus-Naur Form).
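FIG. 2 is not reproduced in this text, so the exact grammar is unknown. As a minimal sketch of one possible in-memory representation, assuming the relation is the first term of the triple and that phrase information is a sequence of morpheme strings, the structure could look like this. All class and field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Tuple


class Relation(Enum):
    """The only two relations the embodiment uses."""
    SUBJECT_PREDICATE = "subject-predicate"
    MODIFIER_MODIFIED = "modifier-modified"


@dataclass(frozen=True)
class Phrase:
    """Phrase (bunsetsu) information, modelled here as morpheme strings."""
    morphemes: Tuple[str, ...]


@dataclass(frozen=True)
class Triple:
    relation: Relation  # first term (assumed position)
    dependent: Phrase   # second term: the dependent side of the dependency
    head: Phrase        # third term: the head side of the dependency


# Syntax information is a single triple or a set of triples.
SyntaxInfo = FrozenSet[Triple]

# 「白い犬」 ("white dog"): 白い modifies 犬. The segmentation is an assumption.
white_dog: SyntaxInfo = frozenset({
    Triple(Relation.MODIFIER_MODIFIED, Phrase(("白", "い")), Phrase(("犬",))),
})
```

Frozen dataclasses are hashable, which lets triples be stored in sets and compared directly during matching.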

Next, the specific operation of the document search device of this embodiment is described.

The user inputs a sentence, or part of a sentence, that is expected to appear in the desired document as condition sentence a to the analysis means 102 via the search condition input means 101. The user also inputs the search range b to the search expression generation means 103 via the search condition input means 101.

The analysis means 102 performs morphological analysis and syntactic analysis on the condition sentence a, extracts syntax information c, and inputs it to the search expression generation means 103 and the syntax matching means 108. If the analysis yields more than one piece of syntax information, the user is informed via the search condition input means 101, and processing continues once the user has selected the one piece of syntax information that seems appropriate. If the sentence cannot be analyzed, the user is asked via the search condition input means 101 to input a new condition sentence.

The search expression generation means 103 takes morphemes suited to the search condition, for example independent (content) words, from the syntax information of the condition sentence a, adds words with the same concept as each morpheme using the synonym dictionary 104, and then generates a search expression d. The search expression d consists of character strings representing morphemes, symbols representing the logical OR and logical AND relations among those strings, and the search range b input via the search condition input means 101. The generated search expression d is presented to the user via the search expression editing means 105, and the user edits it as needed. The finalized search expression d is input to the character string search means 106, and the synonym information e contained in the search expression is input to the syntax matching means 108.
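One way to picture the generated search expression is as a logical AND over groups, where each group is the logical OR of a content word and its synonyms. The sketch below assumes that shape; the dictionary layout, the function names, and the textual `AND`/`OR` rendering are illustrative choices, not the patent's notation.

```python
def make_search_expression(content_words, synonyms, search_range):
    """Build an AND-of-ORs expression: one OR group per content word."""
    groups = []
    for word in content_words:
        extra = [s for s in synonyms.get(word, []) if s != word]
        groups.append([word] + extra)
    return {"range": search_range, "all_of": groups}


def render(expression):
    """Render the expression as text so a user could review and edit it."""
    parts = []
    for group in expression["all_of"]:
        parts.append(group[0] if len(group) == 1 else "(" + " OR ".join(group) + ")")
    return " AND ".join(parts)


# Hypothetical synonym dictionary with a single entry.
synonyms = {"走": ["駆け"]}
expression = make_search_expression(["白", "大き", "犬", "走"], synonyms, "all documents")
text = render(expression)
```

With these inputs `text` is `白 AND 大き AND 犬 AND (走 OR 駆け)`. Keeping the synonym groups explicit also makes it easy to hand the synonym information on to a later matching step.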

The character string search means 106 performs character string matching between the contents of the documents within the specified search range of the document database 107 and the search condition specified by the search expression d, and retrieves the sentences that satisfy the condition. The search result is a set of pairs, each consisting of a candidate sentence f and a document identifier g that identifies the document to which the candidate sentence belongs. The candidate sentences f are input to the analysis means 102 one sentence at a time, and at the same time the document identifier g is input to the syntax matching means 108.
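The primary search can be sketched as a sentence-level scan that returns (candidate sentence, document identifier) pairs. The sketch assumes sentences end with 「。」, that the search range is a set of document identifiers, and that the expression is a list of OR groups combined by AND; the documents are invented for illustration.

```python
import re


def string_search(groups, documents, search_range=None):
    """Return (sentence, doc_id) pairs whose sentence satisfies every OR group.

    groups: list of lists; a sentence must contain at least one term per group.
    documents: dict mapping doc_id to text.
    search_range: optional set of doc_ids to restrict the search to.
    """
    hits = []
    for doc_id, text in documents.items():
        if search_range is not None and doc_id not in search_range:
            continue
        # Split on the Japanese full stop, keeping it with each sentence.
        for sentence in re.findall(r"[^。]+。?", text):
            if all(any(term in sentence for term in group) for group in groups):
                hits.append((sentence, doc_id))
    return hits


# Hypothetical documents.
documents = {
    "D1": "白い大きな犬が駆ける。空は青い。",
    "D2": "白い雪原を大きな犬が走る。",
    "D3": "黒い犬が走る。",
}
groups = [["白"], ["大き"], ["犬"], ["走", "駆け"]]
hits = string_search(groups, documents)
```

Here `D2` passes the string search even though 「白」 modifies 「雪原」 rather than 「犬」, so it is exactly the kind of candidate that needs a further check.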

The analysis means 102 performs morphological analysis and syntactic analysis on the candidate sentence f, extracts syntax information h, and inputs it to the syntax matching means 108. If the analysis yields more than one piece of syntax information, a previously specified number of them are input to the syntax matching means 108, in order starting from the most plausible.
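The text does not say how plausibility is measured. Assuming, purely for illustration, that the analyzer attaches a numeric score to each parse, selecting the preset number of parses could look like this; `top_parses` and the scores are hypothetical.

```python
def top_parses(scored_parses, limit):
    """Keep the `limit` most plausible parses.

    scored_parses: list of (score, parse) pairs; a higher score means more plausible.
    """
    ranked = sorted(scored_parses, key=lambda item: item[0], reverse=True)
    return [parse for _score, parse in ranked[:limit]]


parses = [(0.2, "parse C"), (0.7, "parse A"), (0.5, "parse B")]
selected = top_parses(parses, limit=2)
```

Capping the number of parses bounds the work done per candidate sentence, at the cost of possibly discarding a correct but low-ranked reading.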

The syntax matching means 108 collates the syntax information c of the condition sentence a, previously input from the analysis means 102, and the synonym information e, input from the search expression generation means 103, against the syntax information h of the candidate sentence f input from the analysis means 102. If the syntax information of the candidate sentence f satisfies the syntax information of the condition sentence a, syntax matching is deemed successful, and the document identifier g of the document to which the candidate sentence f belongs, which was input from the character string search means 106, is input to the result display means 109 as document identifier i. If the candidate sentence f has more than one piece of syntax information, the candidate sentence f is deemed to match the condition sentence a if any one of them satisfies the syntax information of the condition sentence.
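The matching rule can be sketched as follows. This is one reading of "satisfies", stated as an assumption: every triple of the condition sentence must have a counterpart in the candidate parse with the same relation and with content words that are identical or synonymous. Triples are simplified to (relation, dependent content word, head content word), and all candidate parses are invented, since the figures are not reproduced here.

```python
def same_concept(a, b, synonyms):
    """True if the two content words are identical or listed as synonyms."""
    return a == b or b in synonyms.get(a, ()) or a in synonyms.get(b, ())


def parse_satisfies(condition, candidate, synonyms):
    """Every condition triple needs a counterpart in the candidate parse."""
    return all(
        any(
            rel == rel2
            and same_concept(dep, dep2, synonyms)
            and same_concept(head, head2, synonyms)
            for rel2, dep2, head2 in candidate
        )
        for rel, dep, head in condition
    )


def syntax_match(condition, candidate_parses, synonyms):
    """A candidate sentence matches if ANY of its parses satisfies the condition."""
    return any(parse_satisfies(condition, p, synonyms) for p in candidate_parses)


synonyms = {"走": {"駆け"}}
# Condition 「白い大きな犬が走る」 ("a big white dog runs").
condition = {("mod", "白", "犬"), ("mod", "大き", "犬"), ("subj", "犬", "走")}

# Hypothetical candidates; each is a list of alternative parses.
cand_ok = [{("mod", "大き", "犬"), ("mod", "白", "犬"), ("subj", "犬", "駆け")}]
cand_snow = [{("mod", "白", "雪原"), ("mod", "大き", "犬"), ("subj", "犬", "走")}]
cand_ambiguous = [
    {("mod", "白", "家"), ("mod", "大き", "家"), ("mod", "家", "犬"), ("subj", "犬", "走")},
    {("mod", "白", "家"), ("mod", "大き", "犬"), ("mod", "家", "犬"), ("subj", "犬", "走")},
]

results = [syntax_match(condition, c, synonyms) for c in (cand_ok, cand_snow, cand_ambiguous)]
```

Only the first candidate matches: it differs from the condition only by a synonym, whereas in the other two 「白」 never depends on 「犬」 in any parse.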

The result display means 109 presents the document identifier i from the syntax matching means 108 to the user. In accordance with instructions from the user, it also performs processing such as retrieving the contents of the document identified by i from the document database 107 and displaying them, or transferring control to the search condition input means 101 or the search expression editing means 105 so that a new search can be carried out.

The above is the operation of the document search device of this embodiment. The operation is now explained with an example sentence. Suppose the condition sentence is 「白い大きな犬が走る」 ("a big white dog runs"). The syntax information obtained by the analysis means 102 is then as shown in FIG. 3. In FIG. 3, the morpheme information is the character string representing the morpheme in the sentence. The search condition generated by the search expression generation means 103 is obtained by taking only the independent words from the syntax information of the condition sentence. That is, the search condition is "a sentence that contains {白} (white) and {大き} (big) and {犬} (dog), and also contains {走} (run) or {駆け} (dash)". Here {駆け} has been added as a synonym of {走}.

The character string search means 106 retrieves sentences satisfying this search condition from the document database 107 by character string matching. Suppose this search yields the candidate sentences shown in FIG. 4. The syntax matching means 108 performs matching using the relations between pieces of phrase information, and matching using the information about independent words among the morpheme information that makes up each phrase. The syntax information of candidate sentences 1, 2, and 3 in FIG. 4 is as shown in FIG. 5, FIG. 6, and FIG. 7, respectively. In FIG. 7, two kinds of syntax information have been obtained.

In FIG. 5, {大き}, {犬}, and {白} appear in the phrase information of the condition sentence shown in FIG. 3, and the relations between the pieces of phrase information containing them are also the same. In addition, {駆け} is a synonym of {走}, and the relation between the phrase information containing {犬} and the phrase information containing {駆け} is the same as the relation, in the syntax information of the condition sentence, between the phrase information containing {犬} and the phrase information containing {走}. Candidate sentence 1 therefore matches the condition sentence syntactically, and the document identifier of the document to which candidate sentence 1 belongs is sent to the result display means 109. In the syntax information of candidate sentences 2 and 3, shown in FIG. 6 and FIG. 7 respectively, either the phrase information containing {白} or the phrase information containing {大き} does not depend on the phrase information containing {犬}, so syntax matching fails. Thus only candidate sentence 1 is displayed on the result display means 109 as the desired sentence.

Effects of the invention

As described above, the present invention uses the syntax information obtained by analyzing the condition sentence input by the user for setting the search condition, together with a synonym dictionary, to build a search expression that combines logical OR and logical AND between character strings, and performs a primary search of the document database by character string matching using this expression, which suppresses any increase in search omissions. It then collates the syntax information obtained by analyzing the sentences in the primary search result against the syntax information obtained from the condition sentence, and outputs as the final search result the documents containing sentences that matched, which reduces search noise.

[Brief description of drawings]

FIG. 1 is a block diagram of a document search device according to one embodiment of the present invention. FIG. 2 is a diagram showing an example of the syntax information handled in the embodiment, expressed in BNF (Backus-Naur Form). FIG. 3 is a diagram showing the syntax information of an example condition sentence used in the present invention. FIG. 4 is a diagram showing examples of candidate sentences used in the present invention. FIG. 5, FIG. 6, and FIG. 7 are diagrams showing the syntax information of the respective candidate sentences shown in FIG. 4.

101: search condition input means; 102: analysis means; 103: search expression generation means; 104: synonym dictionary; 105: search expression editing means; 106: character string search means; 107: document database; 108: syntax matching means; 109: result display means.

Claims

(57) [Claims]

1. A document search device comprising: search condition input means with which a user inputs a search range and a condition sentence for setting a search condition; analysis means for analyzing a sentence and extracting syntax information; a synonym dictionary; search expression generation means for generating a search expression using the syntax information of the condition sentence obtained from the analysis means and the synonym dictionary; character string search means for retrieving sentences having character strings that satisfy the search condition of the search expression obtained from the search expression generation means; a document database in which document information is stored; syntax matching means for collating the syntax information obtained by having the analysis means analyze a sentence obtained by the character string search means against the syntax information of the condition sentence, and outputting the document identifier corresponding to a sentence for which the collation succeeded; and display means for displaying, in accordance with an instruction from the user, the document corresponding to the document identifier obtained from the syntax matching means.