JPH08339383A

JPH08339383A - Document retrieving device and dictionary preparing device

Info

Publication number: JPH08339383A
Application number: JP7146680A
Authority: JP
Inventors: Yasutsugu Ogawa; 泰嗣小川
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1995-04-11
Filing date: 1995-06-14
Publication date: 1996-12-24

Abstract

PURPOSE: To generate a good search condition automatically from a natural-language search request for document data, even when the request contains many independent words and adjunct words. CONSTITUTION: When a search request for document data is input in natural language to a request input means 4, a language analysis means 5 extracts independent words and adjunct words from the request. A condition generation means 6 converts each adjunct word individually into one of a plurality of preset operators and combines it with the corresponding independent word to generate a search condition, and a data retrieval means 7 retrieves document data from a database 2 using this condition. Because each of several adjunct words is converted into an appropriate operator, and even many independent words are properly combined by those operators, an optimum search condition is generated from the natural-language search request input by the user.

Description

Detailed Description of the Invention

【０００１】[0001]

【Field of Industrial Application】The present invention relates to a document retrieval device that retrieves document data from a database, and to a dictionary creation device that creates the before/after-division dictionary used by that document retrieval device.

【０００２】[0002]

【Prior Art】Document retrieval devices such as document management systems and image management systems have a database in which document data, image data, and the like are stored in advance as a plurality of document data items, and they can retrieve and output desired document data from this database. Such data retrieval is generally executed by setting a search condition that combines words and operators, but setting such a condition is cumbersome, and general users are sometimes unable to set an appropriate one.

【０００３】To solve this problem, the document retrieval device disclosed in Japanese Patent Laid-Open No. 6-75996 lets the user input a search request in natural language; the request is expanded into synonymous expressions and document data is then retrieved. More specifically, when the user inputs a search request for document data in natural language, the request is morphologically analyzed. The analyzed request is then expanded into synonymous expressions, and a search condition is generated from the analyzed request together with its synonymous expressions. Document data in the database is searched by matching against this condition, and the matching document data is read from the database.

【０００４】Japanese Patent Laid-Open No. 6-75996 discloses the following methods for expanding a natural-language search request into synonymous expressions before generating the search condition.

【０００５】1. Decomposition of the search request: when the search request is a compound word "A+B" or a phrase "AのB" ("B of A") or "AをBする" ("to B A"), the synonymous expressions are "A AND B", "A", and "B".

【０００６】2. Combination of the search request: when the search request is a phrase "AのB" or "AをBする", the synonymous expression is the compound word "A+B".

【０００７】3. Deletion of auxiliary verbs and particles: when the search request is "Aする" ("to do A"), "A + auxiliary verb", or "A + particle", the synonymous expression is "A".

【０００８】4. Deletion of inflectional endings: when the search request is a phrase ending in an inflected word (verb, adjective, or adjectival verb), the synonymous expression is that word with its inflectional ending removed.

【０００９】[0009]

【Problems to Be Solved by the Invention】In the document retrieval device described above, the search request input in natural language is expanded into synonymous expressions, so document data can be retrieved over a wider range than the user specified.

【００１０】However, when expanding a search request into synonymous expressions to generate a search condition, that device handles only requests of the form "A+B", "AのB" ("B of A"), or "AをBする" ("to B A"), so it cannot generate an appropriate search condition when the request consists of many words.

【００１１】Moreover, when such a search request is decomposed into "A" and "B" to generate a search condition, the only conditions generated are "A", "B", and "A AND B", so the device cannot respond broadly to the user's search requests.

【００１２】[0012]

【Means for Solving the Problems】The document retrieval device of the invention according to claim 1 comprises: a database in which a plurality of document data items are stored in advance; request input means for inputting a search request for document data in natural language; language analysis means for extracting independent words and adjunct words, by language analysis, from the search request input in natural language; condition generation means for generating a search condition by individually converting each extracted adjunct word into one of a plurality of preset operators and combining it with the corresponding independent word; and data retrieval means for retrieving document data from the database using the generated search condition.

【００１３】In the document retrieval device of the invention according to claim 2, in the device of claim 1, the operators of the condition generation means are set as AND, OR, and AND NOT.

【００１４】In the document retrieval device of the invention according to claim 3, the device of claim 1 or 2 is provided with compound word detection means for detecting a compound word among the independent words extracted from the search request, and with compound word division means for dividing the detected compound word into a plurality of constituent words.

【００１５】In the document retrieval device of the invention according to claim 4, the device of claim 3 is provided with a constituent word dictionary in which constituent words of compound words are stored in advance, and the compound word division means divides a compound word into a plurality of constituent words by referring to this constituent word dictionary.

【００１６】In the document retrieval device of the invention according to claim 5, the device of claim 3 is provided with a division character dictionary in which character strings located at division points of compound words are stored in advance, and the compound word division means divides a compound word into a plurality of constituent words by referring to this division character dictionary.

【００１７】In the document retrieval device of the invention according to claim 6, the device of claim 3 is provided with a division permission table in which the permissibility of each combination of character types located before and after a division point of a compound word is preset, and the compound word division means divides a compound word into a plurality of constituent words by referring to this division permission table.

【００１８】In the document retrieval device of the invention according to claim 7, the device of claim 3 is provided with a before/after-division dictionary in which characters located immediately before a division point of a compound word and characters located immediately after it are individually stored in advance, and the compound word division means divides a compound word into a plurality of constituent words by referring to this before/after-division dictionary.

【００１９】In the document retrieval device of the invention according to claim 8, in the device of claim 7, a probability of being located at the end of a word and a probability of being located at the beginning of a word are set for each character stored in the before/after-division dictionary, and the compound word division means divides a compound word into a plurality of constituent words by referring to these probabilities.

【００２０】In the document retrieval device of the invention according to claim 9, in the device of claim 7, the compound word division means divides into constituent words only those compound words that consist of a preset specific character type.

【００２１】In the document retrieval device of the invention according to claim 10, the device of any one of claims 1 to 9 is provided with: a synonym dictionary in which combinations of words that are synonyms are stored in advance; synonym detection means for detecting, among the independent words extracted from the search request, words stored in the synonym dictionary; and synonym expansion means for expanding each detected independent word into its other synonyms.

【００２２】The dictionary creation device of the invention according to claim 11 comprises: word acquisition means for acquiring words from document data; character classification means for classifying the characters that constitute the acquired words; frequency calculation means for calculating, for each classified character, an appearance frequency with which it appears in words, a tail frequency with which it is located at the end of a word, and a head frequency with which it is located at the beginning of a word; probability calculation means for calculating, for each character, a tail probability, which is the ratio of the tail frequency to the appearance frequency, and a head probability, which is the corresponding ratio of the head frequency; and character setting means for setting, in the before/after-division dictionary, characters with a high tail probability as characters located immediately before a division point of a compound word and characters with a high head probability as characters located immediately after a division point.

【００２３】In the dictionary creation device of the invention according to claim 12, the device of claim 11 is provided with character determination means for identifying characters whose calculated appearance frequency is low, and with probability setting means for replacing the tail probability and head probability of each such character with the average tail probability and the average head probability of all characters, respectively.

【００２４】[0024]

【Operation】In the document retrieval device of the invention according to claim 1, when a search request for document data is input in natural language to the request input means, the language analysis means extracts independent words and adjunct words from the request by language analysis. The condition generation means generates a search condition by individually converting each extracted adjunct word into one of a plurality of preset operators and combining it with the corresponding independent word, and the data retrieval means retrieves document data from the database using the generated condition. Even when there are several adjunct words, each is converted into an appropriate operator, and even when there are many independent words, they are appropriately combined by the operators, so an optimum search condition is generated from the natural-language search request input by the user.

【００２５】In the document retrieval device of the invention according to claim 2, each of the adjunct words extracted from the search request is individually converted into one of AND, OR, and AND NOT, so the search request the user intends can be appropriately reflected in the search condition.

【００２６】In the document retrieval device of the invention according to claim 3, the compound word detection means detects a compound word among the independent words extracted from the search request, and the compound word division means divides the detected compound word into a plurality of constituent words. Thus, even if an independent word in the user's search request is a compound word unsuited to data retrieval, it is divided into constituent words suited to data retrieval.

【００２７】In the document retrieval device of the invention according to claim 4, when dividing a compound word into a plurality of constituent words, the compound word division means refers to the constituent words preset in the constituent word dictionary, so the compound word is divided into the anticipated constituent words.
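The dictionary-based division of claim 4 can be illustrated with a short Python sketch. The text does not say how the constituent word dictionary is matched against a compound word, so the greedy longest-match strategy, the function name `split_by_dictionary`, and the sample dictionary entries below are all assumptions made for illustration only.

```python
def split_by_dictionary(compound, constituent_words):
    """Split `compound` into dictionary words by greedy longest match.

    Assumption: the patent only says the dictionary is "referred to";
    longest-match is one plausible strategy, not the disclosed method.
    Returns None when the compound cannot be covered by dictionary words.
    """
    longest = max(map(len, constituent_words), default=0)
    parts, pos = [], 0
    while pos < len(compound):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(longest, len(compound) - pos), 0, -1):
            candidate = compound[pos:pos + length]
            if candidate in constituent_words:
                parts.append(candidate)
                pos += length
                break
        else:
            return None  # no dictionary word starts at this position
    return parts


# Hypothetical dictionary contents, chosen to cover the example compound.
dictionary = {"発光式", "イメージ", "表示", "装置"}
print(split_by_dictionary("発光式イメージ表示装置", dictionary))
```

A greedy match never backtracks, so it can fail on inputs that a different segmentation would cover; a production splitter would likely need backtracking or dynamic programming.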

【００２８】In the document retrieval device of the invention according to claim 5, when dividing a compound word into a plurality of constituent words, the compound word division means refers to the character strings, preset in the division character dictionary, that are located at division points of compound words, so the compound word is divided at the positions of the anticipated character strings.

【００２９】In the document retrieval device of the invention according to claim 6, when dividing a compound word into a plurality of constituent words, the compound word division means refers to the division permission table, in which the permissibility of each combination of character types before and after a division point is preset, so the compound word is not divided at a position with an unsuitable combination of character types.
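As a rough illustration of the division permission table of claim 6, the following Python sketch classifies each character by type and looks up each adjacent pair of types in a table. The table contents, the type names, and the Unicode block ranges used for classification are assumptions; the text states only that permissibility is preset per combination of character types.

```python
def char_type(ch):
    """Classify a character by Unicode block (a simplifying assumption)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


# Hypothetical table: division is permitted only where the script changes
# between kanji and katakana. Pairs that are not listed are not permitted.
DIVISION_PERMITTED = {
    ("kanji", "katakana"): True,
    ("katakana", "kanji"): True,
}


def permitted_division_points(word, table):
    """Return the indices i at which `word` may be cut into word[:i] and word[i:]."""
    return [
        i
        for i in range(1, len(word))
        if table.get((char_type(word[i - 1]), char_type(word[i])), False)
    ]


print(permitted_division_points("発光式イメージ表示装置", DIVISION_PERMITTED))
```

With this particular table the example compound may be cut only at the kanji/katakana boundaries, so the table acts as a filter on candidate division points rather than as a complete splitter.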

【００３０】In the document retrieval device of the invention according to claim 7, when dividing a compound word into a plurality of constituent words, the compound word division means refers to the characters, individually preset in the before/after-division dictionary, that are located immediately before and immediately after division points, so the compound word is divided according to the anticipated characters on either side of the division point.

【００３１】In the document retrieval device of the invention according to claim 8, when dividing a compound word into a plurality of constituent words, the compound word division means refers to the probability of being located at the end of a word and the probability of being located at the beginning of a word that are set for each character stored in the before/after-division dictionary, so the compound word is divided according to the probabilities of the characters on either side of the division point.
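One way to picture division "according to the probabilities of the characters on either side of the division point" is the Python sketch below. The scoring rule (tail probability of the preceding character multiplied by head probability of the following character, compared with a threshold) is an assumption, as are the function name, the threshold value, and the sample probabilities; the text does not specify how the two probabilities are combined.

```python
def split_by_probability(word, tail_prob, head_prob, threshold=0.5):
    """Cut `word` wherever tail_prob[prev] * head_prob[next] >= threshold.

    Assumption: multiplying the two probabilities and thresholding is an
    illustrative choice. Characters missing from a table count as 0.0.
    """
    parts, start = [], 0
    for i in range(1, len(word)):
        score = tail_prob.get(word[i - 1], 0.0) * head_prob.get(word[i], 0.0)
        if score >= threshold:
            parts.append(word[start:i])
            start = i
    parts.append(word[start:])
    return parts


# Hypothetical probabilities for a few characters.
tail_prob = {"書": 0.9, "索": 0.8}
head_prob = {"文": 0.9, "検": 0.9}
print(split_by_probability("文書検索", tail_prob, head_prob))
```

Here only the boundary between 書 and 検 scores above the threshold (0.9 × 0.9), so the word is cut once, into 文書 and 検索.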

【００３２】In the document retrieval device of the invention according to claim 9, the compound word division means divides into constituent words only those compound words that consist of a preset specific character type, so wasteful division processing is avoided for compound words, such as those written in katakana, whose division points cannot be determined from the preceding and following characters.

【００３３】In the document retrieval device of the invention according to claim 10, the synonym detection means detects, among the independent words extracted from the search request, a word stored in the synonym dictionary, and the synonym expansion means expands the detected independent word into its other synonyms. Thus, even if an independent word in the user's search request is unsuited to data retrieval, any synonyms it has are also used for the retrieval.
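Synonym detection and expansion can be sketched in Python as a lookup over groups of synonymous words. The pair 明度/明るさ is the example the description itself gives for the synonym dictionary; the data layout, the function names, and the choice to join the expanded terms with OR are assumptions, since the text says only that the synonyms are also used for retrieval.

```python
# Synonym dictionary as groups of mutually synonymous words.
# The single group below is the example pair from the description.
SYNONYM_GROUPS = [{"明度", "明るさ"}]


def expand_synonyms(word, groups):
    """Return the word followed by its other synonyms, if any are registered."""
    for group in groups:
        if word in group:
            return [word] + sorted(group - {word})
    return [word]


def as_or_clause(terms):
    """Join expanded terms with OR (an assumed way to use them in a condition)."""
    return terms[0] if len(terms) == 1 else "(" + " OR ".join(terms) + ")"


print(as_or_clause(expand_synonyms("明度", SYNONYM_GROUPS)))
print(as_or_clause(expand_synonyms("色調", SYNONYM_GROUPS)))
```

A word with no registered synonyms passes through unchanged, so the expansion step can be applied uniformly to every independent word.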

【００３４】In the dictionary creation device of the invention according to claim 11, when the word acquisition means acquires words from document data, the character classification means classifies the characters that constitute the acquired words. For each classified character, the frequency calculation means calculates the appearance frequency with which it appears in words, the tail frequency with which it is located at the end of a word, and the head frequency with which it is located at the beginning of a word; the probability calculation means then calculates, for each character, the tail probability, which is the ratio of the tail frequency to the appearance frequency, and the head probability, which is the corresponding ratio of the head frequency. The character setting means sets, in the before/after-division dictionary, characters with a high tail probability as characters located immediately before a division point of a compound word and characters with a high head probability as characters located immediately after one, so the dictionary properly stores the characters located immediately before and immediately after division points.

【００３５】In the dictionary creation device of the invention according to claim 12, when the character determination means identifies a character whose calculated appearance frequency is low, the probability setting means replaces that character's tail probability and head probability with the average tail probability and the average head probability of all characters, respectively, so no low-confidence probabilities are set in the before/after-division dictionary.
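The frequency and probability calculations of claims 11 and 12 can be written out as a minimal Python sketch. The function names, the `min_count` cutoff that decides when an appearance frequency counts as "low", the threshold used to select "high" probabilities, and the three sample words are assumptions for illustration; the text does not give concrete values.

```python
from collections import Counter


def boundary_probabilities(words, min_count=2):
    """Compute per-character tail and head probabilities from a word list.

    tail probability = (times the character ends a word) / (times it appears)
    head probability = (times the character starts a word) / (times it appears)
    Characters appearing fewer than `min_count` times (assumed cutoff) get
    the averages over all characters instead of their own unreliable values.
    """
    appear, tail, head = Counter(), Counter(), Counter()
    for word in words:
        if not word:
            continue
        appear.update(word)      # every occurrence of every character
        tail[word[-1]] += 1      # character at the end of the word
        head[word[0]] += 1       # character at the beginning of the word
    if not appear:
        return {}, {}
    tail_p = {c: tail[c] / n for c, n in appear.items()}
    head_p = {c: head[c] / n for c, n in appear.items()}
    avg_tail = sum(tail_p.values()) / len(tail_p)
    avg_head = sum(head_p.values()) / len(head_p)
    for c, n in appear.items():
        if n < min_count:
            tail_p[c], head_p[c] = avg_tail, avg_head
    return tail_p, head_p


def boundary_characters(tail_p, head_p, threshold=0.8):
    """Pick characters to register as 'just before' / 'just after' a division point."""
    before = {c for c, p in tail_p.items() if p >= threshold}
    after = {c for c, p in head_p.items() if p >= threshold}
    return before, after


tail_p, head_p = boundary_probabilities(["検索", "検出", "文書"])
print(boundary_characters(tail_p, head_p))
```

In this tiny sample only 検 appears often enough to keep its own values (tail probability 0, head probability 1); the other characters fall back to the averages, which is the smoothing behavior claim 12 describes.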

【００３６】[0036]

【Embodiments】A first embodiment of the document retrieval device of the present invention is described below with reference to FIGS. 1 to 3. As shown in FIG. 1, the document retrieval device 1 of this embodiment has a document database 2 and a data processing unit 3. For example, the document database 2 consists of a large-capacity storage device such as an HD (Hard Disk) or MO (Magneto Optical Disk), and the data processing unit 3 consists of a personal computer or workstation. A large number of document data items are stored in advance in the document database 2, and each consists of a text file expressed in natural language.

【００３７】The data processing unit 3 has a request input unit 4, which is the request input means, and a language analysis unit 5, which is the language analysis means, is connected to this request input unit 4. A condition generation unit 6, which is the condition generation means, is connected to the language analysis unit 5, and a search processing unit 7, which is the data retrieval means, is connected to the condition generation unit 6. The search processing unit 7 is connected to the document database 2, and a result output unit 8, which is the result output means, is connected to it.

【００３８】The request input unit 4 has, for example, a data input device such as a keyboard on which character data can be entered, and accepts a search request for document data in natural language. As described in detail later, the language analysis unit 5 uses morphological analysis and syntactic analysis to detect independent words and adjunct words in the natural-language search request and to analyze the dependency relations among the clauses formed by these words. By linking the clauses according to these dependency relations, it generates a parse tree from the natural-language search request, as shown in FIG. 2.

【００３９】In the condition generation unit 6, the three operators AND, OR, and AND NOT are preset. The unit individually converts each extracted adjunct word into one of these three operators and, as shown in FIG. 3, combines it with the corresponding independent word, thereby generating the search condition as a search condition tree. The condition generation unit 6 has an operator correspondence table (not shown), in which the correspondences between natural-language adjunct words and the three operators are preset, such as "または (coordinating word) → OR" and "を (case particle) → AND". Note that "A AND NOT B" means containing "A" and not containing "B".

【００４０】The search processing unit 7, for example, searches the document database 2 for document data using the generated search condition and reads the retrieved document data from the document database 2. The result output unit 8 has, for example, a data output device such as a display or printer, and outputs the document data by displaying or printing it.

【００４１】With this configuration, when a search request for document data is input in natural language, the document retrieval device 1 of this embodiment automatically generates a search condition and retrieves document data from the document database 2. This processing is described step by step below.

【００４２】First, when the user formulates a search request for document data and inputs it in natural language to the request input unit 4, the language analysis unit 5 uses morphological and syntactic analysis to extract independent words and adjunct words from the request and to analyze the dependency relations among its clauses. More specifically, when the search request is input in natural language as "明度または色調を調整できる発光式イメージ表示装置" ("a light-emitting image display device whose brightness or color tone can be adjusted"), the independent words "明度, 色調, 調整, 発光式イメージ表示装置" (brightness, color tone, adjustment, light-emitting image display device) are extracted in order, and the adjunct words "または, を, できる" (or, the object particle, can) are extracted in order.

【００４３】By analyzing the dependency relations among the clauses formed by these words, a parse tree is generated from the natural-language search request, as shown in FIG. 2. In this parse tree, each clause is represented by a terminal node, and each word constituting a clause is drawn in a rectangular frame. The clause on which a given clause depends is the terminal node reached by descending from its nearest parent node through the right-hand child nodes in the figure (hereinafter called the rightmost child node); if that node is the clause itself, the parent nodes are traced upward in turn.

【００４４】More specifically, the first clause "明度または" ("brightness or") is the leftmost child node of the parse tree, and its constituent words are "明度" and "または". Because the rightmost child node of its parent node is not the clause itself, "明度または" depends on that rightmost child node, the clause "色調を" ("color tone" plus the object particle). For the clause "色調を", the rightmost child node of its parent node is the clause itself, so it depends on the rightmost child node of the parent one level up, the clause "調整できる" ("can be adjusted").
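The rule for finding the clause on which a given clause depends can be traced in code. The Python sketch below assumes a binary parse tree whose internal nodes always have two children and whose shape matches the worked example; the `Node` class and function names are hypothetical, and FIG. 2 is not reproduced here.

```python
class Node:
    """Binary parse-tree node; leaves carry a clause label (assumed layout)."""

    def __init__(self, label=None, left=None, right=None):
        self.label, self.left, self.right = label, left, right
        self.parent = None
        for child in (left, right):
            if child is not None:
                child.parent = self


def rightmost_leaf(node):
    """Follow right-hand children down to a terminal node."""
    while node.right is not None:
        node = node.right
    return node


def dependency_target(leaf):
    """Return the clause that `leaf` depends on, or None for the final clause."""
    ancestor = leaf.parent
    while ancestor is not None:
        candidate = rightmost_leaf(ancestor)
        if candidate is not leaf:
            return candidate
        ancestor = ancestor.parent  # rightmost child was the clause itself
    return None


# Tree assumed from the example: (((明度または 色調を) 調整できる) 発光式イメージ表示装置)
a, b = Node("明度または"), Node("色調を")
c, d = Node("調整できる"), Node("発光式イメージ表示装置")
root = Node(left=Node(left=Node(left=a, right=b), right=c), right=d)

for leaf in (a, b, c, d):
    target = dependency_target(leaf)
    print(leaf.label, "->", target.label if target else None)
```

Run on the example, 明度または resolves to 色調を and 色調を resolves to 調整できる, matching the two cases described above; the final clause has no target.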

【００４５】Next, the condition generation unit 6 individually converts each extracted adjunct word into one of the three operators AND, OR, and AND NOT and combines it with the corresponding independent word to generate the search condition. More specifically, the word selection means executes word selection processing, selecting the independent words "明度, 色調, 調整, 発光式イメージ表示装置" from the constituent words of the search request. The operator selection means then refers to the operator correspondence table and executes operator selection processing, individually converting the adjunct words "または, を, できる" into the operators "OR, AND, AND".

【００４６】The operators converted in this way are combined with the corresponding independent words, so, as shown in FIG. 3, a search condition tree is generated as the search condition, in which terminal nodes consisting of independent words are linked to parent nodes derived from the adjunct words. Expressed in text form, this search condition tree becomes (((brightness OR color tone) AND control) AND light-emitting image display device), which can be handled in the same way as an existing search condition. The search processing unit 7 then searches the document database 2 using the generated search condition and reads out the document data, and the result output unit 8 outputs the document data by displaying or printing it.
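The conversion from adjunct words to operators and the assembly of the search condition can be sketched in Python. The first two table rows come from the operator correspondence table described in paragraph 0039, and the row for できる is inferred from the conversion listed in paragraph 0045; the function names and the left-to-right folding, which happens to match the left-nested tree of this example, are assumptions rather than the disclosed procedure. The inputs are the independent and adjunct words listed in paragraphs 0042 and 0045.

```python
# Operator correspondence table. "または" -> OR and "を" -> AND are stated in
# the description; "できる" -> AND is inferred from the worked example.
OPERATOR_TABLE = {"または": "OR", "を": "AND", "できる": "AND"}


def build_condition(independent_words, adjunct_words):
    """Fold the words left to right into a nested (operator, left, right) tree.

    Assumption: adjunct word i joins everything built so far with
    independent word i + 1, which yields a left-nested tree.
    """
    if len(independent_words) != len(adjunct_words) + 1:
        raise ValueError("expected one more independent word than adjunct words")
    condition = independent_words[0]
    for adjunct, word in zip(adjunct_words, independent_words[1:]):
        condition = (OPERATOR_TABLE[adjunct], condition, word)
    return condition


def to_text(condition):
    """Render a condition tree in parenthesized text form."""
    if isinstance(condition, str):
        return condition
    operator, left, right = condition
    return f"({to_text(left)} {operator} {to_text(right)})"


tree = build_condition(
    ["明度", "色調", "調整", "発光式イメージ表示装置"],
    ["または", "を", "できる"],
)
print(to_text(tree))
```

A general implementation would presumably follow the dependency structure of the parse tree instead of a plain left fold, so that requests whose clauses do not nest strictly to the left are still grouped correctly.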
The search processing unit 7 searches and reads the document data from the document database 2 according to the generated search condition, and the result output unit 8 outputs the document data by displaying or printing.

【００４７】In the document retrieval device 1 of this embodiment, when the user inputs a search request in natural language as described above, an appropriate search condition is generated automatically and the document data is searched, so the user does not need to set a search condition. Moreover, even if the search request entered in natural language contains many independent words and adjunct words, each adjunct word is converted into an operator and combined with the corresponding independent word, so an appropriate search condition is generated and the document data is retrieved well. Furthermore, because the three operators "AND", "OR", and "AND NOT" are provided for generating search conditions, the device can handle a wide range of user search requests.

【００４８】Next, a second embodiment of the document retrieval device of the present invention will be described with reference to FIG. 4. In this second embodiment, parts identical to those of the first embodiment described above are given the same names and reference numerals, and their detailed description is omitted.

【００４９】First, in the document retrieval device 1 of this embodiment, the condition generation unit 6, which is the condition generation means, has a synonym dictionary, synonym detection means, and synonym expansion means. The synonym dictionary stores in advance combinations of words that are synonyms of one another, such as "明度" (lightness) and "明るさ" (brightness). The synonym detection means detects, among the independent words of the search request, words stored in the synonym dictionary. The synonym expansion means expands each detected independent word into its other synonyms, so that, as illustrated, the condition generation unit 6 can generate a search condition from the expanded synonyms.

【００５０】With this configuration, the document retrieval device 1 of this embodiment, like that of the first embodiment, automatically generates a search condition and searches the document data when a search request for document data is input in natural language.

【００５１】When the condition generation unit 6 generates a search condition from a search request, the synonym detection means detects, among the independent words of the search request, any word stored in the synonym dictionary, and the synonym expansion means expands the detected independent word into its other synonyms. For example, when the independent words of the search request include "明度" (lightness), "明るさ" (brightness) is detected as its synonym, and the two synonyms are connected by "OR". Document data that uses the term "明るさ" rather than "明度" is then also retrieved, so the document data can be searched well and a wide range of user search requests can be handled.
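
Synonym expansion of this kind can be illustrated with a short Python sketch. The names `SYNONYM_GROUPS` and `expand_synonyms` are hypothetical, and the dictionary holds only the one synonym pair given in the text.

```python
# Hypothetical synonym dictionary: each entry is a group of mutually synonymous words.
SYNONYM_GROUPS = [["明度", "明るさ"]]


def expand_synonyms(word, groups=SYNONYM_GROUPS):
    """Return an OR-connected condition covering the word and its synonyms."""
    for group in groups:
        if word in group:
            # Put the original word first, then the other synonyms in dictionary order.
            others = [w for w in group if w != word]
            return "(" + " OR ".join([word] + others) + ")"
    return word  # no synonym registered: the word is used as is


expanded = expand_synonyms("明度")
unchanged = expand_synonyms("色調")
print(expanded, unchanged)
```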

【００５２】Next, a third embodiment of the document retrieval device of the present invention will be described with reference to FIG. 5. In this third embodiment, parts identical to those of the first embodiment described above are given the same names and reference numerals, and their detailed description is omitted.

【００５３】First, in the document retrieval device 1 of this embodiment, the condition generation unit 6, which is the condition generation means, has character type determination means, compound word detection means, and compound word division means. The character type determination means determines the character type (kanji, katakana, hiragana, digit, alphabetic character, and so on) of each character of an independent word extracted from the search request, and the compound word detection means detects as a compound word any independent word in which the character type changes. The compound word division means divides the detected compound word into a plurality of constituent words, using the points where the character type changes as division points, so that, as illustrated, the condition generation unit 6 can generate a search condition from the divided constituent words.

【００５４】With this configuration, when the condition generation unit 6 of the document retrieval device 1 of this embodiment generates a search condition from a search request, the character type determination means determines the character types of the independent words of the search request. An independent word in which the character type changes is detected as a compound word by the compound word detection means and is then divided into a plurality of constituent words by the compound word division means.

【００５５】For example, when the independent words of the search request include "発光式イメージ表示装置" (light-emitting image display device), its character type changes in the order kanji / katakana / kanji, so it is treated as a compound word, divided into the constituent words "発光式" (light-emitting), "イメージ" (image), and "表示装置" (display device), and these are connected by the "AND" operator. In this case, the search retrieves not only document data in which "発光式イメージ表示装置" is written as a single term but also document data in which the three terms "発光式", "イメージ", and "表示装置" appear separately, so the document data can be searched well and a wide range of user search requests can be handled.
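
Division at character type changes can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and classifying characters by Unicode block is a simplification chosen for illustration, not something the source specifies.

```python
from itertools import groupby


def char_type(ch):
    """Classify one character by Unicode block (a simplification)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:  # includes the long-vowel mark "ー"
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alphabet"
    return "other"


def split_by_char_type(word):
    """Split a word at every point where the character type changes."""
    return ["".join(chars) for _, chars in groupby(word, key=char_type)]


parts = split_by_char_type("発光式イメージ表示装置")
condition = "(" + " AND ".join(parts) + ")"
print(parts, condition)
```

A word that yields a single part has no character type change and would not be treated as a compound word by this method.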

【００５６】Although this embodiment illustrates connecting the constituent words with the "AND" operator, they can also be connected with the "OR" operator.

【００５７】This embodiment also illustrates dividing a compound word, which is a single independent word, into constituent words that are themselves independent words; synonym expansion can additionally be executed after this division. In that case, as shown in FIG. 6, "発光式イメージ表示装置" is divided as a compound word into the constituent words "発光式", "イメージ", and "表示装置", and "イメージ" (image) is then also expanded into its synonym "画像" (picture), so the document data can be searched even better.

【００５８】Because the document retrieval device 1 described above identifies an independent word as a compound word and divides it into constituent words by determining character types, its processing is simple and its structure can be simplified. However, the present invention is not limited to the above embodiment, and the method of dividing a compound word is not limited to character type determination.

【００５９】For example, a constituent word dictionary in which the constituent words of compound words are stored in advance may be provided, and the compound word division means of the condition generation unit 6 may refer to it to divide a compound word into a plurality of constituent words. In this case, if the constituent word dictionary is set up appropriately, "表示装置" (display device), a compound word written in a single character type, can be divided into the constituent words "表示" (display) and "装置" (device), so the document data can be searched even better.

【００６０】When constituent words are detected in an independent word, the detected constituent words may overlap. For example, the constituent words "東京" (Tokyo) and "京都" (Kyoto) are both detected in the independent word "東京都" (Tokyo Metropolis), and they partly overlap. In such a case, both can be used as constituent words to widen the search range of the document data, but it is preferable to cancel the division of the independent word so that unsuitable document data is not retrieved. Alternatively, when both partly overlapping constituent words are adopted, the compound word can be divided according to each constituent word, the parts of each division connected with the "AND" operator, and the resulting divisions connected with the "OR" operator. For example, the independent word "東京都" becomes ((東京 AND 都) OR (東 AND 京都)). This widens the search range of the document data while keeping the retrieval of unsuitable document data to a minimum.
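
The handling of overlapping constituent words can be sketched in Python as follows. The function name `overlapping_split_condition` is hypothetical, and looking up each constituent word with a plain substring search is an assumption made for illustration.

```python
def overlapping_split_condition(word, constituent_words):
    """Build one AND-group per dictionary word found in `word`, joined by OR.

    Each found constituent word splits the original word into the part before
    it, the constituent word itself, and the part after it.
    """
    groups = []
    for constituent in constituent_words:
        start = word.find(constituent)
        if start < 0:
            continue  # this dictionary word does not occur in the input
        end = start + len(constituent)
        parts = [p for p in (word[:start], constituent, word[end:]) if p]
        groups.append("(" + " AND ".join(parts) + ")")
    if not groups:
        return word  # nothing found: keep the word undivided
    if len(groups) == 1:
        return groups[0]
    return "(" + " OR ".join(groups) + ")"


condition = overlapping_split_condition("東京都", ["東京", "京都"])
print(condition)
```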

【００６１】Furthermore, a division character dictionary in which character strings located at the division points of compound words are stored in advance may be provided, and the compound word division means of the condition generation unit 6 may refer to it to divide a compound word into a plurality of constituent words. For example, if the character string "示装" is registered in the division character dictionary, the single-character-type compound word "表示装置" is divided into the constituent words "表示" and "装置", so the document data can be searched even better.

【００６２】In addition, as shown in Table 1 below, a division permission table may be provided in which whether division is permitted is set in advance for each combination of character types located before and after a division point of a compound word, and the compound word division means of the condition generation unit 6 may refer to this table to divide a compound word into a plurality of constituent words.

【００６３】[0063]

【表１】 [Table 1]

【００６４】In this division permission table, "○" is set for combinations of character types at which division is permitted, and "×" is set for combinations at which it is not. For example, a compound word is divided at a hiragana / kanji boundary but not at a kanji / hiragana boundary.

【００６５】More specifically, when the search request "割込み信号入出力端子のあるマイクロプロセッサーシステム" (a microprocessor system having an interrupt signal input/output terminal) is input in natural language, "割込み信号入出力端子" (interrupt signal input/output terminal) and "マイクロプロセッサーシステム" (microprocessor system) are extracted as independent words and "のある" (having) is extracted as an adjunct word, as shown in FIG. 7. Next, as shown in FIG. 8, the extracted adjunct word "のある" is converted into the "AND" operator, and the extracted independent word "割込み信号入出力端子" is detected as a compound word because its character type changes in the order kanji / hiragana / kanji.

【００６６】The compound word "割込み信号入出力端子" is then divided into constituent words on the basis of the changes in character type. According to the division permission table, it is divided where the character type changes from hiragana to kanji but not where it changes from kanji to hiragana. As a result, the kanji / hiragana / kanji string "割込み信号入出力端子" is not divided into "割込", "み", and "信号入出力端子"; instead, as shown in FIG. 9, it is divided at the hiragana / kanji boundary into "割込み" (interrupt) and "信号入出力端子" (signal input/output terminal).
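
Division controlled by a permission table can be sketched in Python as follows. Table 1 itself is not reproduced in this text, so the table `SPLIT_ALLOWED` contains only the two entries stated in the description; the names, the Unicode-block classification, and the default of permitting all other type changes are assumptions made for illustration.

```python
def char_type(ch):
    """Classify one character by Unicode block (a simplification)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


# Hypothetical excerpt of the division permission table:
# (type before the boundary, type after the boundary) -> division permitted?
SPLIT_ALLOWED = {
    ("hiragana", "kanji"): True,
    ("kanji", "hiragana"): False,
}


def split_with_table(word, table=SPLIT_ALLOWED, default=True):
    """Split at character type changes that the table permits."""
    if not word:
        return []
    parts, current = [], word[0]
    for prev, ch in zip(word, word[1:]):
        before, after = char_type(prev), char_type(ch)
        if before != after and table.get((before, after), default):
            parts.append(current)  # permitted boundary: close the current part
            current = ch
        else:
            current += ch  # same type or forbidden boundary: keep going
    parts.append(current)
    return parts


parts = split_with_table("割込み信号入出力端子")
print(parts)
```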

【００６７】In other words, because the compound word division means refers to the division permission table described above, unsuitable divisions such as separating okurigana (trailing kana) from its kanji are prevented, and compound words are divided into constituent words appropriately. Even with division controlled in this way, however, the single-character-type compound word "信号入出力端子" cannot be divided into the constituent words "信号" (signal), "入出力" (input/output), and "端子" (terminal), so it is preferable to use the constituent word dictionary described above for such cases.

【００６８】Such a constituent word dictionary, however, inevitably requires an enormous number of words to be registered appropriately, and creating it is expected to be difficult. To solve this problem, a before/after-division dictionary may be provided, as shown in Table 2 below, in which characters located immediately before a division point of a compound word and characters located immediately after it are each stored in advance, and the compound word division means of the condition generation unit 6 may refer to this dictionary to divide a compound word into a plurality of constituent words.

【００６９】[0069]

【表２】 [Table 2]

【００７０】That is, because a compound word is divided into constituent words, the last character of a constituent word is located immediately before each division point and the first character of a constituent word is located immediately after it. Since the characters that tend to occur at the beginning or end of a word can be predicted, storing them in the before/after-division dictionary allows compound words to be divided well into constituent words.

【００７１】More specifically, as shown in Table 2 above, when "的, 化, 号, 力, …" are registered as characters immediately before a division point and "入, 出, 端, …" are registered as characters immediately after one, the compound word "信号入出力端子" described above is divided at the positions "号／入" and "力／端" into "信号", "入出力", and "端子". In other words, by having the compound word division means refer to the before/after-division dictionary, a single-character-type compound word can be divided well into constituent words without registering an enormous number of words.
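
Division with a before/after-division dictionary can be sketched in Python as follows. The sets `BEFORE` and `AFTER` are hypothetical dictionary contents, assigned so that they agree with the division positions "号／入" and "力／端" given in the text (word-final characters before the division point, word-initial characters after it). Requiring both conditions at once is an assumption that reproduces the stated result; the source does not spell out the rule.

```python
# Hypothetical contents of the before/after-division dictionary.
# BEFORE: characters that tend to end a word (found just before a division point).
# AFTER: characters that tend to start a word (found just after a division point).
BEFORE = {"号", "力", "的", "化"}
AFTER = {"入", "出", "端"}


def split_with_dictionary(word, before=BEFORE, after=AFTER):
    """Split where the left character is in `before` AND the right one is in `after`."""
    if not word:
        return []
    parts, current = [], word[0]
    for prev, ch in zip(word, word[1:]):
        if prev in before and ch in after:
            parts.append(current)
            current = ch
        else:
            current += ch
    parts.append(current)
    return parts


parts = split_with_dictionary("信号入出力端子")
print(parts)
```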

【００７２】However, even for such a before/after-division dictionary, the first and last characters of constituent words must be predicted and registered appropriately, and creating it requires specialist skill. To solve this problem, it is preferable to create the before/after-division dictionary automatically with a dictionary creation device.

【００７３】An embodiment of the dictionary creation device of the present invention will now be described with reference to FIG. 10. The dictionary creation device 11 of this embodiment stores characters, online or offline, in the before/after-division dictionary 12, which consists of a storage device such as an HD. For this purpose, the dictionary creation device 11 has a document input unit 13 as document input means, a word acquisition unit 14 as word acquisition means, a character detection unit 15 as character classification means, a frequency calculation unit 16 as frequency calculation means, a probability calculation unit 17 as probability calculation means, and a character setting unit 18 as character setting means, and the character setting unit 18 is connected to the before/after-division dictionary 12.

【００７４】The document input unit 13 consists of, for example, an electronic filing system, and inputs the document data of a text file written in natural language. The word acquisition unit 14 acquires words from the document data by language analysis such as morphological analysis and syntactic analysis, and the character detection unit 15 separates the acquired words into their constituent characters. For each separated character, the frequency calculation unit 16 calculates the appearance frequency (how often the character appears in words), the tail frequency (how often it is the last character of a word), and the head frequency (how often it is the first character of a word). For each separated character, the probability calculation unit 17 calculates the tail probability, which is the ratio of the tail frequency to the appearance frequency, and the head probability, which is the ratio of the head frequency to the appearance frequency. The character setting unit 18 registers characters with a high tail probability and characters with a high head probability in the before/after-division dictionary 12 as characters located immediately before and immediately after a division point of a compound word, respectively.

【００７５】With this configuration, when suitable document data is input to the dictionary creation device 11 of this embodiment as a sample, the characters located at the beginning and end of words are detected automatically. The detected characters are registered in the before/after-division dictionary 12 as characters located before and after division points of compound words, so the before/after-division dictionary 12 of the document retrieval device 1 can be created automatically.

【００７６】More specifically, when the document input unit 13 inputs the document data of a text file written in natural language, the word acquisition unit 14 acquires words from the document data by language analysis such as morphological analysis and syntactic analysis, as shown in Table 3 below.

【００７７】[0077]

【表３】 [Table 3]

【００７８】The character detection unit 15 then separates the acquired words into their constituent characters, and for each separated character the frequency calculation unit 16 calculates the appearance frequency, the tail frequency, and the head frequency. Taking the character "出" as an example, its appearance frequency, that is, its total number of occurrences in Table 3 above, is 12. Its tail frequency, the number of times "出" is the last character of a word, is 2, and its head frequency, the number of times it is the first character of a word, is 8.

【００７９】The probability calculation unit 17 then calculates, for each separated character, the tail probability (the ratio of the tail frequency to the appearance frequency) and the head probability (the ratio of the head frequency to the appearance frequency), as shown in Table 4 below. For the character "出" above, the appearance frequency is 12 and the tail frequency is 2, so the tail probability is 2/12 = 0.17; the head frequency is 8, so the head probability is 8/12 = 0.67.
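
The frequency and probability calculations can be sketched in Python as follows. The function names are hypothetical and the sample word list is invented for illustration; it is not the content of Table 3. The last lines reproduce the arithmetic quoted in the text for "出".

```python
from collections import Counter


def char_statistics(words):
    """Return {char: (appearance, tail, head)} counted over a list of words."""
    appearance, tail, head = Counter(), Counter(), Counter()
    for word in words:
        if not word:
            continue
        appearance.update(word)  # every character occurrence
        head[word[0]] += 1       # first character of the word
        tail[word[-1]] += 1      # last character of the word
    return {c: (appearance[c], tail[c], head[c]) for c in appearance}


def probabilities(appearance, tail, head):
    """Tail and head probabilities as ratios to the appearance frequency."""
    return tail / appearance, head / appearance


# Hypothetical sample words (not the contents of Table 3).
stats = char_statistics(["出力", "入出力", "輸出", "信号"])
out_stats = stats["出"]

# The figures quoted in the text for "出": 12 appearances, 2 tails, 8 heads.
tail_p, head_p = probabilities(12, 2, 8)
print(out_stats, round(tail_p, 2), round(head_p, 2))
```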

【００８０】[0080]

【表４】 [Table 4]

【００８１】The character setting unit 18 registers characters with a high tail probability and characters with a high head probability in the before/after-division dictionary 12 as characters located immediately before and immediately after a division point of a compound word, respectively. For example, if the threshold for this registration is 0.50, characters whose tail probability is higher than 0.50 are registered as characters located immediately before a division point, and characters whose head probability is higher than 0.50 are registered as characters located immediately after a division point. In this case, as shown in Table 4 above, "信, 入, 出, 端, …" are registered as characters located immediately after a division point, and "号, 力, 子, …" are registered as characters located immediately before one.

【００８２】In other words, the dictionary creation device 11 of this embodiment automatically detects the characters before and after division points of compound words from the document data input as a sample and registers them in the before/after-division dictionary 12, so the user does not have to perform tedious work that requires specialist skill. In particular, if the user inputs to the dictionary creation device 11, as a sample, document data similar in content to the document data to be searched with the document retrieval device 1, the before/after-division dictionary 12 can be created in the form best suited to the user's work.
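
The registration step that turns probabilities into dictionary entries can be sketched in Python as follows. The names `PROBABILITIES` and `build_dictionary` are hypothetical; the probability values for these five characters are the ones quoted elsewhere in this description, and the strict "higher than" comparison follows the wording of the threshold example.

```python
# (head probability, tail probability) per character, using the values the
# description quotes for these five characters.
PROBABILITIES = {
    "号": (0.00, 0.88),
    "入": (0.70, 0.30),
    "出": (0.67, 0.17),
    "端": (0.75, 0.25),
    "子": (0.00, 1.00),
}


def build_dictionary(probabilities, threshold=0.50):
    """Select characters whose tail or head probability exceeds the threshold."""
    # High tail probability: character tends to end a word (before a division point).
    before = {c for c, (_, tail) in probabilities.items() if tail > threshold}
    # High head probability: character tends to start a word (after a division point).
    after = {c for c, (head, _) in probabilities.items() if head > threshold}
    return before, after


before_chars, after_chars = build_dictionary(PROBABILITIES)
print(sorted(before_chars), sorted(after_chars))
```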

【００８３】When the before/after-division dictionary 12 is created by the dictionary creation device 11 as described above, the larger the amount of document data input as a sample, the more accurate the contents of the dictionary become, but the workload and time also increase. Conversely, a small amount of document data reduces the workload and time, but the before/after-division dictionary 12 then cannot be expected to be accurate.

【００８４】Where this is a problem, it is preferable to provide the dictionary creation device 11 with a character determination unit as character determination means and a probability setting unit as probability setting means. The character determination unit identifies characters with a low appearance frequency, and the probability setting unit replaces the tail probability and head probability of each such character with the average tail probability and the average head probability of all characters, respectively.

【００８５】That is, for a character with a low appearance frequency, the head probability and tail probability calculated from the ratios of the head frequency and tail frequency to the appearance frequency are not sufficiently reliable. In such a case, replacing the character's head and tail probabilities with the averages over all characters prevents unsuitable probabilities from being set.

【００８６】More specifically, in Tables 3 and 4 above, the character "子" appears only in the three occurrences of "端子" (terminal), so its head probability is 0.00 and its tail probability is 1.00. This, however, is a coincidence arising from the low appearance frequency of "子" in sample document data about electronic equipment. Words beginning with "子" do exist, such as "子供" (child), "子機" (extension handset), and "子ノード" (child node), so setting its head probability to 0.00 is not appropriate.

【００８７】In this case, the appearance frequency of all characters together is 40, the tail frequency is 14, and the head frequency is 14, so the average tail probability of all characters is 14/40 = 0.35 and the average head probability is likewise 0.35. Replacing the probabilities of characters with a low appearance frequency with these averages prevents unsuitable probabilities from being set and improves the accuracy of the before/after-division dictionary 12.
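
Replacing the probabilities of rare characters with the overall averages can be sketched in Python as follows. The function name, the cutoff `min_appearance`, and the aggregate entry "その他" (standing in for all remaining characters so that the totals match the 40 / 14 / 14 quoted in the text) are assumptions made for illustration; the counts for "出" and "子" are the ones given in the description.

```python
def smooth_low_frequency(stats, min_appearance):
    """Replace probabilities of rare characters with the pooled averages.

    `stats` maps each character to (appearance, tail, head) counts.
    Returns {char: (tail probability, head probability)}.
    """
    total_app = sum(a for a, _, _ in stats.values())
    total_tail = sum(t for _, t, _ in stats.values())
    total_head = sum(h for _, _, h in stats.values())
    avg_tail = total_tail / total_app
    avg_head = total_head / total_app
    result = {}
    for char, (app, tail, head) in stats.items():
        if app < min_appearance:
            result[char] = (avg_tail, avg_head)  # too rare: use the averages
        else:
            result[char] = (tail / app, head / app)
    return result


stats = {
    "出": (12, 2, 8),
    "子": (3, 3, 0),
    "その他": (25, 9, 6),  # hypothetical aggregate of all other characters
}
smoothed = smooth_low_frequency(stats, min_appearance=5)
print(smoothed["子"], smoothed["出"])
```

With these inputs the head probability of "子" becomes 0.35 instead of 0.00, so words that begin with "子" are no longer ruled out entirely.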

【００８８】By registering in the before/after-division dictionary 12 the characters located immediately before and after division points of compound words in this way, the compound word division means of the document retrieval device 1 can divide compound words well into constituent words. It is also possible to register the tail and head probabilities calculated by the dictionary creation device 11 in the before/after-division dictionary 12 together with the characters, and to have the compound word division means of the document retrieval device 1 refer to these probabilities when dividing a compound word into constituent words.

【００８９】More specifically, because a head probability and a tail probability are set for each character in the before/after-division dictionary 12, as shown in Table 4 above, the document retrieval device 1 multiplies the tail probability of the character before a candidate division point by the head probability of the character after it to obtain a division evaluation value, and compares this value with a threshold to decide whether to divide. For example, when the compound word "信号入出力端子" is divided, the division evaluation values are calculated as follows:
信／号 → 0.00 × 0.00 = 0.00
号／入 → 0.88 × 0.70 = 0.61
入／出 → 0.30 × 0.67 = 0.20
出／力 → 0.17 × 0.06 = 0.01
力／端 → 0.50 × 0.75 = 0.38
端／子 → 0.25 × 0.00 = 0.00
Comparing these division evaluation values with the threshold 0.30, the positions "号／入" and "力／端" become division points, and the compound word is divided into "信号", "入出力", and "端子".
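
The division evaluation value can be sketched in Python as follows. The probabilities in `TAIL` and `HEAD` are the ones quoted in the worked example; the function name, the treatment of unregistered characters as probability 0, and the use of "greater than or equal to" for the threshold comparison are assumptions made for illustration.

```python
# Tail and head probabilities quoted in the worked example.
TAIL = {"信": 0.00, "号": 0.88, "入": 0.30, "出": 0.17, "力": 0.50, "端": 0.25}
HEAD = {"号": 0.00, "入": 0.70, "出": 0.67, "力": 0.06, "端": 0.75, "子": 0.00}


def split_by_score(word, tail=TAIL, head=HEAD, threshold=0.30):
    """Split where tail(left char) * head(right char) reaches the threshold."""
    if not word:
        return [], []
    scores, parts, current = [], [], word[0]
    for prev, ch in zip(word, word[1:]):
        # Characters missing from the dictionary get probability 0 (an assumption).
        score = tail.get(prev, 0.0) * head.get(ch, 0.0)
        scores.append((prev + "／" + ch, score))
        if score >= threshold:
            parts.append(current)
            current = ch
        else:
            current += ch
    parts.append(current)
    return parts, scores


parts, scores = split_by_score("信号入出力端子")
print(parts)
for position, score in scores:
    print(position, score)
```

Because the inputs here are the two-decimal probabilities from the text, the printed products may differ in the last digit from the rounded values listed in the description.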

【００９０】The methods of dividing compound words described above focus on the characters before and after a division point. For phonetic characters such as hiragana and katakana, however, it is difficult to determine a division point from just two characters. In other words, for compound words consisting of hiragana or katakana, the computation for determining division points is wasted.

[0091] Accordingly, when the document retrieval device 1 divides a compound word into constituent words by focusing on the characters before and after division points as described above, the compound word dividing means may divide only compound words consisting of a specific preset character type, such as kanji. This omits the division of compound words written in hiragana or katakana and reduces the processing load and time.
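A minimal sketch of such a character-type filter is shown below. It assumes that "kanji" can be approximated by the basic CJK Unified Ideographs block (U+4E00 to U+9FFF); the function names are hypothetical, and the specification does not say how the character type is determined.

```python
def is_kanji(ch):
    """Rough kanji test: basic CJK Unified Ideographs block only (assumption)."""
    return "\u4e00" <= ch <= "\u9fff"


def should_attempt_split(compound):
    """Attempt division only for compound words made entirely of kanji."""
    return len(compound) > 0 and all(is_kanji(ch) for ch in compound)


print(should_attempt_split("信号入出力端子"))  # all kanji
print(should_attempt_split("データベース"))    # katakana, so division is skipped
```

Running the check before any dictionary lookup means that hiragana-only or katakana-only words never reach the division step, which is the saving the paragraph describes.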

[0092] Various methods of dividing a compound word into constituent words have been illustrated here. For example, the method that discriminates character types requires no dictionary to be created in advance and no dictionary lookup, but it cannot divide a compound word consisting of a single character type. The methods that refer to a dictionary, on the other hand, can divide a compound word consisting of a single character type, but they require a dictionary to be created in advance and looked up. Among the dictionary-based methods, there tends to be a trade-off between word division accuracy on the one hand and data capacity and processing load on the other. The methods described above are therefore preferably selected according to conditions such as the user's needs and the specifications of the device.

[0093]

[Effects of the Invention] The document retrieval device of the invention according to claim 1 is provided with language analysis means that extracts independent words and accessory words, by language analysis, from a retrieval request input in natural language, and condition generation means that generates a retrieval condition by converting each extracted accessory word into one of a plurality of preset operators and combining it with the corresponding independent words. Consequently, even when there are multiple accessory words, each is converted into an appropriate operator, and even when there are many independent words, they are combined appropriately by the operators. An appropriate retrieval condition can thus be generated from the natural-language retrieval request input by the user, and document data can be retrieved satisfactorily.

[0094] In the document retrieval device of the invention according to claim 2, the operators of the condition generation means are set as AND, OR, and AND NOT. The retrieval request desired by the user can therefore be reflected appropriately in the retrieval condition, and a wide range of user retrieval requests can be accommodated.

[0095] The document retrieval device of the invention according to claim 3 is provided with compound word detection means that detects compound words among the independent words extracted from the retrieval request, and compound word dividing means that divides each detected compound word into a plurality of constituent words. Even when an independent word in the retrieval request input by the user is a compound word unsuitable for data retrieval, it is automatically divided into constituent words suitable for data retrieval, so document data can be retrieved satisfactorily.

[0096] The document retrieval device of the invention according to claim 4 is provided with a constituent word dictionary in which the constituent words of compound words are stored in advance, and the compound word dividing means divides a compound word into a plurality of constituent words by referring to this constituent word dictionary. Even when an independent word in the retrieval request input by the user is a compound word unsuitable for data retrieval, it is automatically divided into constituent words suitable for data retrieval, so document data can be retrieved satisfactorily.

[0097] The document retrieval device of the invention according to claim 5 is provided with a division character dictionary in which character strings located at division points of compound words are stored in advance, and the compound word dividing means divides a compound word into a plurality of constituent words by referring to this division character dictionary. Even when an independent word in the retrieval request input by the user is a compound word unsuitable for data retrieval, it is automatically divided into constituent words suitable for data retrieval, so document data can be retrieved satisfactorily.

[0098] The document retrieval device of the invention according to claim 6 is provided with a division permission table in which it is set in advance whether each combination of character types located before and after a division point of a compound word is permitted, and the compound word dividing means divides a compound word into a plurality of constituent words by referring to this table. For example, if the kanji/hiragana combination is set as not permitted, inappropriate divisions such as separating okurigana (kana suffixes) from their kanji can be prevented, so the process of dividing a compound word into constituent words is performed appropriately.
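A division permission table of this kind might be sketched as below. Only the kanji/hiragana entry comes from the text; the other table entries, the Unicode ranges used to classify characters, the default of "not permitted" for unlisted pairs, and all names are assumptions for illustration.

```python
def char_type(ch):
    """Classify a character by Unicode block (a simplification, assumed here)."""
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    return "other"


# (type before the point, type after the point) -> division permitted?
SPLIT_ALLOWED = {
    ("kanji", "hiragana"): False,  # from the text: keeps okurigana attached
    ("hiragana", "kanji"): True,   # hypothetical entry
    ("kanji", "katakana"): True,   # hypothetical entry
    ("katakana", "kanji"): True,   # hypothetical entry
}


def can_split_between(before, after, table=SPLIT_ALLOWED):
    """Look up whether a division point between two characters is permitted."""
    # Pairs that are not listed are treated as not permitted (assumption).
    return table.get((char_type(before), char_type(after)), False)


print(can_split_between("送", "り"))  # kanji followed by hiragana
print(can_split_between("タ", "検"))  # katakana followed by kanji
```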

[0099] The document retrieval device of the invention according to claim 7 is provided with a pre/post-division dictionary in which characters located immediately before and characters located immediately after division points of compound words are individually stored in advance, and the compound word dividing means divides a compound word into a plurality of constituent words by referring to this pre/post-division dictionary. If characters likely to appear at the beginning or end of constituent words are anticipated and stored in the pre/post-division dictionary, compound words can be divided well. Because there is no need to anticipate and store a huge number of constituent words, the capacity of the dictionary can also be reduced.

[0100] In the document retrieval device of the invention according to claim 8, a probability of being located at the end of a word and a probability of being located at the beginning of a word are set for each character stored in the pre/post-division dictionary, and the compound word dividing means divides a compound word into a plurality of constituent words by referring to these probabilities. If characters are anticipated and stored in the pre/post-division dictionary together with their probabilities of appearing at the beginning or end of constituent words, the likelihood of each division point can be evaluated from these probabilities when a compound word is divided, so compound words can be divided still more reliably.

[0101] In the document retrieval device of the invention according to claim 9, the compound word dividing means divides only compound words consisting of a specific preset character type. For example, if the device is set to divide only compound words consisting of kanji, processing in which a division point is difficult to determine from two characters, as with compound words written in katakana, is omitted, and the processing load and time can be reduced.

[0102] The document retrieval device of the invention according to claim 10 is provided with a synonym dictionary in which combinations of synonymous words are stored in advance, synonym detection means that detects words stored in this synonym dictionary among the independent words extracted from the retrieval request, and synonym expansion means that expands each detected independent word into its other synonyms. Even when an independent word in the retrieval request input by the user is unsuitable for data retrieval, any synonyms that exist for it are also used in the data retrieval, so document data can be retrieved satisfactorily.

[0103] The dictionary creating device of the invention according to claim 11 comprises: word acquisition means that acquires words from document data; character classification means that classifies the characters constituting the acquired words; frequency calculation means that calculates, for each classified character, an appearance frequency with which it appears in words, a tail frequency with which it is located at the end of a word, and a head frequency with which it is located at the beginning of a word; probability calculation means that calculates, for each character, a tail probability, which is the ratio of the tail frequency to the appearance frequency, and a head probability, which is the ratio of the head frequency to the appearance frequency; and character setting means that sets, in the pre/post-division dictionary, characters with a high tail probability as characters located immediately before a division point of a compound word and characters with a high head probability as characters located immediately after a division point. Characters located before and after division points of compound words can thus be detected automatically from document data and set in the pre/post-division dictionary, so the user does not have to perform this laborious work.
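The frequency and probability calculations can be sketched as follows. The word list, the selection threshold of 0.5, and all function names are hypothetical; the specification only says that characters with a "high" probability are set in the dictionary and does not give a cut-off.

```python
from collections import Counter


def build_probabilities(words):
    """Return (appearance, tail_prob, head_prob) for every character in `words`."""
    appearance = Counter()  # occurrences of each character anywhere in a word
    tail = Counter()        # occurrences as the last character of a word
    head = Counter()        # occurrences as the first character of a word
    for word in words:
        if not word:
            continue
        appearance.update(word)
        head[word[0]] += 1
        tail[word[-1]] += 1
    # Probability = positional frequency divided by appearance frequency.
    tail_prob = {ch: tail[ch] / n for ch, n in appearance.items()}
    head_prob = {ch: head[ch] / n for ch, n in appearance.items()}
    return appearance, tail_prob, head_prob


def select_characters(prob, threshold):
    """Pick the characters whose probability reaches the (assumed) threshold."""
    return {ch for ch, p in prob.items() if p >= threshold}


# Hypothetical words standing in for those acquired from document data.
words = ["信号", "入力", "出力", "端子", "入口"]
appearance, tail_prob, head_prob = build_probabilities(words)
before_point = select_characters(tail_prob, 0.5)  # likely word-final characters
after_point = select_characters(head_prob, 0.5)   # likely word-initial characters
print(before_point, after_point)
```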

[0104] The dictionary creating device of the invention according to claim 12 is provided with character determination means that determines characters whose calculated appearance frequency is low, and probability setting means that replaces the tail probability and the head probability of such low-frequency characters with the average tail probability and the average head probability of all characters, respectively. Characters located before and after division points of compound words can thus be set in the pre/post-division dictionary together with their probabilities, and the document retrieval device can use these probabilities to divide compound words well.
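The replacement of unreliable probabilities might look like the sketch below. The cut-off `min_count` that defines a "low" appearance frequency, the function name, and the sample figures are assumptions; the specification does not state how low a frequency must be.

```python
def smooth_low_frequency(appearance, tail_prob, head_prob, min_count=5):
    """Replace probabilities of rare characters with the averages over all characters."""
    mean_tail = sum(tail_prob.values()) / len(tail_prob)
    mean_head = sum(head_prob.values()) / len(head_prob)
    new_tail = dict(tail_prob)
    new_head = dict(head_prob)
    for ch, count in appearance.items():
        if count < min_count:  # "low appearance frequency" (assumed cut-off)
            new_tail[ch] = mean_tail
            new_head[ch] = mean_head
    return new_tail, new_head


# Hypothetical figures: "b" was seen only once, so its probabilities are unreliable.
appearance = {"a": 10, "b": 1}
tail_prob = {"a": 0.5, "b": 1.0}
head_prob = {"a": 0.25, "b": 0.0}
new_tail, new_head = smooth_low_frequency(appearance, tail_prob, head_prob)
print(new_tail, new_head)
```

A character seen once would otherwise get a probability of exactly 0 or 1, so substituting the averages keeps a single observation from forcing or forbidding a division point.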

[Brief description of drawings]

[Fig. 1] A block diagram showing a first embodiment of the document retrieval device of the present invention.

[Fig. 2] A schematic diagram showing a natural-language retrieval request after language analysis.

[Fig. 3] A schematic diagram showing a retrieval condition generated from the retrieval request.

[Fig. 4] A schematic diagram showing, in a second embodiment of the document retrieval device, a retrieval condition in which synonyms are added to one of the independent words.

[Fig. 5] A schematic diagram showing, in a third embodiment of the document retrieval device, a retrieval condition in which one compound word that is an independent word is divided into three constituent words.

[Fig. 6] A schematic diagram showing, in a modification of the document retrieval device, a retrieval condition that combines division of a compound word with addition of synonyms.

[Fig. 7] A schematic diagram showing, in another modification of the document retrieval device, a natural-language retrieval request after language analysis.

[Fig. 8] A schematic diagram showing a retrieval condition generated from the retrieval request.

[Fig. 9] A schematic diagram showing a retrieval condition in which one compound word is divided into two constituent words.

[Fig. 10] A block diagram showing an embodiment of the dictionary creating device of the present invention.

[Explanation of symbols]

1 Document retrieval device
2 Database
4 Request input means
5 Language analysis means
6 Condition generation means
7 Data retrieval means
11 Dictionary creating device
14 Word acquisition means
15 Character classification means
16 Frequency calculation means
17 Probability calculation means
18 Character setting means

─────────────────────────────────────────────────────

[Written Procedural Amendment]

[Submission date] July 25, 1995

[Procedural Amendment 1]

[Document to be amended] Specification

[Item to be amended] Claim 5

[Method of amendment] Change

[Content of amendment]

[Procedural Amendment 2]

[Document to be amended] Specification

[Item to be amended] Claim 7

[Method of amendment] Change

[Content of amendment]

[Procedural Amendment 3]

[Document to be amended] Specification

[Item to be amended] Claim 8

[Method of amendment] Change

[Content of amendment]

[Procedural Amendment 4]

[Document to be amended] Specification

[Item to be amended] Claim 11

[Method of amendment] Change

[Content of amendment]

[Procedural Amendment 5]

[Document to be amended] Specification

[Item to be amended] 0001

[Method of amendment] Change

[Content of amendment]

[0001]

[Industrial Field of Application] The present invention relates to a document retrieval device that retrieves document data from a database, and to a dictionary creating device that creates a pre/post-division character dictionary used by this document retrieval device.

[Procedural Amendment 6]

[Document to be amended] Specification

[Item to be amended] 0016

[Method of amendment] Change

[Content of amendment]

[0016] The document retrieval device of the invention according to claim 5 is the document retrieval device of the invention according to claim 3, further provided with a division character dictionary in which character strings that can be division points of compound words are stored in advance, wherein the compound word dividing means divides a compound word into a plurality of constituent words by referring to this division character dictionary.

[Procedural Amendment 7]

[Document to be amended] Specification

[Item to be amended] 0018

[Method of amendment] Change

[Content of amendment]

[0018] The document retrieval device of the invention according to claim 7 is the document retrieval device of the invention according to claim 3, further provided with a pre/post-division character dictionary in which characters located immediately before and characters located immediately after division points of compound words are individually stored in advance, wherein the compound word dividing means divides a compound word into a plurality of constituent words by referring to this pre/post-division character dictionary.

[Procedural Amendment 8]

[Document to be amended] Specification

[Item to be amended] 0019

[Method of amendment] Change

[Content of amendment]

[0019] The document retrieval device of the invention according to claim 8 is the document retrieval device of the invention according to claim 7, wherein a probability of being located at the end of a word and a probability of being located at the beginning of a word are set for each character stored in the pre/post-division character dictionary, and the compound word dividing means divides a compound word into a plurality of constituent words by referring to these probabilities.

[Procedural Amendment 9]

[Document to be amended] Specification

[Item to be amended] 0022

[Method of amendment] Change

[Content of amendment]

[0022] The dictionary creating device of the invention according to claim 11 comprises: word acquisition means that acquires words from document data; character classification means that classifies the characters constituting the acquired words; frequency calculation means that calculates, for each classified character, an appearance frequency with which it appears in words, a tail frequency with which it is located at the end of a word, and a head frequency with which it is located at the beginning of a word; probability calculation means that calculates, for each character, a tail probability, which is the ratio of the tail frequency to the appearance frequency, and a head probability, which is the ratio of the head frequency to the appearance frequency; and divisibility determination means that sets, in the pre/post-division character dictionary, characters with a high tail probability as characters located immediately before a division point of a compound word and characters with a high head probability as characters located immediately after a division point.

[Procedural Amendment 10]

[Document to be amended] Specification

[Item to be amended] 0028

[Method of amendment] Change

[Content of amendment]

[0028] In the document retrieval device of the invention according to claim 5, when the compound word dividing means divides a compound word into a plurality of constituent words, it refers to the character strings, preset in the division character dictionary, that can be division points of compound words, so the compound word is divided at the positions of the anticipated character strings.

[Procedural Amendment 11]

[Document to be amended] Specification

[Item to be amended] 0030

[Method of amendment] Change

[Content of amendment]

[0030] In the document retrieval device of the invention according to claim 7, when the compound word dividing means divides a compound word into a plurality of constituent words, it refers to the characters located immediately before and immediately after division points of compound words, which are individually preset in the pre/post-division character dictionary, so the compound word is divided according to the anticipated characters before and after division points.

[Procedural Amendment 12]

[Document to be amended] Specification

[Item to be amended] 0031

[Method of amendment] Change

[Content of amendment]

[0031] In the document retrieval device of the invention according to claim 8, when the compound word dividing means divides a compound word into a plurality of constituent words, it refers to the probability of being located at the end of a word and the probability of being located at the beginning of a word that are set for each character stored in the pre/post-division character dictionary, so the compound word is divided according to the probabilities of the characters before and after each division point.

[Procedural Amendment 13]

[Document to be amended] Specification

[Item to be amended] 0034

[Method of amendment] Change

[Content of amendment]

[0034] In the dictionary creating device of the invention according to claim 11, when the word acquisition means acquires words from document data, the character classification means classifies the characters constituting the acquired words. For each classified character, the frequency calculation means calculates the appearance frequency with which it appears in words, the tail frequency with which it is located at the end of a word, and the head frequency with which it is located at the beginning of a word. The probability calculation means then calculates, for each character, the tail probability, which is the ratio of the tail frequency to the appearance frequency, and the head probability, which is the ratio of the head frequency to the appearance frequency. The divisibility determination means sets characters with a high tail probability as characters located immediately before a division point of a compound word, and characters with a high head probability as characters located immediately after a division point, in the pre/post-division character dictionary. The characters located immediately before and immediately after division points of compound words are thus stored appropriately in this pre/post-division character dictionary.

[Procedural Amendment 14]

[Document to be amended] Specification

[Item to be amended] 0035

[Method of amendment] Change

[Content of amendment]

[0035] In the dictionary creating device of the invention according to claim 12, when the character determination means determines characters whose calculated appearance frequency is low, the probability setting means replaces the tail probability and the head probability of these low-frequency characters with the average tail probability and the average head probability of all characters, respectively, so unreliable probabilities are not set in the pre/post-division character dictionary.

[Procedural Amendment 15]

[Document to be amended] Specification

[Item to be amended] 0061

[Method of amendment] Change

[Content of amendment]

[0061] Furthermore, a division character dictionary in which character strings that can be division points of compound words are stored in advance may be provided, and the compound word dividing means of the condition generation unit 6 may divide a compound word into a plurality of constituent words by referring to this division character dictionary. For example, if the character string "示装" is set in the division character dictionary, the compound word "表示装置" (display device), which consists of a single character type, is divided into the constituent words "表示" (display) and "装置" (device), so document data can be retrieved more satisfactorily.
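A minimal sketch of this lookup follows. It assumes each dictionary entry is a two-character string whose division point lies between its two characters, which matches the "示装" example but is not stated as a general rule; the function name is hypothetical.

```python
def split_with_boundary_strings(compound, boundary_strings):
    """Split a compound word at every occurrence of a registered boundary string."""
    cut_points = set()
    for entry in boundary_strings:
        pos = compound.find(entry)
        while pos != -1:
            # Assumed: the division point falls after the entry's first character.
            cut_points.add(pos + 1)
            pos = compound.find(entry, pos + 1)
    parts = []
    start = 0
    for cut in sorted(cut_points):
        parts.append(compound[start:cut])
        start = cut
    parts.append(compound[start:])
    return parts


print(split_with_boundary_strings("表示装置", {"示装"}))
```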

[Procedural Amendment 16]

[Document to be amended] Specification

[Item to be amended] 0068

[Method of amendment] Change

[Content of amendment]

[0068] However, such a constituent word dictionary inevitably requires a huge number of words to be set appropriately, and creating it is expected to be difficult. To solve this problem, as shown in Table 2 below, a pre/post-division character dictionary in which characters located immediately before and characters located immediately after division points of compound words are individually stored in advance may be provided, and the compound word dividing means of the condition generation unit 6 may divide a compound word into a plurality of constituent words by referring to this pre/post-division character dictionary.

[Procedural Amendment 17]

[Document to be amended] Specification

[Item to be amended] 0070

[Method of amendment] Change

[Content of amendment]

[0070] That is, because a compound word is divided into a plurality of constituent words, the last character of a constituent word is located immediately before each division point, and the first character of a constituent word is located immediately after it. Characters that are likely to appear at the beginning or end of a word can be anticipated, so if they are stored in the pre/post-division character dictionary, a compound word can be divided well into its constituent words.

[Procedural Amendment 18]

[Document to be amended] Specification

[Item to be amended] 0071

[Method of amendment] Change

[Content of amendment]

[0071] More specifically, as shown in Table 2 above, when "入, 出, 端, …" are set as characters immediately before a division point and "的, 化, 号, 力, …" are set as characters immediately after a division point, the aforementioned "信号入出力端子" (signal input/output terminal) is divided at the positions "号／入" and "力／端" into "信号" (signal), "入出力" (input/output), and "端子" (terminal). In other words, when the compound word dividing means refers to a pre/post-division character dictionary such as that described above, a compound word consisting of a single character type can be divided well into a plurality of constituent words without a huge number of words having to be set.
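The lookup can be sketched as follows. The stated split positions (号／入 and 力／端) have 号 and 力 to the left of the division point and 入 and 端 to the right, so this sketch places the characters in the two sets accordingly; that assignment, the rule that both neighbouring characters must match, and all names are assumptions made to reproduce the stated result.

```python
# Characters assumed to end a constituent word (left of a division point).
LEFT_OF_POINT = {"的", "化", "号", "力"}
# Characters assumed to begin a constituent word (right of a division point).
RIGHT_OF_POINT = {"入", "出", "端"}


def split_by_boundary_chars(compound, left_chars, right_chars):
    """Split where the previous character is in left_chars and the next is in right_chars."""
    parts = []
    start = 0
    for i in range(1, len(compound)):
        if compound[i - 1] in left_chars and compound[i] in right_chars:
            parts.append(compound[start:i])
            start = i
    parts.append(compound[start:])
    return parts


print(split_by_boundary_chars("信号入出力端子", LEFT_OF_POINT, RIGHT_OF_POINT))
```

Requiring a match on both sides is what keeps 入／出 from becoming a division point here: 出 is a registered word-initial character, but 入 is not a registered word-final one.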

[Procedural Amendment 19]

[Document to be amended] Specification

[Item to be amended] 0072

[Method of amendment] Change

[Content of amendment]

[0072] However, such a pre/post-division character dictionary also requires the first and last characters of constituent words to be anticipated and set appropriately, and creating it demands specialist skill. To solve this problem, it is desirable for the pre/post-division character dictionary to be created automatically by a dictionary creating device.

[Procedural Amendment 20]

[Document to be amended] Specification

[Item to be amended] 0073

[Method of amendment] Change

[Content of amendment]

[0073] An embodiment of the dictionary creating device of the present invention is described below with reference to Fig. 10. The dictionary creating device 11 of this embodiment stores characters, online or offline, in the pre/post-division character dictionary 12, which consists of a storage device such as an HD. To this end, the dictionary creating device 11 of this embodiment has a document input unit 13 serving as document input means, a word acquisition unit 14 serving as word acquisition means, a character detection unit 15 serving as character classification means, a frequency calculation unit 16 serving as frequency calculation means, a probability calculation unit 17 serving as probability calculation means, and a character setting unit 18 serving as divisibility determination means. The character setting unit 18 is connected to the pre/post-division character dictionary 12.

【手続補正２１】[Procedure amendment 21]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００７４[Correction target item name] 0074

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【００７４】前記文書入力部１３は、例えば、電子ファ
イルシステムなどからなり、自然言語により表現された
テキストファイルの文書データを入力する。前記単語取
得部１４は、形態素解析や構文解析などの言語解析によ
り、文書データから単語を取得し、前記文字検出部１５
は、取得された単語を構成する文字を分別する。前記頻
度算出部１６は、分別された文字の各々に対し、単語に
出現する出現頻度、単語の末尾に位置する末尾頻度、先
頭に位置する先頭頻度、を各々算出し、前記確率算出部
１７は、出現頻度に対する末尾頻度の割合である末尾確
率と、出現頻度に対する先頭頻度の割合である先頭確率
とを、分別された文字の各々に対して算出する。前記文
字設定部１８は、末尾確率が高い文字と先頭確率が高い
文字とを、複合語の分割点の直前に位置する文字と直後
に位置する文字として、前記分割前後文字辞書１２に各
々設定する。The document input unit 13 consists of, for example, an electronic file system, and inputs document data in the form of a text file written in natural language. The word acquisition unit 14 acquires words from the document data by language analysis such as morphological analysis or syntactic analysis, and the character detection unit 15 separates out the characters that make up the acquired words. For each separated character, the frequency calculation unit 16 calculates an appearance frequency (how often the character appears in words), a tail frequency (how often it is located at the end of a word), and a head frequency (how often it is located at the beginning of a word). For each separated character, the probability calculation unit 17 calculates a tail probability, which is the ratio of the tail frequency to the appearance frequency, and a head probability, which is the ratio of the head frequency to the appearance frequency. The character setting unit 18 sets characters with a high tail probability and characters with a high head probability in the pre/post-division character dictionary 12 as characters located immediately before and immediately after a division point of a compound word, respectively.
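The frequency and probability calculation described for units 16 and 17 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `char_probabilities` and the sample word list are assumptions, and the words are taken as already extracted by the word acquisition step.

```python
from collections import Counter

def char_probabilities(words):
    """Return {char: (tail_probability, head_probability)} for a list of words."""
    appear, head, tail = Counter(), Counter(), Counter()
    for word in words:
        if not word:
            continue
        appear.update(word)   # appearance frequency: every occurrence of each character
        head[word[0]] += 1    # head frequency: character at the beginning of the word
        tail[word[-1]] += 1   # tail frequency: character at the end of the word
    # Each probability is the corresponding frequency divided by the appearance frequency.
    return {c: (tail[c] / n, head[c] / n) for c, n in appear.items()}

# Hypothetical sample words, not taken from the patent's tables.
probs = char_probabilities(["信号", "入力", "出力", "端子", "入出力"])
```

With this sample, 入 appears twice and is word-initial both times, so its head probability is 1.0 and its tail probability is 0.0.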

【手続補正２２】[Procedure amendment 22]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００７５[Correction target item name] 0075

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【００７５】このような構成において、本実施例の辞書
作成装置１１は、適当な文書データをサンプルとして入
力すると、単語の先頭や末尾に位置する文字が自動的に
検出される。このように検出された文字は、複合語の分
割点の前後に位置する文字として分割前後文字辞書１２
に設定されるので、文書検索装置１の分割前後文字辞書
１２を自動的に作成することができる。With this configuration, when suitable document data is input as a sample, the dictionary creating device 11 of this embodiment automatically detects the characters located at the beginning and end of words. Because the detected characters are set in the pre/post-division character dictionary 12 as characters located before and after division points of compound words, the pre/post-division character dictionary 12 of the document retrieval device 1 can be created automatically.

【手続補正２３】[Procedure amendment 23]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００８１[Correction target item name] 0081

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【００８１】文字設定部１８は、末尾確率が高い文字と
先頭確率が高い文字とを、複合語の分割点の直前に位置
する文字と直後に位置する文字として、分割前後文字辞
書１２に各々設定する。例えば、この設定の閾値を“0.
50”とすると、この“0.50”より末尾確率が高い文字が
複合語の直前に位置する文字として分割前後文字辞書１
２に設定され、“0.50”より先頭確率が高い文字が複合
語の直後に位置する文字として分割前後文字辞書１２に
設定される。この場合、上記の表４に示すように、
“信，入，出，端，…”などが分割点の直後に位置する
文字として設定され、“号，力，子，…”などが複合語
の直前に位置する文字として設定される。The character setting unit 18 sets characters with a high tail probability and characters with a high head probability in the pre/post-division character dictionary 12 as characters located immediately before and immediately after a division point of a compound word, respectively. For example, if the threshold for this setting is "0.50", characters whose tail probability is higher than "0.50" are set in the pre/post-division character dictionary 12 as characters located immediately before a division point, and characters whose head probability is higher than "0.50" are set as characters located immediately after a division point. In this case, as shown in Table 4 above, "信, 入, 出, 端, …" and so on are set as characters located immediately after a division point, and "号, 力, 子, …" and so on are set as characters located immediately before a division point.

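The threshold-based setting in paragraph 0081 can be sketched as below. The function name `build_dictionary` and the strict greater-than comparison are assumptions made for illustration; the probability values are the ones that can be read off the worked example in paragraph 0089, given as (tail probability, head probability).

```python
def build_dictionary(probs, threshold=0.50):
    """Split characters into 'before a division point' and 'after a division point' sets.

    probs maps each character to (tail_probability, head_probability).
    """
    # High tail probability: the character tends to end a word,
    # so it sits immediately before a division point.
    before = {c for c, (tail_p, _) in probs.items() if tail_p > threshold}
    # High head probability: the character tends to start a word,
    # so it sits immediately after a division point.
    after = {c for c, (_, head_p) in probs.items() if head_p > threshold}
    return before, after

probs = {
    "号": (0.88, 0.00),
    "入": (0.30, 0.70),
    "出": (0.17, 0.67),
    "端": (0.25, 0.75),
}
before, after = build_dictionary(probs)
# before == {"号"}; after == {"入", "出", "端"}
```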
【手続補正２４】[Procedure amendment 24]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００８２[Correction target item name] 0082

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【００８２】つまり、本実施例の辞書作成装置１１で
は、サンプルとして入力した文書データから、複合語の
分割点の前後の文字を自動的に検出して分割前後文字辞
書１２に設定することができるので、専門的な能力が要
求される煩雑な作業をユーザが実行する必要がない。特
に、ユーザが文書検索装置１での検索に利用する文書デ
ータと同様な内容の文書データをサンプルとして辞書作
成装置１１に入力すれば、分割前後文字辞書１２をユー
ザの作業に最適な形態に作成することができる。In other words, the dictionary creating device 11 of this embodiment automatically detects the characters before and after division points of compound words from document data input as a sample and sets them in the pre/post-division character dictionary 12, so the user does not need to carry out cumbersome work that demands specialized skill. In particular, if the user inputs to the dictionary creating device 11, as a sample, document data similar in content to the document data to be searched with the document retrieval device 1, the pre/post-division character dictionary 12 can be created in the form best suited to the user's work.

【手続補正２５】[Procedure amendment 25]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００８３[Name of item to be corrected] 0083

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【００８３】なお、上述のようにして辞書作成装置１１
により分割前後文字辞書１２を作成する場合、サンプル
として入力する文書データが多量であるほど、分割前後
文字辞書１２の設定内容は正確になるが、これでは作業
の負担と時間とが増大する。この点、文書データが少量
であれば作業の負担や時間は減少するが、この場合は分
割前後文字辞書１２の確度が期待できない。When the pre/post-division character dictionary 12 is created by the dictionary creating device 11 as described above, the larger the amount of document data input as a sample, the more accurate the contents of the pre/post-division character dictionary 12 become, but the workload and time increase accordingly. Conversely, a small amount of document data reduces the workload and time, but in that case the pre/post-division character dictionary 12 cannot be expected to be accurate.

【手続補正２６】[Procedure Amendment 26]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００８７[Correction target item name] 0087

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【００８７】この場合、全部の文字の出現頻度が“4
0”、末尾頻度が“14”、先頭頻度が“14”なので、全
部の文字の末尾確率の平均値は“14／40＝0.35”とな
り、先頭確率の平均値も“0.35”となる。そこで、上述
のように出現確率が低い文字の確率を上述のような平均
値に置換することにより、不適な確率の設定を防止する
ことができ、分割前後文字辞書１２の確度を改善するこ
とができる。In this case, the appearance frequency summed over all characters is "40", the tail frequency is "14", and the head frequency is "14", so the average tail probability over all characters is "14/40 = 0.35", and the average head probability is likewise "0.35". By replacing the probabilities of characters with a low appearance probability, as described above, with these average values, inappropriate probability settings can be prevented and the accuracy of the pre/post-division character dictionary 12 can be improved.
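The replacement of rare characters' probabilities with the overall averages can be sketched as follows. The function name `smooth_rare`, the `min_count` cutoff for deciding that a character is rare, and the per-character counts are all assumptions for illustration; only the totals (40, 14, 14) follow the paragraph above.

```python
def smooth_rare(freqs, min_count):
    """freqs maps char -> (appearance, tail, head) counts.

    Returns char -> (tail_probability, head_probability), where characters
    seen fewer than min_count times receive the averages over all characters.
    """
    total_a = sum(a for a, _, _ in freqs.values())
    if total_a == 0:
        return {}
    avg_tail = sum(t for _, t, _ in freqs.values()) / total_a
    avg_head = sum(h for _, _, h in freqs.values()) / total_a
    result = {}
    for c, (a, t, h) in freqs.items():
        if a < min_count:
            result[c] = (avg_tail, avg_head)   # too few samples: use the averages
        else:
            result[c] = (t / a, h / a)         # enough samples: use the character's own ratios
    return result

# Hypothetical counts chosen so the totals are 40, 14 and 14.
freqs = {"力": (20, 8, 1), "入": (19, 5, 13), "稀": (1, 1, 0)}
result = smooth_rare(freqs, min_count=2)
```

Here 稀 appears only once, so instead of a tail probability of 1.0 from a single sample it receives the average 14/40 = 0.35 for both probabilities.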

【手続補正２７】[Procedure Amendment 27]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００８８[Correction target item name] 0088

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【００８８】なお、このように複合語の分割点の直前や
直後に位置する文字を分割前後文字辞書１２に設定する
ことにより、文書検索装置１の複合語分割手段は、複合
語を複数の構成単語に良好に分割することができるが、
上述のように辞書作成装置１１が算定した末尾確率や先
頭確率を文字と共に分割前後文字辞書１２に設定し、文
書検索装置１の複合語分割手段が、これらの確率を参照
して複合語分割手段が複合語を複数の構成単語に分割す
ることも可能である。By setting the characters located immediately before and after division points of compound words in the pre/post-division character dictionary 12 in this way, the compound word dividing means of the document retrieval device 1 can divide a compound word into a plurality of constituent words satisfactorily. It is also possible to set the tail probabilities and head probabilities calculated by the dictionary creating device 11 in the pre/post-division character dictionary 12 together with the characters, so that the compound word dividing means of the document retrieval device 1 divides a compound word into constituent words with reference to these probabilities.

【手続補正２８】[Procedure amendment 28]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００８９[Correction target item name] 0089

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【００８９】より具体的には、分割前後文字辞書１２に
は、前述した表４に示すように、文字の各々に先頭確率
と末尾確率とが設定されるので、文書検索装置１は、複
合語の分割点の前後の文字の確率を乗算して分割評価値
とし、この分割評価値を閾値と比較して分割の可否を決
定する。例えば、複合語である“信号入出力端子”を分
割する場合、その分割評価値は下記のように算定される
ので、
信／号 → 0.00×0.00 ＝０．００
号／入 → 0.88×0.70 ＝０．６１
入／出 → 0.30×0.67 ＝０．２０
出／力 → 0.17×0.06 ＝０．０１
力／端 → 0.50×0.75 ＝０．３８
端／子 → 0.25×0.00 ＝０．００
これらの分割評価値を閾値“0.30”と比較すると、“号／入”“力／端”の位置が分割点となり、これは“信号”“入出力”“端子”に分割される。More specifically, because a head probability and a tail probability are set for each character in the pre/post-division character dictionary 12 as shown in Table 4 above, the document retrieval device 1 multiplies the probabilities of the characters before and after each candidate division point of a compound word to obtain a division evaluation value, and compares this value with a threshold to decide whether to divide there. For example, when dividing the compound word "信号入出力端子" (signal input/output terminal), the division evaluation values are calculated as follows:
信/号 → 0.00 × 0.00 = 0.00
号/入 → 0.88 × 0.70 = 0.61
入/出 → 0.30 × 0.67 = 0.20
出/力 → 0.17 × 0.06 = 0.01
力/端 → 0.50 × 0.75 = 0.38
端/子 → 0.25 × 0.00 = 0.00
Comparing these division evaluation values with the threshold "0.30", the positions "号/入" and "力/端" become division points, and the compound word is divided into "信号" (signal), "入出力" (input/output), and "端子" (terminal).
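The division procedure of this paragraph can be sketched as follows, using the probabilities from the worked example. The function name `split_compound`, the reading of each product as (tail probability of the preceding character) × (head probability of the following character), and the strict greater-than comparison with the threshold are assumptions for illustration.

```python
def split_compound(word, tail_p, head_p, threshold=0.30):
    """Split a compound word wherever the division evaluation value exceeds the threshold."""
    if not word:
        return []
    parts, start = [], 0
    for i in range(1, len(word)):
        # Evaluation value for the boundary between word[i-1] and word[i].
        score = tail_p.get(word[i - 1], 0.0) * head_p.get(word[i], 0.0)
        if score > threshold:
            parts.append(word[start:i])
            start = i
    parts.append(word[start:])
    return parts

# Probabilities taken from the worked example above.
tail_p = {"信": 0.00, "号": 0.88, "入": 0.30, "出": 0.17, "力": 0.50, "端": 0.25}
head_p = {"号": 0.00, "入": 0.70, "出": 0.67, "力": 0.06, "端": 0.75, "子": 0.00}

parts = split_compound("信号入出力端子", tail_p, head_p)
# parts == ["信号", "入出力", "端子"]
```

Characters missing from either table are treated as having probability 0.0, which is a further assumption: it means a boundary next to an unknown character is never chosen as a division point.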

【手続補正２９】[Procedure amendment 29]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００９７[Correction target item name] 0097

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【００９７】請求項５記載の発明の文書検索装置は、複
合語の分割点に成り得る文字列が予め格納された分割文
字辞書を設け、この分割文字辞書を参照して複合語分割
手段が複合語を複数の構成単語に分割することにより、
ユーザが入力した検索要求の自立語がデータ検索に不適
な複合語でも、これがデータ検索に適切な構成単語に自
動的に分割されるので、文書データを良好に検索するこ
とができる。The document retrieval device of the invention according to claim 5 is provided with a division character dictionary in which character strings that can be division points of compound words are stored in advance, and the compound word dividing means divides a compound word into a plurality of constituent words with reference to this division character dictionary. Thus, even if an independent word in the search request input by the user is a compound word unsuitable for data retrieval, it is automatically divided into constituent words suitable for data retrieval, so that document data can be retrieved satisfactorily.

【手続補正３０】[Procedure amendment 30]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００９９[Correction target item name] 0099

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【００９９】請求項７記載の発明の文書検索装置は、複
合語の分割点の直前に位置する文字と直後に位置する文
字とが個々に予め格納された分割前後文字辞書を設け、
この分割前後文字辞書を参照して複合語分割手段が複合
語を複数の構成単語に分割することにより、構成単語の
先頭や末尾に位置しやすい文字を予想して分割前後文字
辞書に格納しておけば、良好に複合語を分割することが
でき、膨大な構成単語を予想して格納しておく必要がな
いので、辞書の容量を削減することもできる。The document retrieval device of the invention according to claim 7 is provided with a pre/post-division character dictionary in which characters located immediately before and characters located immediately after division points of compound words are individually stored in advance, and the compound word dividing means divides a compound word into a plurality of constituent words with reference to this pre/post-division character dictionary. Thus, if characters likely to be located at the beginning or end of constituent words are predicted and stored in the pre/post-division character dictionary, compound words can be divided satisfactorily, and because there is no need to predict and store a vast number of constituent words, the size of the dictionary can also be reduced.

【手続補正３１】[Procedure amendment 31]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】０１００[Correction target item name] 0100

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【０１００】請求項８記載の発明の文書検索装置は、分
割前後文字辞書に格納された文字の各々に末尾に位置す
る確率と単語の先頭に位置する確率とを設定し、これら
の確率を参照して複合語分割手段が複合語を複数の構成
単語に分割することにより、文字を構成単語の先頭や末
尾に位置する確率と共に予想して分割前後文字辞書に格
納しておけば、これらの確率により分割点の確度を評価
して複合語を構成単語に分割することができるので、よ
り良好に複合語を分割することができる。In the document retrieval device of the invention according to claim 8, a probability of being located at the end of a word and a probability of being located at the beginning of a word are set for each character stored in the pre/post-division character dictionary, and the compound word dividing means divides a compound word into a plurality of constituent words with reference to these probabilities. Thus, if characters are predicted and stored in the pre/post-division character dictionary together with their probabilities of being located at the beginning or end of constituent words, the likelihood of each division point can be evaluated from these probabilities when dividing a compound word into constituent words, so compound words can be divided even more satisfactorily.

【手続補正３２】[Procedure amendment 32]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】０１０３[Correction target item name] 0103

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【０１０３】請求項１１記載の発明の辞書作成装置は、
文書データから単語を取得する単語取得手段と、取得さ
れた単語を構成する文字を分別する文字分別手段と、分
別された文字の各々に対して単語に出現する出現頻度と
単語の末尾に位置する末尾頻度と先頭に位置する先頭頻
度とを算出する頻度算出手段と、文字の各々に対して出
現頻度に対する末尾頻度の割合である末尾確率と先頭頻
度の割合である先頭確率とを算出する確率算出手段と、
分割前後文字辞書に末尾確率が高い文字を複合語の分割
点の直前に位置する文字として設定すると共に先頭確率
が高い文字を複合語の分割点の直後に位置する文字とし
て設定する分割性判断手段と、を有することにより、複
合語の分割点の前後に位置する文字を文書データから自
動的に検出して分割前後文字辞書に設定できるので、こ
のような煩雑な作業をユーザが行なう必要がない。The dictionary creating device of the invention according to claim 11 has word acquisition means for acquiring words from document data; character classification means for separating out the characters that make up the acquired words; frequency calculation means for calculating, for each separated character, an appearance frequency with which it appears in words, a tail frequency with which it is located at the end of a word, and a head frequency with which it is located at the beginning of a word; probability calculation means for calculating, for each character, a tail probability that is the ratio of the tail frequency to the appearance frequency and a head probability that is the ratio of the head frequency to the appearance frequency; and divisibility determination means for setting, in the pre/post-division character dictionary, characters with a high tail probability as characters located immediately before a division point of a compound word and characters with a high head probability as characters located immediately after a division point of a compound word. Thus, the characters located before and after division points of compound words can be automatically detected from document data and set in the pre/post-division character dictionary, so the user does not need to perform this cumbersome work.

【手続補正３３】[Procedure amendment 33]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】０１０４[Correction target item name] 0104

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【０１０４】請求項１２記載の発明の辞書作成装置は、
算出された出現頻度が低い文字を判定する文字判定手段
を設け、出現頻度が低い文字の末尾確率と先頭確率とを
全部の文字の末尾確率の平均値と先頭確率の平均値とに
各々置換する確率設定手段を設けたことにより、複合語
の分割点の前後に位置する文字を、その確率と共に分割
前後文字辞書に設定できるので、これらの確率により文
書検索装置が複合語を良好に分割することができる。The dictionary creating device of the invention according to claim 12 is provided with character determination means for determining characters whose calculated appearance frequency is low, and probability setting means for replacing the tail probability and head probability of such low-frequency characters with the average tail probability and the average head probability over all characters, respectively. Thus, the characters located before and after division points of compound words can be set in the pre/post-division character dictionary together with their probabilities, and the document retrieval device can use these probabilities to divide compound words satisfactorily.

【手続補正３４】[Procedure amendment 34]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】図面の簡単な説明[Name of item to be corrected] Brief description of the drawing

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【図面の簡単な説明】[Brief description of drawings]

【符号の説明】[Explanation of reference numerals]
１ 文書検索装置 (document retrieval device)
２ データベース (database)
４ 要求入力手段 (request input means)
５ 言語解析手段 (language analysis means)
６ 条件生成手段 (condition generation means)
７ データ検索手段 (data retrieval means)
１１ 辞書作成装置 (dictionary creating device)
１２ 分割前後文字辞書 (pre/post-division character dictionary)
１４ 単語取得手段 (word acquisition means)
１５ 文字分別手段 (character classification means)
１６ 頻度算出手段 (frequency calculation means)
１７ 確率算出手段 (probability calculation means)
１８ 分割性判断手段 (divisibility determination means)

【手続補正３５】[Procedure amendment 35]

【補正対象書類名】図面[Document name to be corrected] Drawing

【補正対象項目名】図１０[Name of item to be corrected] Fig. 10

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【図１０】 [Figure 10]

Claims

[Claims]

1. A document retrieval device comprising: a database in which a plurality of pieces of document data are stored in advance; request input means for inputting a search request for document data in natural language; language analysis means for extracting independent words and adjunct words by language analysis from the search request input in natural language; condition generation means for individually converting each of the extracted adjunct words into one of a plurality of preset operators and combining it with the corresponding independent word to generate a search condition; and data retrieval means for retrieving document data from the database according to the generated search condition.

2. The document search device according to claim 1, wherein the operator of the condition generation means is set to AND, OR, and AND NOT.

3. The document retrieval device according to claim 1 or 2, further comprising compound word detecting means for detecting a compound word among the independent words extracted from the search request, and compound word dividing means for dividing the detected compound word into a plurality of constituent words.

4. The document retrieval device according to claim 3, further comprising a constituent word dictionary in which constituent words of compound words are stored in advance, wherein the compound word dividing means divides a compound word into a plurality of constituent words with reference to this constituent word dictionary.

5. The document retrieval device according to claim 3, further comprising a division character dictionary in which character strings located at division points of compound words are stored in advance, wherein the compound word dividing means divides a compound word into a plurality of constituent words with reference to this division character dictionary.

6. The document retrieval device according to claim 3, further comprising a division permission/prohibition table in which the permission or prohibition of division is preset for each combination of character types located before and after a division point of a compound word, wherein the compound word dividing means divides a compound word into a plurality of constituent words with reference to this division permission/prohibition table.

7. The document retrieval device according to claim 3, further comprising a pre/post-division dictionary in which a character located immediately before a division point of a compound word and a character located immediately after the division point are individually stored in advance, wherein the compound word dividing means divides a compound word into a plurality of constituent words with reference to this pre/post-division dictionary.

8. The document retrieval device described, wherein a probability of being located at the end of a word and a probability of being located at the beginning of a word are set for each of the characters stored in the pre/post-division dictionary, and the compound word dividing means divides a compound word into a plurality of constituent words with reference to these probabilities.

9. The document retrieval device according to claim 7, wherein the compound word dividing means divides only a compound word consisting of a preset specific character type into a plurality of constituent words.

10. The document retrieval device according to claim 1, further comprising: a synonym dictionary in which combinations of words that are synonyms are stored in advance; synonym detection means for detecting, among the independent words extracted from the search request, a word stored in this synonym dictionary; and synonym expansion means for expanding the detected independent word into another synonym.

11. A dictionary creating device comprising: word acquisition means for acquiring words from document data; character classification means for separating out the characters constituting the acquired words; frequency calculation means for calculating, for each of the separated characters, an appearance frequency with which it appears in words, a tail frequency with which it is located at the end of a word, and a head frequency with which it is located at the beginning of a word; probability calculation means for calculating, for each character, a tail probability that is the ratio of the tail frequency to the appearance frequency and a head probability that is the ratio of the head frequency to the appearance frequency; and character setting means for setting, in a pre/post-division dictionary, characters with a high tail probability as characters located immediately before a division point of a compound word and characters with a high head probability as characters located immediately after a division point of a compound word.

12. The dictionary creating device according to claim 11, further comprising: character determination means for determining characters whose calculated appearance frequency is low; and probability setting means for replacing the tail probability and the head probability of the characters having a low appearance frequency with the average value of the tail probabilities and the average value of the head probabilities of all characters, respectively.