JPH09237281A

JPH09237281A - Text data retrieving device and its method

Info

Publication number: JPH09237281A
Application number: JP8044705A
Authority: JP
Inventors: Tomoyuki Tada; 多田　　智之; Hidenobu Kaneoka; 秀信金岡; Toshihiro Fujinami; 稔弘藤並
Original assignee: Omron Corp; Omron Tateisi Electronics Co
Current assignee: Omron Corp
Priority date: 1996-03-01
Filing date: 1996-03-01
Publication date: 1997-09-09

Abstract

PROBLEM TO BE SOLVED: To make it possible to register in a dictionary even an unregistered word that mixes character types such as hiragana (the cursive Japanese syllabary) and kanji (Chinese characters), by providing a text data search device with means for judging whether a keyword is a word registered in a dictionary file and means for registering the keyword in the dictionary file as a word when it is not registered. SOLUTION: A morphological analysis means 3 performs morphological analysis of input text data using a dictionary file 4. A search means 8 searches the words divided by the means 3 for a word that matches a keyword received by a keyword receiving means 7, and outputs the search result. A judging means judges whether the input keyword is a word not registered in the file 4, and if it is not registered, a registering means 9 registers the keyword in the file 4. In addition, the means 9 registers as a word a keyword that the means 7 has received a prescribed number of times or more.

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention] The present invention relates to a text data search device and a text data search method for searching text data using an input keyword.

[0002]

[Prior art] Recently, text data search devices have been proposed for text data written in languages such as Japanese that are not written with word separation (a style of writing in which words are delimited by spaces or the like). Such a device creates a word-level full-text search index in advance and, when a keyword is input, uses this index to search the text data. The device first performs morphological analysis, which uses a dictionary to divide the input text data into words, and then, based on the result of the morphological analysis, creates an index in which words such as nouns and verbs are registered. When a keyword is input, the device determines whether the created index contains the keyword. If the index contains the keyword, the device outputs the text data of the corresponding portion (the portion containing the keyword); if the index does not contain the keyword, the device displays a message or the like stating that no corresponding text data exists.

[0003] However, when morphological analysis is performed on text data containing a word that is not registered in the dictionary, word division fails. For example, if the word 「ら致」 (rachi, "abduction") is not registered in the dictionary and morphological analysis is performed on the text data 「ら致の容疑で逮捕された。」 ("was arrested on suspicion of abduction."), the text is divided into 「ら」「致」「の」「容疑」「で」「逮捕」「さ」「れ」「た」, and the word 「ら致」 is not registered in the index. Consequently, when a word not registered in the dictionary, such as 「ら致」, is input as a keyword, the text data cannot be searched correctly. For this reason, when a word not registered in the dictionary (hereinafter, an unregistered word) is detected in text data or the like, registering that unregistered word in the dictionary is important.
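The division failure described above can be reproduced with a small sketch. It uses a greedy longest-match segmenter as a stand-in for morphological analysis; that algorithm, the `segment` function, and the contents of `DICTIONARY` are assumptions made for illustration and are not taken from the patent.

```python
# Hypothetical mini-dictionary (word -> part of speech). 「ら致」 is deliberately absent.
DICTIONARY = {
    "の": "particle",
    "容疑": "noun",
    "で": "particle",
    "逮捕": "noun",
    "さ": "verb",
    "れ": "auxiliary verb",
    "た": "auxiliary verb",
}
PUNCTUATION = "。、"


def segment(text, dictionary):
    """Greedy longest-match segmentation; unknown characters become 1-character words."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i] in PUNCTUATION:
            i += 1
            continue
        # Try the longest dictionary entry that starts at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No entry matches: emit the single character as an unknown word.
            tokens.append(text[i])
            i += 1
    return tokens


tokens = segment("ら致の容疑で逮捕された。", DICTIONARY)
print(tokens)
print("ら致" in tokens)
```

Because 「ら致」 never appears as a single token, an index built from these tokens cannot answer a query for it, which is the problem the paragraph describes.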

[0004]

[Problem to be solved by the invention] One method of registering such unregistered words in the dictionary is to detect, as an unregistered word, any character string that consists of consecutive characters of the same character type, such as kanji or katakana, and is not registered in the dictionary, and then to register it in the dictionary. However, this method cannot detect as an unregistered word a word that mixes character types, such as 「ら致」 in the example above, which consists of hiragana and kanji. Such mixed-character-type unregistered words therefore cannot be registered in the dictionary either.
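To illustrate why the same-character-type method misses mixed-script words, the following sketch groups a string into runs of one character type and reports kanji-only or katakana-only runs that are absent from the dictionary. The Unicode ranges in `char_type`, the function names, and the contents of `known_words` are assumptions chosen for this example.

```python
from itertools import groupby


def char_type(ch):
    """Classify a character by Unicode block (a coarse, assumed classification)."""
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    return "other"


def same_type_candidates(text, dictionary):
    """Return kanji-only or katakana-only runs that are not in the dictionary."""
    candidates = []
    for kind, group in groupby(text, key=char_type):
        run = "".join(group)
        if kind in ("kanji", "katakana") and run not in dictionary:
            candidates.append(run)
    return candidates


known_words = {"容疑", "逮捕"}  # hypothetical dictionary contents
found = same_type_candidates("ら致の容疑で逮捕された。", known_words)
print(found)
```

The run boundary falls between 「ら」 (hiragana) and 「致」 (kanji), so only the fragment 「致」 is reported and the full word 「ら致」 is never proposed.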

[0005] An object of the present invention is to provide a text data search device and a text data search method that can register in the dictionary even unregistered words that mix character types, such as hiragana and kanji.

[0006]

[Means for solving the problem] The invention described in claim 1 is a text data search device comprising: morphological analysis means for performing morphological analysis of input text data using a dictionary file; keyword accepting means for accepting input of a keyword; and search means for searching the morphologically analyzed text data for a word that matches the keyword accepted by the keyword accepting means and outputting a search result; the device being characterized by further comprising: determination means for determining whether the keyword accepted by the keyword accepting means is a word registered in the dictionary file; and registration means for registering the keyword as a word in the dictionary file when the accepted keyword is a word not registered in the dictionary file.

[0007] In this configuration, the morphological analysis means performs morphological analysis of the input text data using the dictionary file. The search means searches the words divided by the morphological analysis means for a word that matches the keyword accepted by the keyword accepting means, and outputs the search result. The determination means determines whether the input keyword is a word not registered in the dictionary. If it is not registered in the dictionary, the registration means registers the keyword in the dictionary.

[0008] Further, the registration means is characterized by being means for registering in the dictionary file, as a word, a keyword that the keyword accepting means has accepted a predetermined number of times or more.

[0009] In this configuration, when the keyword accepting means has accepted a keyword not registered in the dictionary a predetermined number of times or more, the keyword is registered in the dictionary file as a word.

[0010] Further, the device is characterized by comprising index creation means for creating, based on the result of the morphological analysis, an index in which each word is registered together with position information indicating the position at which the word exists in the input text data, the search means including means for outputting the search result with the position information included.

[0011] In this configuration, the index creation means creates an index in which each word and position information indicating the position of the word in the text data are registered. The search means then outputs the search result with the position information included.
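As a rough illustration of an index that carries position information, the sketch below records a (text id, character offset) pair for every word. The text does not specify the format of the position information, so the use of character offsets, the function name `build_positional_index`, and the pre-segmented sample input are all assumptions.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map each word to a list of (text id, character offset) pairs.

    `documents` maps a text id to its already-segmented word list.
    """
    index = defaultdict(list)
    for text_id, words in documents.items():
        offset = 0
        for word in words:
            index[word].append((text_id, offset))
            offset += len(word)
    return index


# Hypothetical segmentation result for one text; character offsets stand in
# for the "position information".
documents = {
    "A": ["ら致", "の", "容疑", "で", "逮捕", "さ", "れ", "た"],
}
index = build_positional_index(documents)
print(index["容疑"])
print(index["逮捕"])
```

A search result built from such an index can report where in the text the keyword occurs, not only which text contains it.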

[0012] Further, the inventions described in claims 4 to 6 recast the text data search devices described in claims 1 to 3 as method inventions.

[0013]

[Embodiments of the invention] FIG. 1 is a block diagram showing the functions of a text data search device according to an embodiment of the present invention. The text data search device 1 includes a text data storage unit 2, a morphological analysis unit 3, a dictionary file 4, an index creation unit 5, an index storage unit 6, a keyword input unit 7, a text data search unit 8, and an unregistered word registration unit 9. The text data storage unit 2 stores a plurality of text data. The morphological analysis unit 3 reads text data stored in the text data storage unit 2 and performs morphological analysis, which divides the text data into words. The dictionary file 4 stores, for each word, its character string in association with attributes such as its part of speech. The index creation unit 5 creates an index based on the result of the morphological analysis. The index storage unit 6 stores the created index. The keyword input unit 7 accepts keyword input. The text data search unit 8 searches whether an accepted keyword is present in the index stored in the index storage unit 6. The unregistered word registration unit 9 registers unregistered words in the dictionary file 4. The dictionary file 4 stores the part of speech of each word so that the connectivity of the divided words can be checked against the result of the morphological analysis. For example, by defining connection relationships between parts of speech in advance as rules, such as the rule that an adjective never follows a noun, the morphological analysis of the character string 「考えない」 (kangaenai, "does not think") yields the single phrase "考え (lower ichidan verb) + ない (auxiliary verb)" rather than the two phrases "考え (noun) + ない (adjective)".
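The part-of-speech connection check can be sketched as a filter over candidate analyses. Only the rule that an adjective does not follow a noun comes from the text; the data layout, the name `FORBIDDEN_CONNECTIONS`, and the English part-of-speech labels are assumptions for illustration.

```python
# Part-of-speech pairs (left, right) that may not be adjacent.
# Only the noun -> adjective rule is taken from the text.
FORBIDDEN_CONNECTIONS = {("noun", "adjective")}


def is_connectable(analysis):
    """Return True if no adjacent pair of parts of speech is forbidden."""
    tags = [pos for _, pos in analysis]
    return all(pair not in FORBIDDEN_CONNECTIONS for pair in zip(tags, tags[1:]))


# Two candidate analyses of 「考えない」.
candidates = [
    [("考え", "noun"), ("ない", "adjective")],
    [("考え", "ichidan verb"), ("ない", "auxiliary verb")],
]
valid = [c for c in candidates if is_connectable(c)]
print(valid)
```

Only the verb plus auxiliary verb reading survives the filter, matching the outcome described in the paragraph.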

[0014] The operation of the text data search device 1 of this embodiment is described in detail below. Assume that the words shown in FIG. 2 are registered in the dictionary file 4 and that 「ら致」 is an unregistered word, not registered in the dictionary file 4. As shown in FIG. 2, the dictionary file 4 stores, for each word, its character string in association with its part of speech. FIG. 3 is a flowchart showing the process by which the text data search device creates an index of text data. The morphological analysis unit 3 reads the designated text data stored in the text data storage unit 2 (n1). The morphological analysis unit 3 then uses the dictionary file 4 to perform morphological analysis, dividing the text data read in n1 into words (n2). For example, the text data A, 「ら致の容疑で逮捕された。」, is divided into nine words by the morphological analysis of n2, as shown in FIG. 4. In this morphological analysis, the part of speech of a word not registered in the dictionary file (「ら」 and 「致」 in this example) is set to "unregistered".

[0015] The index creation unit 5 detects, among the words divided by the morphological analysis, those whose part of speech is noun or unregistered (n3), and creates an index (n4) (see FIG. 5). In the index, text information indicating the text data that contained each word is registered in association with the word. The index created by the index creation unit 5 is then stored in the index storage unit 6, and the process ends (n5).
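Steps n3 and n4 can be sketched as follows: keep only the words tagged as noun or unregistered, and map each to the texts that contain it. The figures are not reproduced in this text, so the analysis result `analysis_a`, the part-of-speech labels, and the dictionary-of-sets layout of the index are assumptions.

```python
def build_index(analyses):
    """Index words whose part of speech is "noun" or "unregistered".

    `analyses` maps a text id to a list of (word, part of speech) pairs.
    """
    index = {}
    for text_id, tokens in analyses.items():
        for word, pos in tokens:
            if pos in ("noun", "unregistered"):  # n3: select indexable words
                index.setdefault(word, set()).add(text_id)  # n4: register them
    return index


# Assumed analysis result for text data A while 「ら致」 is still unregistered.
analysis_a = [
    ("ら", "unregistered"), ("致", "unregistered"), ("の", "particle"),
    ("容疑", "noun"), ("で", "particle"), ("逮捕", "noun"),
    ("さ", "verb"), ("れ", "auxiliary verb"), ("た", "auxiliary verb"),
]
index = build_index({"A": analysis_a})
print(index)
```

Note that the fragments 「ら」 and 「致」 enter the index separately, so a lookup for 「ら致」 still finds nothing at this stage.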

[0016] As described above, the text data search device 1 of this embodiment divides text data into words by morphological analysis and registers in the index the words whose part of speech is noun and the words that were not registered in the dictionary file 4. The index stores each word in association with text information indicating the text data that contained the word.

[0017] Next, the process of searching text data with an input keyword is described. FIG. 6 is a flowchart showing the search process. When a document search keyword is input from the keyword input unit 7 (n11), the text data search unit 8 determines whether the input keyword is registered in the index stored in the index storage unit 6 (n12). If the keyword is registered in the index, the text data search unit 8 reads from the index the text information stored in association with the keyword (n13). The text data search unit 8 then outputs the text information it has read (n14), and the process ends.

[0018] For example, suppose that the keyword 「容疑」 (yougi, "suspicion") is input while the index storage unit 6 holds the index shown in FIG. 5. Because the word 「容疑」 is registered in the index, the text data search unit 8 outputs, as the search result, the text information "text data A" stored in association with the word 「容疑」. With this search result, the text data containing the keyword 「容疑」 can easily be read out. When the text data is displayed on a display unit as shown in FIG. 7(A), underlining only the keyword in the text data A to distinguish it from the rest makes it easy to find the keyword in the displayed text data A. Alternatively, the keyword may be shown in reverse video to distinguish it from the rest (see FIG. 7(B)).

[0019] If the input keyword is not registered in the index, the text data search unit 8 passes the keyword to the unregistered word management unit 9. The unregistered word management unit 9 determines whether the keyword is a word registered in the dictionary file 4 (n15). If it determines in n15 that the input word is registered in the dictionary file 4, it outputs the message "There is no text data containing the keyword." as the search result (n16), and the process ends.

[0020] For example, suppose that the words shown in FIG. 2 are registered in the dictionary file 4, that the index storage unit 6 holds the index shown in FIG. 5, and that the keyword 「監禁」 (kankin, "confinement") is input. The word 「監禁」 is not registered in the index but is registered in the dictionary file 4, so the message "There is no text data containing the keyword (「監禁」)." is output as the search result.

[0021] If it is determined in n15 that the input keyword is an unregistered word, not registered in the dictionary file 4, the unregistered word management unit 9 registers the keyword in the dictionary file 4 (n17). The message "The keyword is not registered in the dictionary, so this keyword cannot be searched. The keyword has just been registered in the dictionary." is then displayed as the search result, and the process ends (n18).

[0022] For example, suppose that the words shown in FIG. 2 are registered in the dictionary file 4, that the index storage unit 6 holds the index shown in FIG. 5, and that the keyword 「ら致」 is input. Because the word 「ら致」 is registered in neither the index nor the dictionary file 4, the message "The keyword (「ら致」) is not registered in the dictionary, so this keyword cannot be searched. The keyword has just been registered in the dictionary." is displayed as the search result. As a result of this process, the word 「ら致」 is newly registered in the dictionary file 4, as shown in FIG. 8. In this embodiment, the part of speech of a word registered in the dictionary file 4 in n17 is set to noun.
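The branches of the search flow (n12 to n18) can be summarized in one function. This is a minimal sketch: the `search` function, the English message strings, and the sample `index` and `dictionary` contents are assumptions; only the order of the checks and the registration as a noun follow the description.

```python
NOT_FOUND = "There is no text data containing the keyword."
NEWLY_REGISTERED = (
    "The keyword is not registered in the dictionary, so it cannot be searched. "
    "It has just been registered in the dictionary."
)


def search(keyword, index, dictionary):
    """Return (text ids, message) for one keyword, following steps n12 to n18."""
    if keyword in index:                     # n12: is the keyword in the index?
        return sorted(index[keyword]), None  # n13, n14: output the text information
    if keyword in dictionary:                # n15: is it a registered word?
        return [], NOT_FOUND                 # n16
    dictionary[keyword] = "noun"             # n17: register the keyword as a noun
    return [], NEWLY_REGISTERED              # n18


# Hypothetical index and dictionary contents.
index = {"ら": {"A"}, "致": {"A"}, "容疑": {"A"}, "逮捕": {"A"}}
dictionary = {"容疑": "noun", "逮捕": "noun", "監禁": "noun"}

hit = search("容疑", index, dictionary)
known = search("監禁", index, dictionary)
new = search("ら致", index, dictionary)
print(hit, known, new, sep="\n")
print(dictionary["ら致"])
```

The third call returns no texts but leaves 「ら致」 in the dictionary, so the word only becomes searchable after the index is rebuilt.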

[0023] When the index creation process is executed again on the text data A (「ら致の容疑で逮捕された。」), the morphological analysis divides it into eight words (see FIG. 9(A)), and the index shown in FIG. 9(B) is created. The text data can therefore now be searched by inputting 「ら致」 as the keyword.

[0024] Further, when the index creation process is executed on the text data B, the character string 「一般人がら致監禁された。」 ("A member of the public was abducted and confined."), the morphological analysis divides it into seven words as shown in FIG. 10(A) (assuming that the words shown in FIG. 8 are registered in the dictionary file 4), and the index is updated, changing from the state shown in FIG. 9(B) to that shown in FIG. 10(B). If 「ら致」 is then input as the keyword and the text data is searched, two items, the text data A and the text data B, are found as text data containing the keyword.
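The effect of re-indexing after registration can be checked end to end with a small sketch. Greedy longest-match segmentation again stands in for morphological analysis, and the dictionary contents are hypothetical; the eight-word and seven-word counts and the two matching texts are the values stated in the description.

```python
PUNCTUATION = "。、"


def segment(text, dictionary):
    """Greedy longest-match segmentation (a stand-in for morphological analysis)."""
    tokens, i = [], 0
    while i < len(text):
        if text[i] in PUNCTUATION:
            i += 1
            continue
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                tokens.append((text[i:j], dictionary[text[i:j]]))
                i = j
                break
        else:
            tokens.append((text[i], "unregistered"))
            i += 1
    return tokens


def build_index(texts, dictionary):
    """Index nouns and unregistered words by the ids of the texts containing them."""
    index = {}
    for text_id, text in texts.items():
        for word, pos in segment(text, dictionary):
            if pos in ("noun", "unregistered"):
                index.setdefault(word, set()).add(text_id)
    return index


# Hypothetical dictionary contents after 「ら致」 has been registered as a noun.
dictionary = {
    "ら致": "noun", "容疑": "noun", "逮捕": "noun", "監禁": "noun", "一般人": "noun",
    "の": "particle", "で": "particle", "が": "particle",
    "さ": "verb", "れ": "auxiliary verb", "た": "auxiliary verb",
}
texts = {"A": "ら致の容疑で逮捕された。", "B": "一般人がら致監禁された。"}

count_a = len(segment(texts["A"], dictionary))
count_b = len(segment(texts["B"], dictionary))
index = build_index(texts, dictionary)
print(count_a, count_b)
print(sorted(index["ら致"]))
```

With 「ら致」 in the dictionary it is segmented as one word in both texts, so a lookup returns both A and B.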

[0025] As described above, if a keyword input at search time is a word not registered in the dictionary file 4, the text data search device 1 of this embodiment registers the keyword in the dictionary file 4 as a word. From the next time onward, searches using the registered word are therefore possible. In addition, because information indicating the text data containing the keyword is output as text information in the search result, the text data containing the keyword can be found easily.

[0026] In the embodiment described above, an input keyword that is not registered in the dictionary file 4 is registered in the dictionary file 4. Alternatively, words may be registered in the dictionary file 4 by the following process. In this example, the unregistered word management unit 9 includes an auxiliary dictionary 9a. As shown in FIG. 11, the auxiliary dictionary 9a stores each keyword in association with the number of times that keyword has been input. When the text data search device 1 detects that a keyword not registered in the dictionary file 4 has been input from the keyword input unit 7, it determines whether the keyword is registered in the auxiliary dictionary 9a. If the keyword is registered in the auxiliary dictionary 9a, the count stored for it is incremented by 1. If the keyword is not registered in the auxiliary dictionary 9a, the keyword is registered in the auxiliary dictionary 9a and its count is set to 1. In other words, when a word not registered in the dictionary file 4 is input as a keyword, the keyword is registered in the auxiliary dictionary 9a, which records the number of times each registered keyword has been input. Then, at a predetermined time such as startup, the text data search device 1 registers in the dictionary file 4 any keyword in the auxiliary dictionary 9a whose input count is equal to or greater than a predetermined number. A keyword registered in the auxiliary dictionary 9a is deleted from the auxiliary dictionary 9a when it is registered in the dictionary.
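The auxiliary dictionary variant can be sketched as a counter with a promotion step. The class name `AuxiliaryDictionary`, the threshold of 3, the sample inputs, and the choice to register promoted keywords as nouns are assumptions made for illustration.

```python
class AuxiliaryDictionary:
    """Counts unregistered keywords and promotes frequent ones to the main dictionary."""

    def __init__(self, threshold):
        self.threshold = threshold  # the "predetermined number of times"
        self.counts = {}

    def record(self, keyword, dictionary):
        """Count one input of a keyword that is absent from the main dictionary."""
        if keyword not in dictionary:
            self.counts[keyword] = self.counts.get(keyword, 0) + 1

    def promote(self, dictionary):
        """Run at a predetermined time (e.g. startup): register frequent keywords."""
        for keyword, count in list(self.counts.items()):
            if count >= self.threshold:
                dictionary[keyword] = "noun"  # assumed part of speech
                del self.counts[keyword]      # remove from the auxiliary dictionary


dictionary = {"容疑": "noun"}           # hypothetical main dictionary
aux = AuxiliaryDictionary(threshold=3)  # the threshold value is an assumption
for keyword in ["ら致", "ら致", "らち", "ら致"]:
    aux.record(keyword, dictionary)
aux.promote(dictionary)
print(dictionary)
print(aux.counts)
```

The keyword entered three times is promoted and removed from the auxiliary dictionary, while the one-off entry 「らち」 stays behind with a count of 1 and never reaches the main dictionary unless it is entered again.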

[0027] That is, in this example, a word not registered in the dictionary file 4 is not registered in the dictionary file 4 as soon as it is input as a keyword. The keyword is registered in the dictionary file 4 once the number of times it has been input reaches the predetermined number. Because only keywords that have been used repeatedly are registered in the dictionary file 4, a mistyped keyword is not registered in the dictionary file 4. This prevents unnecessary words (mistyped keywords) from being registered in the dictionary file 4.

[0028]

[Effects of the invention] As described above, according to the present invention, if an input search keyword is a word not registered in the dictionary, it is automatically registered in the dictionary even if it is a keyword that mixes character types such as hiragana and kanji.

[0029] Further, because a keyword is registered in the dictionary only when the number of times it has been input reaches a predetermined number, mistyped keywords are prevented from being needlessly registered in the dictionary.

[0030] Furthermore, the position information output as part of the search result, which indicates where the input keyword occurs, makes it easy to read out the text data containing the keyword.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of a text data search device according to an embodiment of the present invention.

FIG. 2 is a diagram showing the structure of the dictionary.

FIG. 3 is a flowchart showing the process of creating an index of text data.

FIG. 4 is a diagram showing an example of a morphological analysis result.

FIG. 5 is a diagram showing the contents of the created index.

FIG. 6 is a flowchart showing the search process.

FIG. 7 is a diagram showing display examples of text data.

FIG. 8 is a diagram showing the structure of the dictionary after 「ら致」 has been newly registered.

FIG. 9 is a diagram showing a morphological analysis result and the updated index.

FIG. 10 is a diagram showing a morphological analysis result and the updated index.

FIG. 11 is a diagram showing the structure of the auxiliary dictionary.

[Explanation of symbols]

1: text data search device
2: text data storage unit
3: morphological analysis unit
4: dictionary
5: index creation unit
6: index storage unit
7: keyword input unit
8: text data search unit
9: unregistered word registration unit
9a: auxiliary dictionary

Claims

[Claims]

1. A text data search device comprising: morphological analysis means for performing morphological analysis of input text data using a dictionary file; keyword accepting means for accepting input of a keyword; and search means for searching the morphologically analyzed text data for a word matching the keyword accepted by the keyword accepting means and outputting a search result, the text data search device further comprising: determining means for determining whether the keyword accepted by the keyword accepting means is a word registered in the dictionary file; and registration means for registering the keyword as a word in the dictionary file when the accepted keyword is a word not registered in the dictionary file.

2. The text data search device according to claim 1, wherein the registration means is means for registering in the dictionary file, as a word, a keyword that the keyword accepting means has accepted a predetermined number of times or more.

3. The text data search device according to claim 1, further comprising index creation means for creating, based on the result of the morphological analysis, an index in which each word is registered together with position information indicating the position at which the word exists in the input text data, wherein the search means includes means for outputting the search result with the position information included.

4. A text data search method comprising performing morphological analysis of input text data using a dictionary file, searching the morphologically analyzed text data for a word matching an input keyword, and outputting a search result, wherein, when the accepted keyword is a word not registered in the dictionary file, the keyword is registered as a word in the dictionary file.

5. The text data search method according to claim 4, wherein words that are received as a keyword less than a predetermined number of times are not registered in the dictionary file.

6. The text data search method according to claim 4, wherein an index is created in which, based on a result of the morphological analysis, each word is registered together with position information indicating the position at which the word exists in the input text data, and the position information is also output as part of the search result.