JPH11328195A

JPH11328195A - Character string retrieving device

Info

Publication number: JPH11328195A
Application number: JP10130797A
Authority: JP
Inventors: Kazuhiro Tajima; 一弘田島
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1998-05-13
Filing date: 1998-05-13
Publication date: 1999-11-30
Anticipated expiration: 2018-05-13
Also published as: JP3314720B2

Abstract

PROBLEM TO BE SOLVED: To provide a character string retrieving device capable of quickly retrieving documents that contain a designated character string. SOLUTION: This device is provided with an input device 10 for inputting a character string to be retrieved; a storing part 30 having a word dictionary 32 storing words, an unnecessary word dictionary 34 storing information on words that should not be matched in a retrieval result, and a connection information storing part 36 storing connection information showing the strength of connection between words; a character string analyzing part 20 that divides the character string input from the input device 10 into words based on the word dictionary 32 and the unnecessary word dictionary 34 and combines the words into a compound word based on the connection information; and a retrieving processing part 40 that retrieves the compound word from a document data base 50.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a character string search device, and more particularly to a character string search device that searches for a specific character string within arbitrary character strings.

[0002]

2. Description of the Related Art With the advance of today's highly information-oriented society, techniques for rapidly searching for a given character string in a database that stores a huge amount of data on a computer have become extremely important. In patent information retrieval, for example, the quality and quantity of the documents found depend on the search method.

[0003] Therefore, when performing a search, the searcher must specify appropriate keywords. In general, the most basic search method is to narrow down the search targets step by step using AND searches and OR searches on short keywords, for example keywords consisting of a single word.

[0004]

[Problems to be solved by the invention] Conventionally, when a long character string is searched, whatever method is used to split it (n-character segmentation, morphological analysis, and so on), the pieces obtained after splitting are handled by an OR search. As a result, any document that contains even one of the words becomes a hit, and the semantically close documents must then be narrowed down further from an enormous number of documents. In other words, the conventional approach is not a long character string search but merely a search for several words, which takes a long time and is laborious. Moreover, even if the documents that match more of the words are retrieved, the connection between words is not taken into account, so semantically unrelated documents are retrieved and the useful documents containing the specified keywords are buried among the others.
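The problem described in paragraph [0004] can be illustrated with a minimal Python sketch. The three documents, the word list, and the function names `or_search` and `phrase_search` are hypothetical and are not taken from the patent; the sketch only contrasts an OR search over split words with a search for the whole string.

```python
# Hypothetical mini document database (not from the patent).
documents = {
    "d1": "東京外国為替市場で円が上昇した",  # about the Tokyo foreign exchange market
    "d2": "東京の市場で野菜を買った",        # about buying vegetables at a market in Tokyo
    "d3": "外国の切手を集める",              # about collecting foreign stamps
}

# The long query string after it has been split into words.
words = ["東京", "外国", "為替", "市場"]


def or_search(documents, words):
    """Return the ids of documents that contain at least one of the words."""
    return sorted(doc_id for doc_id, text in documents.items()
                  if any(word in text for word in words))


def phrase_search(documents, phrase):
    """Return the ids of documents that contain the whole phrase."""
    return sorted(doc_id for doc_id, text in documents.items() if phrase in text)


print(or_search(documents, words))                  # every document is a hit
print(phrase_search(documents, "東京外国為替市場"))  # only the relevant document
```

In this toy data set the OR search returns all three documents, two of which are unrelated to the query, while the search for the unsplit string returns only the relevant one.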

[0005] SUMMARY OF THE INVENTION The present invention has been made in view of the above circumstances, and its object is to provide a character string search device that can quickly search for documents containing a specified character string.

[0006]

[Means for solving the problems] To solve the above problems, the present invention is characterized by comprising: an input device for inputting a character string to be searched; a storage unit having a word dictionary that stores words, an unnecessary word dictionary that stores information on words that should not be matched in search results, and a connection information storage unit that stores connection information indicating the strength of the connection between words; a character string analysis unit that divides the character string input from the input device into words on the basis of the word information and the unnecessary word information, and combines the words into compound words on the basis of the connection information; and a search processing unit that searches a document database for the compound words. The present invention is further characterized in that the storage unit additionally includes a usage frequency information storage unit that stores usage frequency information indicating the word connections whose strength has exceeded a certain value. The present invention is further characterized in that the character string analysis unit forms compound words from the words on the basis of the usage frequency information.

[0007]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS A character string search device according to one embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the character string search device according to one embodiment of the present invention.

[0008] As shown in FIG. 1, the character string search device according to one embodiment of the present invention comprises an input device 10 having a keyboard, a pen, or the like; a character string analysis unit 20 that analyzes the characters input from the input device 10; a storage unit 30 that stores word information; a search processing unit 40 that retrieves documents using the analysis result strings output from the character string analysis unit 20; a document database 50; and an output device 60 that outputs the results.

[0009] The storage unit 30 comprises a word dictionary 32, an unnecessary word dictionary 34, a connection information storage unit 36, and a usage frequency information storage unit 38. The word dictionary 32 stores various words together with their part-of-speech information. The connection information storage unit 36 is paired with the word dictionary 32 and stores connection information that expresses, as a numerical value, the strength of the connection between words and between parts of speech. The unnecessary word dictionary 34 stores information on words that should not be matched in search results.

[0010] The usage frequency information storage unit 38 stores usage frequency information indicating the connections between parts of speech whose strength has exceeded a certain value. This usage frequency information is used as supplementary information to the connection information: when words with the same connection strength appear in a later analysis, the analysis refers to it.
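One possible in-memory layout for the storage unit 30 is sketched below in Python. The class name `Storage`, the field names, the choice of keying connection and usage information by adjacent word pairs, and the sample values are all assumptions made for illustration; the patent does not specify a data format.

```python
from dataclasses import dataclass, field


@dataclass
class Storage:
    """Hypothetical model of storage unit 30 (layout assumed, not from the patent)."""
    # Word dictionary 32: word -> part of speech.
    word_dictionary: dict = field(default_factory=dict)
    # Unnecessary word dictionary 34: words that should not be matched.
    unnecessary_words: set = field(default_factory=set)
    # Connection information storage unit 36: (left word, right word) -> strength.
    connection_strength: dict = field(default_factory=dict)
    # Usage frequency information storage unit 38: (left word, right word) -> count.
    usage_frequency: dict = field(default_factory=dict)

    def strength(self, left, right):
        """Connection strength between two adjacent words (0.0 if unregistered)."""
        return self.connection_strength.get((left, right), 0.0)


# Sample contents; the numbers are invented for the example.
storage = Storage(
    word_dictionary={"為替": "noun", "市場": "noun", "円": "noun"},
    unnecessary_words={"の", "に"},
    connection_strength={("為替", "市場"): 4.0},
)
print(storage.strength("為替", "市場"))
print(storage.strength("円", "市場"))
```

Keying both tables by word pair keeps the usage frequency information directly usable as supplementary information to the connection information, as paragraph [0010] describes.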

[0011] The character string analysis unit 20 determines the positions at which to split the character string input from the input device 10 according to the word dictionary 32 and the unnecessary word dictionary 34. For each resulting word, it looks up the connection strength in the connection information storage unit 36 and extracts, as a compound word, any sequence of words whose connection strength is at or above a certain value. A word whose connection strength falls short of that value is extracted as a single word. In addition, once the connection strength of a sequence has exceeded the threshold and the sequence has been treated as a compound word, usage frequency information for it is added to the usage frequency information storage unit 38.
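The splitting step can be sketched in Python as follows. The patent does not say how split positions are computed from the two dictionaries, so the greedy longest-match strategy used here is an assumption, as are the function name `segment` and the dictionary contents.

```python
def segment(text, word_dictionary, unnecessary_words):
    """Split text by greedy longest match against both dictionaries.

    Characters that start no known word become one-character tokens.
    Longest match is an assumed strategy, not one named in the patent.
    """
    vocabulary = set(word_dictionary) | set(unnecessary_words)
    longest = max((len(word) for word in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink down to one character.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocabulary or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical dictionary contents chosen to fit the patent's example string.
word_dictionary = {"東京": "noun", "外国": "noun", "為替": "noun",
                   "市場": "noun", "円": "noun", "相場": "noun"}
unnecessary_words = {"の", "に", "ついて"}

tokens = segment("東京外国為替市場の円の相場について", word_dictionary, unnecessary_words)
print(tokens)
```

With these dictionary entries the example string is split into the same ten words that the description gives for FIG. 3(c).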

[0012] The search processing unit 40 receives the compound words and words obtained by the character string analysis unit 20 and searches the document database 50 for matching documents. The compound words obtained from the character string analysis unit 20 are themselves split further by the character string analysis unit 20 and searched again. This process is repeated until only single words remain. Finally, information on all the matched documents is output by the output device 60, such as a display, as the search result.
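The repeated split-and-search loop of paragraph [0012] might look like the following Python sketch. The substring-based `search`, the `split` callback, the split table, and the documents are hypothetical stand-ins for the document database 50 and the character string analysis unit 20.

```python
def search(documents, term):
    """Return the ids of documents whose text contains the term."""
    return {doc_id for doc_id, text in documents.items() if term in text}


def search_with_resplitting(documents, terms, split):
    """Search each term, then split compound terms and search the parts.

    `split(term)` is an assumed callback that returns the smaller parts of a
    compound word, or a one-element list when the term cannot be split.
    """
    matched = set()
    pending = list(terms)
    while pending:
        term = pending.pop(0)
        matched |= search(documents, term)
        parts = split(term)
        if len(parts) > 1:  # still a compound word: analyze it again
            pending.extend(parts)
    return matched


# Hypothetical split table and documents.
split_table = {
    "東京外国為替市場": ["東京", "外国為替", "為替市場"],
    "外国為替": ["外国", "為替"],
    "為替市場": ["為替", "市場"],
}
documents = {
    "d1": "東京外国為替市場は円高",
    "d2": "為替市場の動向",
    "d3": "天気予報",
}
result = search_with_resplitting(
    documents, ["東京外国為替市場", "円"], lambda term: split_table.get(term, [term]))
print(sorted(result))
```

The loop ends because every split produces strictly shorter terms, so each compound word eventually reduces to single words.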

[0013] Next, the operation of the character string search device with the above configuration is described in detail. FIG. 2 is a flowchart showing the operation of the character string search device according to one embodiment of the present invention.

[0014] First, when a character string serving as a keyword is input from the input device 10 (step SA1), the input character string is passed to the character string analysis unit 20. The character string analysis unit 20 analyzes whether the passed character string is contained in the word dictionary 32 or the unnecessary word dictionary 34 (step SA2). Each word falls into one of three cases: it exists in the word dictionary 32, it exists in the unnecessary word dictionary 34, or it exists in neither.
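The three-way classification of step SA2 can be sketched in Python as below. The enum `WordClass`, the function `classify`, and the rule that the word dictionary is checked first are assumptions; the patent only states that three outcomes exist.

```python
from enum import Enum


class WordClass(Enum):
    WORD = 1         # found in the word dictionary 32
    UNNECESSARY = 2  # found in the unnecessary word dictionary 34
    UNKNOWN = 3      # found in neither dictionary


def classify(token, word_dictionary, unnecessary_words):
    """Classify one token; checking the word dictionary first is an assumption."""
    if token in word_dictionary:
        return WordClass.WORD
    if token in unnecessary_words:
        return WordClass.UNNECESSARY
    return WordClass.UNKNOWN


# Hypothetical dictionary contents.
word_dictionary = {"東京": "noun", "円": "noun"}
unnecessary_words = {"の", "に"}

for token in ["東京", "の", "ドル"]:
    print(token, classify(token, word_dictionary, unnecessary_words).name)
```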

[0015] When a word exists in the word dictionary 32, the character string analysis unit 20 processes that word: it records whether the words immediately before and after it exist in the unnecessary word dictionary 34. If neither the preceding nor the following word is in the unnecessary word dictionary, processing of that word ends.

[0016] Next, for sequences of consecutive words that exist in the word dictionary 32, the character string analysis unit 20 checks the connection strength between the words using the connection information stored in the connection information storage unit 36 (step SA4). The connection strength values are summed, and any sequence whose total is at or above a certain value is extracted as a compound word.
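The summing rule of paragraph [0016] can be sketched in Python as follows. The function name `extract_terms`, the strength values, and the thresholds are invented for illustration, and dropping unnecessary words while keeping unknown words as single terms is a simplifying assumption.

```python
def extract_terms(tokens, word_dictionary, unnecessary_words,
                  connection_strength, threshold):
    """Join runs of dictionary words into a compound word when the summed
    connection strength of the run reaches the threshold."""
    terms = []
    run = []

    def flush():
        # Sum the strengths of all adjacent pairs in the current run.
        total = sum(connection_strength.get(pair, 0) for pair in zip(run, run[1:]))
        if len(run) > 1 and total >= threshold:
            terms.append("".join(run))  # strong enough: one compound word
        else:
            terms.extend(run)           # too weak: keep the single words
        run.clear()

    for token in tokens:
        if token in word_dictionary:
            run.append(token)
        else:
            flush()
            if token not in unnecessary_words:
                terms.append(token)     # word found in neither dictionary
    flush()
    return terms


# Hypothetical data; the strength values are invented.
tokens = ["東京", "外国", "為替", "市場", "の", "円", "の", "相場", "に", "ついて"]
word_dictionary = {"東京", "外国", "為替", "市場", "円", "相場"}
unnecessary_words = {"の", "に", "ついて"}
connection_strength = {("東京", "外国"): 3, ("外国", "為替"): 5, ("為替", "市場"): 4}

print(extract_terms(tokens, word_dictionary, unnecessary_words, connection_strength, 10))
print(extract_terms(tokens, word_dictionary, unnecessary_words, connection_strength, 20))
```

With a threshold of 10 the four leading words total 12 and form one compound word; with a threshold of 20 they remain single words.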

[0017] As a result of the above processing, the character string input from the input device 10 is divided into the following five categories:
(1) compound words obtained from the connection strength
(2) single words whose connection strength was weak
(3) single words whose neighboring words are not in the word dictionary
(4) words obtained from the unnecessary word dictionary
(5) words that do not exist in the dictionaries

[0018] Of these, the words in categories (1), (2), (3), and (5) are extracted and passed to the search processing unit 40. On receiving them, the search processing unit 40 searches the document database 50 (step SA4) and obtains information on the matching documents.

[0019] The words in category (1) are searched after the analysis process (step SA2 in FIG. 2) has been applied to them again. This is repeated until no words of category (1) remain. The search processing unit 40 merges all the results and performs the search processing (step SA3), and outputs the search result to the output device 60. When the search result is output from the search processing unit 40, the output device 60 outputs it (step SA5).

[0020] Next, one embodiment of the present invention is described in more detail using a specific example. FIG. 3 is a diagram for explaining the specific example in detail. Suppose that the character string 「東京外国為替市場の円の相場について」 ("about the yen exchange rate in the Tokyo foreign exchange market"; FIG. 3(b)) is input to the input device 10 in FIG. 1. This character string is passed from the input device 10 to the character string analysis unit 20.

[0021] The character string analysis unit 20 splits the received character string on the basis of the contents of the word dictionary 32 and the unnecessary word dictionary 34. Suppose that the result is as shown in FIG. 3(c); that is, the string is split into the words 「東京」 (Tokyo), 「外国」 (foreign), 「為替」 (exchange), 「市場」 (market), 「の」 (of), 「円」 (yen), 「の」 (of), 「相場」 (rate), 「に」, and 「ついて」 (the last two together meaning "about"). When this is finished, each word is classified into one of three types: in the word dictionary 32, in the unnecessary word dictionary 34, or in neither. In the example of FIG. 3(c), words in the word dictionary 32 are marked with the symbol "1" and words in the unnecessary word dictionary 34 with the symbol "2".

[0022] Then, for each word, the classification of the neighboring words is consulted. 「東京」, 「外国」, 「為替」, and 「市場」 are judged, from the connection information stored in the connection information storage unit 36, to be likely to connect, and are extracted as a compound word (FIG. 3(d)). The words 「円」 (yen) and 「相場」 (rate) are each surrounded by words in the unnecessary word dictionary 34 (「の」, 「の」, and 「に」), so each is treated as a single word (FIG. 3(e)).

[0023] As a result of this analysis by the character string analysis unit 20, the words shown in FIG. 3(f) are passed to the search processing unit 40, which searches the document database 50 on the basis of these words. When the search is complete, the search processing unit 40 takes the documents obtained as the first document group.

[0024] Next, the character string analysis unit 20 analyzes the combined compound word (in this case 「東京外国為替市場」, "Tokyo foreign exchange market") again. As before, this analysis uses the word dictionary 32 and the connection information stored in the connection information storage unit 36, and it sets the split points within the compound word. Suppose that the stored information causes the compound word to be split into 「東京」 (Tokyo), 「外国為替」 (foreign exchange), and 「為替市場」 (exchange market) (FIG. 3(e)). Such splits are obtained by registering the connections between words in advance.

[0025] Based on this result, the search processing unit 40 searches again and ranks the results. The documents obtained are taken as the second document group, and those found with the words shown in FIG. 3(f) are taken as the third document group. Words already searched for in an earlier document group (the word 「東京」 in FIG. 3(e)) are excluded at this point. Repeating this process until every word has been split off yields the third, fourth, ..., n-th document groups. Finally, the first through n-th document groups are output in order from the output device 60 as the search result.
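The staged ranking of paragraphs [0023] to [0025] can be sketched in Python as follows. The function `ranked_groups`, the stage lists, and the documents are hypothetical. Skipping terms that were already searched follows the description; listing each document only in the first group where it matches is an added assumption.

```python
def ranked_groups(documents, stages):
    """Search stage by stage, from the coarsest split to the finest.

    Terms searched in an earlier stage are skipped, and each document is
    reported only in the first group where it matches (an assumption).
    """
    searched_terms = set()
    seen_documents = set()
    groups = []
    for terms in stages:
        new_terms = [term for term in terms if term not in searched_terms]
        searched_terms.update(new_terms)
        group = [doc_id for doc_id, text in documents.items()
                 if doc_id not in seen_documents
                 and any(term in text for term in new_terms)]
        seen_documents.update(group)
        groups.append(group)
    return groups


# Hypothetical documents and split stages modelled on the patent's example.
documents = {
    "d1": "東京外国為替市場の円相場",
    "d2": "外国為替の入門",
    "d3": "東京の観光",
    "d4": "市場調査",
}
stages = [
    ["東京外国為替市場", "円", "相場"],  # first document group
    ["東京", "外国為替", "為替市場"],    # second document group
    ["東京", "外国", "為替", "市場"],    # third group; 「東京」 is skipped
]
print(ranked_groups(documents, stages))
```

Documents matched by the longest terms come first, so the output order reflects how closely each document matches the original input.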

[0026] Next, another embodiment of the present invention is described. This embodiment differs from the one described above in that, as information related to the word dictionary 32, it uses not only the connection information stored in the connection information storage unit 36 but also the usage frequency information stored in the usage frequency information storage unit 38.

[0027] When a character string is input from the input device 10, the character string analysis unit 20 classifies it into compound words, words, and everything else using the word connection information stored in the connection information storage unit 36. At this point, the usage frequency value for each word sequence obtained as a compound word is increased. Conversely, if a compound word is not found in any document, its usage frequency value is decreased. The usage frequency information is used in later searches when a compound word of three or more words appears and that compound word does not exist in the documents.
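The bookkeeping in paragraph [0027] can be sketched in Python as below. The patent does not state what the usage frequency is keyed on or by how much it changes, so keying by adjacent word pair, the step size of 1, and the function names `record_compound` and `record_miss` are assumptions.

```python
from collections import Counter


def record_compound(usage_frequency, parts):
    """Increase the count of every adjacent pair in a newly formed compound word."""
    for pair in zip(parts, parts[1:]):
        usage_frequency[pair] += 1


def record_miss(usage_frequency, parts):
    """Decrease the counts when the compound word was found in no document."""
    for pair in zip(parts, parts[1:]):
        usage_frequency[pair] -= 1


usage_frequency = Counter()

# A four-word compound is formed but later found in no document.
record_compound(usage_frequency, ["東京", "外国", "為替", "市場"])
record_miss(usage_frequency, ["東京", "外国", "為替", "市場"])

# A two-word compound is formed and does appear in the documents.
record_compound(usage_frequency, ["為替", "市場"])

print(usage_frequency[("為替", "市場")])
print(usage_frequency[("外国", "為替")])
```

After these calls the pair 「為替」「市場」 has a higher count than 「外国」「為替」, which is the kind of difference the later splitting decision relies on.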

[0028] For example, suppose that a compound word consisting of three words is obtained from the input, and call the words "1", "2", and "3". If the compound word "123" is not found in any document, the embodiment described above splits it into "1", "2", and "3" and searches again, so documents that contain only "1", only "2", or only "3" also match. With the usage frequency information obtained beforehand, if the connection "12" is more frequent than the connection "23", then "12" and "3" are made the search targets, so that documents closer to the input can be found than those that contain just one of "1", "2", and "3".
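The three-word case of paragraph [0028] can be sketched in Python as follows. The function `split_by_frequency` and the frequency values are hypothetical, and falling back to single words when no pair is strictly the most frequent is an assumption the patent does not address.

```python
def split_by_frequency(parts, usage_frequency):
    """Keep the most frequently used adjacent pair joined and split the rest.

    When there is a tie for the highest frequency, the word is split into
    single words (an assumed fallback).
    """
    if len(parts) < 3:
        return list(parts)
    pairs = list(zip(parts, parts[1:]))
    counts = [usage_frequency.get(pair, 0) for pair in pairs]
    best = max(counts)
    if counts.count(best) != 1:
        return list(parts)
    i = counts.index(best)
    return list(parts[:i]) + [parts[i] + parts[i + 1]] + list(parts[i + 2:])


# "12" is used more often than "23", so "12" and "3" become the search targets.
print(split_by_frequency(["1", "2", "3"], {("1", "2"): 5, ("2", "3"): 2}))
# The opposite case keeps "23" together.
print(split_by_frequency(["1", "2", "3"], {("1", "2"): 1, ("2", "3"): 4}))
# No usage information: split into single words.
print(split_by_frequency(["1", "2", "3"], {}))
```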

[0029] Next, the operation of this embodiment is described using a more specific example. FIG. 4 is a diagram for explaining a specific example of the other embodiment of the present invention in detail. First, as shown in FIG. 4(b), consider the case where the character string 「東京外国為替市場の円の相場について」 ("about the yen exchange rate in the Tokyo foreign exchange market") is input from the input device 10. When this character string is input to the character string analysis unit 20, the character string analysis unit 20 splits it on the basis of the word dictionary 32 and the unnecessary word dictionary 34. As shown in FIG. 4(c), the string is split into the words 「東京」, 「外国」, 「為替」, 「市場」, 「の」, 「円」, 「の」, 「相場」, 「に」, and 「ついて」.

[0030] Next, each of the split words is classified into one of three types: in the word dictionary 32, in the unnecessary word dictionary 34, or in neither. Then, looking at the classified words, the words 「東京」, 「外国」, 「為替」, and 「市場」 are judged, from the connection information stored in the connection information storage unit 36, to be likely to connect, and are extracted as a compound word (FIG. 4(d)).

[0031] The words 「円」 (yen) and 「相場」 (rate) are each surrounded by words in the unnecessary word dictionary 34, namely 「の」, 「の」, and 「に」, so each is treated as a single word (FIG. 4(d)). As a result of the analysis, the words shown in FIG. 4(d) become the words to be searched. This word information is passed to the search processing unit 40.

[0032] First, 「東京外国為替市場」 appears as a compound word, and the usage frequency information stored in the usage frequency information storage unit 38 is consulted for each of its words 「東京」, 「外国」, 「為替」, and 「市場」. Suppose that the input 「為替市場」 (exchange market) has appeared before and that its usage frequency is higher than that between 「外国」 and 「為替」. In that case, as a result of the re-analysis, 「為替市場」 is searched as level 1, and 「東京」 and 「外国為替」 (foreign exchange) are searched as level 2. This is expected to find documents closer to the input than treating 「東京」, 「外国為替」, and 「為替市場」 as document groups of the same level.

[0033] FIG. 5 is a diagram showing an example of documents with different ranks among the documents obtained. The document shown in FIG. 5(a) has a high rank, and the document shown in FIG. 5(b) has a low rank.

[0034] If the words of levels 1 to 3 described above do not exist, all the words shown in FIG. 4(c) become search targets. When the search processing unit 40 finds documents, a character string is output from the output device 60. In this embodiment the maximum segmentation was reached in three stages, but for other character strings the process is repeated over as many stages as needed until the maximum segmentation is reached. That is, a search is performed each time the character string is split further, and the matching document groups are arranged in order. In this way, documents containing text closer to the character string input from the input device 10 can be output from the output device 60.

[0035]

[Effects of the invention] As described above, the present invention has the effect of finding documents closer to the content of the input character string, rather than documents that merely contain the input search characters. This saves the effort of re-entering a query, or of entering further queries based on the results to narrow down the search. Furthermore, reducing the time spent on input and access leaves more time for consulting the content of the retrieved documents.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a character string search device according to one embodiment of the present invention.

FIG. 2 is a flowchart showing the operation of the character string search device according to one embodiment of the present invention.

FIG. 3 is a diagram for explaining in detail a specific example of one embodiment of the present invention.

FIG. 4 is a diagram for explaining in detail a specific example of another embodiment of the present invention.

FIG. 5 is a diagram showing an example of documents with different ranks among the documents obtained.

[Explanation of symbols]

10 Input device
20 Character string analysis unit
30 Storage unit
32 Word dictionary
34 Unnecessary word dictionary
36 Connection information storage unit
38 Usage frequency information storage unit
40 Search processing unit
50 Document database

Claims

1. A character string search device comprising: an input device for inputting a character string to be searched; a storage unit having a word dictionary that stores words, an unnecessary word dictionary that stores information on words that should not be matched in search results, and a connection information storage unit that stores connection information indicating the strength of the connection between words; a character string analysis unit that divides the character string input from the input device into words based on the word information and the unnecessary word information, and combines the words into compound words based on the connection information; and a search processing unit that searches a document database for the compound words.

2. The character string search device according to claim 1, wherein the storage unit further includes a usage frequency information storage unit that stores usage frequency information indicating the word connections whose strength has exceeded a certain value.

3. The character string search device according to claim 2, wherein the character string analysis unit forms compound words from the words based on the usage frequency information.