JPH01258026A

JPH01258026A - Method for retrieving code string

Info

Publication number: JPH01258026A
Application number: JP63085018A
Authority: JP
Inventors: Tadashi Osone; 匡大曽根; Hiroyuki Kitajima; 北嶋　弘行
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1988-04-08
Filing date: 1988-04-08
Publication date: 1989-10-16

Abstract

PURPOSE: To rapidly locate an approximate pattern by using address information as control information in a method for retrieving approximate patterns. CONSTITUTION: A text is stored in a secondary storage device 100 and searched by a text retrieving device connected to a secondary storage control device 110, and the search result is transferred to a CPU 140. In addition to information expressing the degree of match, such as the number of characters that coincide with the specified pattern, address information indicating where in the text the best-matching part begins is added as control information. When it becomes known at a certain time that a required approximate pattern has been detected, this address information is consulted so that the position of the detected approximate pattern can be specified.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a symbol string search method suitable for finding a specific symbol string in a long symbol string in an information processing system.

[Conventional technology]

With the spread of office automation, document information is rapidly being put into databases, and those databases are growing in scale.

Speeding up database processing of document information is therefore an important issue. One of the most important operations in this processing is symbol string search, which finds a specific symbol string, called a pattern, in symbol string data called text. This operation is indispensable in document information processing, for example for keyword search in and extraction from original text files.

Many methods for this kind of symbol string search have been proposed. They are discussed, for example, in Hollaar's "Hardware Systems for Text Information Retrieval" (Hollaar, L. A.: Hardware Systems for Text Information Retrieval, ACM SIGIR 6th Conf., 1983). However, all of the methods discussed there aim to find in the text a symbol string that exactly matches the specified symbol string (pattern). That search function is insufficient when one wants to search text containing input errors or inconsistently used terms. Such cases require a function that can also find patterns in which some symbols are missing, wrong, or redundantly inserted, that is, a function that can search for approximate patterns. A method of this type is discussed in Japanese Patent Application Laid-Open No. 61-28134.

[Problem to be solved by the invention]

However, the purpose of the above prior art is to determine whether a given text contains a pattern that approximates the specified pattern. No consideration was given to identifying where the approximate pattern is located, that is, from where to where in the text it extends.

An object of the present invention is to provide a search method that identifies the location of an approximate pattern at high speed.

[Means to solve the problem]

To achieve the above object, the present invention applies to a method in which logic units, each of which compares one character of the pattern with one character of the text and generates control information, are connected in series, and the symbol string search is executed while that control information is passed along. In this method, the control information carries not only information expressing the degree of match, such as how many characters agree with the specified pattern, but also address information, such as the position in the text from which the best match begins.

[Effect]

In the present invention, when it becomes known at a certain time that a desired approximate pattern has been detected (which can be seen from the matching-character-count information), the position of the detected approximate pattern can be specified by referring to the above address information.

[Example]

An embodiment of the present invention will now be described in detail.

FIG. 1 shows an example of the overall configuration. The text is stored in a secondary storage device 100 and is searched by a text search device 5 connected to a secondary storage control device 110; the result is transferred to a CPU 140. The approximate-pattern search is executed in this text search device 5.

An approximate-pattern search request is defined as detecting, in the text, patterns that differ from the specified pattern by at most K characters. A pattern detected in this way is called an approximate pattern of the specified pattern. In this example, the text length is N characters and the specified pattern length is M characters. The character string from the i-th to the j-th character of the text is written TXT(i:j), and the character string from the i-th to the j-th character of the specified pattern is written PTN(i:j). In particular, the i-th character of the text is written TXT(i) and the i-th character of the specified pattern is written PTN(i).
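The notation is 1-based and inclusive at both ends, which differs from the slicing convention of most programming languages. As a small illustration, the mapping onto Python slices might look like the sketch below; the helper name `span` and the sample string are hypothetical and are not part of the patent.

```python
def span(s, i, j=None):
    """Return the 1-based, inclusive substring s(i:j); s(i) when j is omitted."""
    if j is None:
        j = i
    # 1-based inclusive (i..j) corresponds to the 0-based half-open slice [i-1, j).
    return s[i - 1:j]


sample = "ABCDEFG"          # arbitrary sample text, not from the patent
print(span(sample, 2, 4))   # characters 2 through 4
print(span(sample, 5))      # the single character at position 5
```

The same helper serves for both TXT(i:j) and PTN(i:j), since the two notations differ only in which string they index.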

One search algorithm for approximate patterns is the method shown in FIG. 2. As an example, consider the case where the text is "OUSAKAOHHSAKAOSAKA" and the specified pattern is "OHSAKA". FIG. 3 shows how the control information C(i, t) changes over time in this case. In this algorithm, C(i, t) expresses the number of matching characters at the point where PTN(1:i), compared with TXT(x:t), agrees best. For example, in FIG. 3, C(6, 6), C(6, 13), and C(6, 18) are all 5, which indicates that approximate patterns differing by one character from the 6-character specified pattern have been detected. However, this algorithm could not easily determine where the approximate pattern is located.
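The count-only scheme can be sketched as follows. This is a minimal illustration and not the flowchart of FIG. 2, which is not reproduced in this text. The recurrence used here (the best of a diagonal, a vertical, and a horizontal candidate, with C(0, t) = C(i, 0) = 0) is an assumption, chosen because it gives C(6, 6) = C(6, 13) = C(6, 18) = 5 for the example; the function name `match_counts` is hypothetical.

```python
def match_counts(text, pattern):
    """Return [C(M, 1), ..., C(M, N)] under the assumed recurrence."""
    m = len(pattern)
    prev = [0] * (m + 1)              # column t-1; C(i, 0) = 0 (assumption)
    result = []
    for ch in text:
        cur = [0] * (m + 1)           # C(0, t) = 0 (assumption)
        for i in range(1, m + 1):
            # Diagonal: extend the best match by comparing PTN(i) with TXT(t).
            d1 = prev[i - 1] + (1 if pattern[i - 1] == ch else 0)
            # Vertical: PTN(i) has no counterpart in the text.
            d2 = cur[i - 1]
            # Horizontal: TXT(t) is an extra character, which costs one.
            d3 = prev[i] - 1
            cur[i] = max(d1, d2, d3)
        result.append(cur[m])
        prev = cur
    return result


counts = match_counts("OUSAKAOHHSAKAOSAKA", "OHSAKA")
# End positions t (1-based) where the pattern matches with one difference.
print([t for t, c in enumerate(counts, start=1) if c == 5])
```

Under these assumptions the sketch reports only that matches end at t = 6, 13, and 18. Nothing in C(M, t) says where each match begins, which is the gap the address information is meant to fill.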

In the present invention, address information A(i, t) is introduced and the algorithm is changed as shown in FIG. 5, so that the location of an approximate pattern can be found easily. FIG. 6 shows a conceptual flowchart. Steps 420 and 440 initialize the address information A(i, t). In step 480, A(i, t) is generated according to which of D1, D2, and D3 the matching-character-count information C(i, t) was generated from (steps 490 to 530). When step 540 finds that an approximate pattern has been detected, step 550 specifies the position of that approximate pattern. The approximate pattern at this time is the character string TXT(A(M, t):t), which runs from the A(M, t)-th character of the text to the t-th character. Using the address information A(i, t) in this way makes it easy to specify the position of an approximate pattern.
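A minimal sketch of carrying a start address alongside the match count is given below. It is not the flowchart of FIG. 5, which is not reproduced in this text, and it rests on several assumptions: D1, D2, and D3 are taken to be the diagonal, vertical, and horizontal candidates; the initial values are A(0, t) = t + 1 and A(i, 0) = 1; and ties between candidates are broken in favour of the earliest start. The function name `approximate_search` is hypothetical.

```python
def approximate_search(text, pattern, k):
    """Return (start, end, substring) for each t with M - C(M, t) <= k (1-based)."""
    m = len(pattern)
    c_prev = [0] * (m + 1)            # C(i, 0) = 0 (assumption)
    a_prev = [1] * (m + 1)            # A(i, 0) = 1 (assumption)
    hits = []
    for t, ch in enumerate(text, start=1):
        c_cur = [0] * (m + 1)
        a_cur = [t + 1] * (m + 1)     # A(0, t) = t + 1: an empty match starts after t
        for i in range(1, m + 1):
            # Each candidate carries its count together with the address it inherits.
            d1 = (c_prev[i - 1] + (1 if pattern[i - 1] == ch else 0), a_prev[i - 1])
            d2 = (c_cur[i - 1], a_cur[i - 1])
            d3 = (c_prev[i] - 1, a_prev[i])
            # Highest count wins; on a tie, prefer the earliest start (assumption).
            c_cur[i], a_cur[i] = max((d1, d2, d3), key=lambda d: (d[0], -d[1]))
        if m - c_cur[m] <= k:
            start = a_cur[m]
            hits.append((start, t, text[start - 1:t]))   # TXT(A(M, t):t)
        c_prev, a_prev = c_cur, a_cur
    return hits


for start, end, found in approximate_search("OUSAKAOHHSAKAOSAKA", "OHSAKA", 1):
    print(start, end, found)
```

With the example text and pattern and K = 1, this sketch reports the spans (1, 6), (7, 13), and (14, 18), that is, OUSAKA, OHHSAKA, and OSAKA. Only the previous column is kept, so the state per pattern position is one count and one address, which is what each of the serially connected logic units would need to hold.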

FIG. 3 and FIG. 4 show how C(i, t) and A(i, t), respectively, change over time when this algorithm is used. At time t = 6, C(M, t) = C(6, 6) = 5 and M - C(M, t) = 1, so an approximate pattern differing by one character has been detected. The address information then gives the position of this approximate pattern: the start position is A(M, t) = A(6, 6) = 1 and the end position is t = 6. In other words, TXT(A(M, t):t) = TXT(1:6) = OUSAKA is an approximate pattern of "OHSAKA".

Similarly, at times t = 13 and t = 18, C(M, t) = 5, so approximate patterns differing by one character from the specified pattern "OHSAKA" have been detected. Using A(M, t), it is easy to see that they are, respectively,

TXT(A(M, t):t) = TXT(A(6, 13):13) = TXT(7:13) = OHHSAKA
TXT(A(M, t):t) = TXT(A(6, 18):18) = TXT(14:18) = OSAKA

This algorithm can be implemented in hardware as appropriate, as shown in FIG. 7. Each cell 1 compares a text character with a pattern character and generates the control information C(i, t) and the address information A(i, t); because the cells operate in parallel, the search is fast. In FIG. 7, 2 denotes the text line, 3 the approximation-degree line, and 4 the address information line.

[Effect of the invention]

Approximate-pattern search gives users a flexible way to search.

For example, it lets users search with keywords they remember only vaguely, or search for similar patterns.

With the prior art, it was possible to recognize that an approximate pattern exists in a text, but it was difficult to specify its position. According to the present invention, the position of an approximate pattern can be specified easily. Furthermore, because the method is regular and its steps can operate in parallel, it is easy to implement in hardware, and the location of an approximate pattern can be found at high speed.

[Brief explanation of the drawing]

FIG. 1 is an overall configuration diagram of hardware that implements the present invention. FIG. 2 is a flowchart showing the conventional method. FIG. 3 is an explanatory diagram of an operation example of the prior art. FIG. 4 is a detailed explanatory diagram of the present invention. FIG. 5 is a flowchart showing the method of the present invention. FIG. 6 is a flowchart showing the conceptual flow of FIG. 5. FIG. 7 is a hardware configuration diagram of an embodiment of the data search section of the present invention.

1 ... cell, 2 ... text line, 3 ... approximation-degree line, 4 ... address information line

Claims

[Claims]

1. A symbol string search method for searching, when a pattern that is a designated symbol string is given, a symbol string called text for an approximate pattern, that is, a symbol string that differs from the pattern by no more than a predetermined number of symbols, characterized in that the location of the approximate pattern is specified by using address information as control information.

2. The symbol string search method according to claim 1, characterized in that logic units are connected in series, each of which compares one symbol of the pattern with one symbol of the text, generates information representing a new number of matching characters from the comparison result and the information representing the number of matching characters transmitted from the logic unit one stage below, and generates address information in accordance with how that information was generated, and in that the location of the approximate pattern is specified while the information representing the number of matching characters and the address information are transmitted between the logic units.