JPS6027938A - Character string comparing device - Google Patents
Character string comparing device
- Publication number
- JPS6027938A
- Authority
- JP
- Japan
- Prior art keywords
- character
- string
- character string
- characters
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90339—Query processing by using parallel associative memories or content-addressable memories
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Document Processing Apparatus (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
[Detailed Description of the Invention]
〈Field〉
The present invention relates to a character string comparison device. In particular, it relates to a method of comparing character strings with one another that is effective in systems that retrieve documents and the like, organized by attached keywords, through those keywords, and that is capable of extracting similar character strings.
〈Prior Art and Background〉
Retrieval systems that rely on match determination, rather than on sophisticated techniques such as semantic judgment, are in frequent use, for example in literature searches and in accessing the conversion dictionary in kana-kanji conversion. In such a system, a search character string is entered into a designated register and compared with the character strings read out in sequence from the data already filed in a storage device, and the entries that match are retrieved.
Viewed as a man-machine interface, however, such a system can tolerate some unwanted hits: even if unwanted entries are retrieved, a person can judge and easily discard them. The thinking is therefore that the disadvantage of failing to retrieve similar entries is more serious than the disadvantage of having unwanted entries mixed in.
Conventional methods of detecting similar character strings on the basis of this view fall into two groups. The first specifies the part of the reference character string that must always match, as in (1) comparison by prefix match, (2) comparison by suffix match, and (3) comparison by infix (middle) match. The second is based on the idea of similarity groups: variations in expression are stored in advance in dictionary form, and whenever a match search is made with the reference character string, a match search is also made for its similarity group.

With the former, a large amount of unwanted data is returned unless the matching part is specified after considering, to some extent in advance, what kinds of similar character strings may appear. With the latter, a dictionary for looking up secondary reference character strings must be prepared separately from the data source, storing in advance the differences in notation, synonyms, inflections, euphonic changes and the like that make up the conceivable similarity group for each character string to be searched. This has the drawback that the dictionary grows to a size that cannot be ignored for a device, and the further drawback that, for each search, the secondary reference character strings must be looked up one by one and each used in turn as the reference character string for comparison.

In view of the above, the present invention seeks to provide a method that, by a comparatively simple procedure, substantially prevents data that ought to be retrieved from being missed in this kind of search work. In searching, differences in the notation of a word written in kana for the most part amount to the omission or replacement of a single character. Taking this into account, the character string under comparison is compared with the reference character string one character at a time. Strings that fall within these categories are treated as near matches, while strings whose lengths differ by two or more characters, strings of the same length with two or more mismatched characters, and strings with two or more pairs of transposed adjacent characters are judged to be mismatches. By providing a character string comparison device that applies these categories, some data noise in the strict sense does come in, but valid data can be retrieved quite efficiently and without omission. At the same time, the means for realizing judgment based on these categories is far simpler and cheaper than the conventional methods described above.
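The three conventional partial-match comparisons named above can be illustrated with a minimal Python sketch. The function name `partial_match` and the sample keyword list are illustrative assumptions and do not come from the patent.

```python
def partial_match(key: str, candidate: str, mode: str) -> bool:
    """Return True if `candidate` matches `key` under the given partial-match mode."""
    if mode == "prefix":   # forward match: candidate begins with key
        return candidate.startswith(key)
    if mode == "suffix":   # backward match: candidate ends with key
        return candidate.endswith(key)
    if mode == "infix":    # middle match: key occurs anywhere in candidate
        return key in candidate
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical keyword list. An infix search on a short key returns every
# entry that merely contains it, which is the unwanted-data problem noted above.
keywords = ["database", "metadata", "data", "update"]
hits = [w for w in keywords if partial_match("data", w, "infix")]
```

The caller has to decide in advance which part of the key must match, which is the limitation the text attributes to this group of methods.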
Accordingly, the object of the present invention is to provide a practical character string comparison device that judges matches and near matches of character strings according to the above categories. To achieve this object, the invention is characterized by comprising: first and second shift registers that store, character by character, the comparison character string and the character string to be compared; a comparator that compares characters at the same position among the characters held in the first and second shift registers; a comparator that compares characters whose positions are offset by one; a judgment circuit that judges match, mismatch or similarity for each character on the basis of these comparison results; a similarity counter that accumulates the number of similarity judgments in the judgment results; and shift means for the shift registers. The characters are compared while being shifted in sequence, and as a result of comparing the character strings, strings that agree in the same order and with the same string length are judged to be matching character strings, while those whose accumulated similarity is at or below a predetermined value are judged to be near-matching character strings.
In addition to the above comparators, the device may have a pair of character type detectors that detect the character types of the characters at the same position; when the comparison result of the same-position comparator is a mismatch and the mismatched character is of a specific character type, a mismatch count larger than the similarity judgment count is added to the similarity counter. Furthermore, the matching and near-matching character strings so judged are stored in a form in which they can at least be distinguished.
〈Embodiments〉
Fig. 1 is an explanatory diagram of one embodiment of the present invention. In the figure, 1, 2, 3 and 4 are comparators, and 5 and 6 are the first and second shift registers. Each division within a shift register denotes a block that holds the code for one character. The comparison character string on the reference side and the character string to be compared are assumed to be held in their respective shift registers one character per block, with the head at the left side of the figure, and each string can be shifted to the left one character at a time by shift pulses from a control circuit (not shown).
7 is a judgment circuit, which evaluates the match and mismatch results produced by comparators 1 to 4. 8 is a similarity counter that counts the mismatch evaluations issued by judgment circuit 7: if the mismatch evaluations accumulated during the comparison of one character string number 0, the result is an exact match; if 1, a near match; and if 2 or more, a mismatch.

In this configuration, suppose first that the codes of the two character strings to be compared are held in their respective shift registers with their heads aligned (at the left in the figure). Comparator 1 compares the characters held at the head of each register, comparator 2 compares the characters held in second position, comparator 3 compares the character at the head of shift register 5 with the character in second position in shift register 6, and comparator 4 compares the character in second position in shift register 5 with the character at the head of shift register 6. Each outputs, for example, [1] on a match and [0] on a mismatch, and the outputs are denoted a, b, c and d respectively. The shift pulses with which judgment circuit 7 shifts the strings held in the shift registers to the left, in units of characters, in order to compare the next characters are denoted e and f, and the output of the similarity evaluation result for each comparison is denoted g. On these definitions, the operation of judgment circuit 7 is given by the truth table below.
Table 1
That is, when a and b both indicate a match ([1]), it is clear that the two characters agree exactly, with the same characters in the same order, so the character strings in the upper and lower shift registers 5 and 6 are each advanced by two characters, as indicated by e and f, and the next characters are compared.
When a, b, c and d all indicate a mismatch (all [0]), the case lies outside the similarity categories, so the character strings are judged at that point to be mismatched, a judgment signal is raised to fix the decision, and the comparison for that compared character string is broken off.
When one of a or b is [1], and when a and b are both [0] but one of c or d is [1], the characters being compared in the registers are taken to be displaced from each other by just one character. If c is [1], e is set to 1, and if d is [1], f is set to 1, so that one of the shift registers is advanced by one character, and [1] is sent as signal output g to similarity counter 8. If the comparison made after this shift yields [1] for both a and b, the registers are again shifted two characters each and the next characters are compared. This procedure is repeated to the end of the character string.
When one of a or b is [1] and c and d are both [0], the case is judged to be a one-character difference: 1 is added to similarity counter 8, e and f are both set to 1, and the characters in shift registers 5 and 6 are each advanced by one character for the next comparison.
When a and b are both [0] and c and d are both [1], the case is judged to be an interchange in the order of one pair of adjacent characters: 1 is added to similarity counter 8, e and f are both set to 1, and shift registers 5 and 6 are each advanced by one character for the next comparison.
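The case analysis for Fig. 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the circuit itself: the function names are hypothetical, the strings are assumed to hold at least two characters, and treating every remaining combination of outputs as the one-character offset case is one reading of the text.

```python
def comparator_outputs(s: str, t: str) -> tuple[int, int, int, int]:
    """Outputs a, b, c, d of comparators 1 to 4 for the two leading characters.

    Assumes both strings hold at least two characters; shorter remainders
    are not covered by this sketch.
    """
    a = int(s[0] == t[0])  # comparator 1: head vs head
    b = int(s[1] == t[1])  # comparator 2: second vs second
    c = int(s[0] == t[1])  # comparator 3: head of register 5 vs second of register 6
    d = int(s[1] == t[0])  # comparator 4: second of register 5 vs head of register 6
    return a, b, c, d


def classify(a: int, b: int, c: int, d: int) -> str:
    """Name the case assigned to one set of comparator outputs."""
    if a and b:
        return "two-character match"
    if not (a or b or c or d):
        return "mismatch, abort"
    if not a and not b and c and d:
        return "adjacent transposition"
    if a != b and not c and not d:
        return "one-character difference"
    return "one-character offset"  # all remaining combinations (assumed reading)
```

For example, `classify(*comparator_outputs("ab", "ba"))` falls into the transposition case, while `"ab"` against `"xa"` falls into the offset case.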
The character-by-character comparison procedure above is repeated until the character strings come to an end or the count accumulated in the similarity counter exceeds a preset number (in this embodiment, an accumulated value of 2 or more means a mismatch). When the comparison for one character string has ended (run to completion or been broken off), a count of 0 in similarity counter 8 is judged an exact match and a count of 1 a near match. For example, evaluation bit [0] is appended to an exact match and evaluation bit [1] to a near match, the character string in the register that held the compared string is stored in a memory (not shown) that holds the retrieved character strings, and the next character string to be compared is loaded into the shift register for the next comparison. The cut-off limit of the similarity counter is set to 2 here, but depending on the purpose it may be set to 3 or more, with near matches further divided into matches of similarity 1, matches of similarity 2 and so on, and identification bits attached in the same way so that the basis of the judgment can be traced later.
To shorten the evaluation procedure, this example uses the similarity counter to decide whether the judgment is settled and ends the comparison at once, but the comparison may of course be continued to the end of the character string.
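The final judgment from the similarity counter can be sketched as a small Python function. The name `final_judgment` and the `cutoff` parameter are illustrative assumptions; the default of 2 reflects the cut-off used in this embodiment.

```python
def final_judgment(count: int, cutoff: int = 2) -> str:
    """Map the accumulated similarity count to a verdict.

    count == 0        -> exact match
    0 < count < cutoff -> near match, labelled with its similarity
    count >= cutoff   -> mismatch
    """
    if count == 0:
        return "exact match"
    if count < cutoff:
        return f"near match (similarity {count})"
    return "mismatch"
```

Raising `cutoff` to 3 makes a count of 2 a near match of similarity 2, which corresponds to the finer grading the text mentions.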
Fig. 2 is an explanatory diagram of another embodiment of the present invention. 11 and 14 are comparators, 15 and 16 are shift registers, 17 is a judgment circuit, 17m is a 1-bit memory provided within judgment circuit 17 (referred to below as M), and 18 is a similarity counter. Compared with Fig. 1, two of the comparators are omitted, so that judgment circuit 17 receives only two comparison results, a and d. The criteria for issuing evaluation output g or h, and for controlling the shifting of shift registers 15 and 16 on the basis of the result, differ, but the purpose is basically the same as in Fig. 1, and the reference numerals follow Fig. 1 with 10 added.
The truth table of the judgment circuit is given below as Table 2. The representation of a match as [1] and a mismatch as [0], and the notation for the shift amounts, follow those of Table 1.
Table 2
As is clear from Table 2, comparators 11 and 14 operate on shift registers 15 and 16, which hold the comparison character string and the character string to be compared with their heads aligned at the left of the figure, as in the embodiment of Fig. 1. Comparison starts from this state.
Starting from this state, if output a of comparator 11 is [1] (match), the characters at the same position match regardless of whether output d of comparator 14 is [1] or [0], so judgment circuit 17 judges a match and sets e and f both to 1; that is, both shift registers are advanced by one and the next characters are compared.

If output signal a is [0] and d is [1], the order is either displaced or interchanged. In either case a similar pattern is possible, so e = 1 and f = 2; that is, shift register 15 is shifted by 1 and shift register 16 by 2, the next comparison is made, and output g is set to 1 so that 1 is added to the similarity counter.

If output signals a and d are both [0], a similar pattern is possible but so is a mismatch, so e = f = 1; that is, the upper and lower shift registers are each advanced by one and the next comparison is made. In this case 1 is set in the 1-bit memory M provided in judgment circuit 17, and 1 is also added to similarity counter 18.
1-bit memory M is reset if a = [1] at the next comparison.

If, after the state a = d = [0] has appeared once and 1-bit memory M has been set to 1, a = d = [0] appears again in the following comparison within the same character string, the string is taken to contain two mismatches. 1-bit memory M is returned to 0 and, by output signal h, a value larger than at least the count judged as similar is added to the similarity counter, forcing a mismatch judgment, and the comparison procedure for that character string is broken off and ended.
The similarity tolerance limit is preset for the value counted by similarity counter 18, and a mismatch is judged if the tolerance is exceeded during the above comparison procedure.

When comparison is carried out in this way, an exact match at least always gives a = [1], so the value of similarity counter 18 is zero; one character too many or too few gives a similarity of 1 or 2; and a one-character difference between strings of the same length and order gives a similarity of 1 or 2. The classifying ability is thus not identical to that of the first embodiment, but as far as matching and similar character strings are concerned, the same function is achieved with simpler comparison means.
As noted earlier, it is understood that trying to retrieve similar character strings in this kind of comparison brings in a certain amount of unneeded data (data noise). It is equally true, however, that there are character strings whose meaning changes greatly when just one character differs or when the order alone differs. Although such similarity groups can be tolerated to some extent for purposes such as searching, the most blatant cases are better excluded in advance if they can be removed easily.
Typical character strings that give rise to such cases of greatly differing meaning are, first, numerals: where positional notation is used, not even a shift in order can be allowed.
Another typical case is one in which a negating prefix is involved, as in pairs such as 「アンテイ」 and 「フアンテイ」, 「イッチ」 and 「フイッチ」, or 「トウキ」 and 「ヒトウキ」.
If a comparatively simple judgment means can remove even some of the unwanted entries from among character strings that readily pick up such data noise, the method can be called a practically effective one.
Fig. 3 is an explanatory diagram of a third embodiment which, in the above sense, adds some attribute judgment to the embodiment of Fig. 2.
The configuration of the Fig. 3 embodiment is basically similar to that of Fig. 2, and corresponding parts are given numbers in the twenties, a further 10 higher. It differs from the configuration of Fig. 2 in two points. First, character type detectors 29 and 30 are provided for the characters at the heads of shift registers 25 and 26, into which the character strings compared by comparator 21 are loaded with their heads aligned at the left, and their character type outputs are denoted j and k. Second, the judgment logic of judgment circuit 27 is somewhat different.
The operation truth table for Fig. 3 is given as Table 3.
Table 3
The notation in this table is basically the same as in Tables 1 and 2. For output j of character type detector 29 and output k of character type detector 30, the explanation here takes [1] to mean that a numeral was detected and [0] to mean anything else.
On these settings, comparison in the case of Fig. 3 proceeds as follows. When output a of comparator 21 is [1], the characters match regardless of output d, so, as in Fig. 2, a match is recorded, the shift registers are each advanced by one, and the next comparison is made. The character type results j and k play no part in this case.
When output a is [0] but output d is [1], the head character of shift register 25 matches the next character of shift register 26. If signal k (the character type of the head character of shift register 26) is [0] (not a numeral), e is set to 1 and f to 2, so that shift register 26 is advanced by one extra character. Comparator 21 then compares the second character of shift register 25 with the third character of shift register 26, the character types are detected, and 1 is added to similarity counter 28.
When signal a is [0] and signals j and k are both [1] (numerals), there is a mismatch between two characters that are both numerals. Regardless of the other states, mismatch signal h is sent to the similarity counter, further comparison is broken off, the mismatch evaluation is taken as settled, and the comparison between those character strings ends.
When signal a is [0], signal k is [1] and signal j is [0], the second character of shift register 25 may match the head character of shift register 26, so e is set to 1 and f to 0; that is, only shift register 25 is advanced by one character for the next comparison, 1 is added to similarity counter 28, and 1-bit memory M is set to [1].
If signal a is again [0] at the next comparison, two mismatches have occurred in succession, so the mismatch evaluation is settled and further comparison is broken off, as in the example of Fig. 2. If signal a is [1] at the next comparison, 1-bit memory M is reset, again as in Fig. 2.
When signals a, d, j and k are all [0] (not numerals and not matching), a one-character difference is likewise considered possible, so e is set to 1 and f to 0; that is, only shift register 25 is advanced by one, the next comparison is made, 1-bit memory M is set to [1], and 1 is added to similarity counter 28.
When signals a and d are both [0] and signal j is [1] (a numeral), it is clear that characters forming numeric data do not match, so in this case too mismatch signal h is sent to similarity counter 28 and the mismatch judgment is fixed.
In summary, whenever signal a is other than [1], similarity counter 28 always receives an increment through signal g or h. If mismatch signal h, used when the judgment is fixed, is set so that it alone exceeds the evaluation level against which the accumulated value of the similarity counter is assessed, then matching character strings can be identified as those with similarity zero and near-matching character strings as those whose similarity is at or below the evaluation level. In addition, by adding the character type as an element of the comparison categories and varying the weight according to the type of the characters compared, the most blatant data noise can be prevented.
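One decision of judgment circuit 27 can be sketched in Python as a function from the inputs a, d, j, k and the state of M to the outputs e, f, g, h and the next state of M. This follows the prose above rather than Table 3. The function name, the dictionary layout, and the order in which overlapping conditions are tested are assumptions of this sketch.

```python
def judge_step(a: int, d: int, j: int, k: int, m: int) -> dict:
    """One decision of the Fig. 3 judgment logic, as described in the prose.

    a, d: comparator outputs; j, k: 1 if the head character of register
    25 / 26 is a numeral; m: current state of 1-bit memory M.
    Returns shift amounts e and f, counter signals g and h, and the next M.
    """
    stop = {"e": 0, "f": 0, "g": 0, "h": 1, "m": 0}   # mismatch fixed, abort
    if a:            # same-position match: advance both, reset M
        return {"e": 1, "f": 1, "g": 0, "h": 0, "m": 0}
    if m:            # second consecutive a = 0
        return stop
    if j and k:      # two numerals that differ
        return stop
    if d and not k:  # head of 25 equals second of 26, skipped character is no numeral
        return {"e": 1, "f": 2, "g": 1, "h": 0, "m": 0}
    if k and not j:  # numeral at the head of register 26 only
        return {"e": 1, "f": 0, "g": 1, "h": 0, "m": 1}
    if j:            # a = d = 0 with a numeral at the head of register 25
        return stop
    return {"e": 1, "f": 0, "g": 1, "h": 0, "m": 1}   # a = d = j = k = 0


# The character type inputs could be derived with str.isdigit(), for example:
j_example, k_example = int("7".isdigit()), int("x".isdigit())
```

A weight for h larger than the evaluation level would then be applied by the caller when `h` is 1, as the summary above describes.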
In the Fig. 3 embodiment the character type categories distinguish numerals from non-numerals, assigned [1] and [0] respectively, but negating characters that form an antonym with a single character may clearly be placed in the same group as numerals.
〈Effects〉
As explained above, the present invention provides a character string comparison device that compares characters one by one for match or mismatch, without complex evaluation of grammatical relationships and the like. Despite a very simple set of categories, the device can extract with comparative flexibility, and without admitting much data noise, the similar word strings that arise from differences in the kana characters chosen in writing and similar variations.
Fig. 1 is an explanatory diagram of one embodiment of the present invention, Fig. 2 is an explanatory diagram of another embodiment of the present invention, and Fig. 3 is an explanatory diagram of a third embodiment of the present invention. 5, 15, 25 and 6, 16, 26 are the upper and lower shift registers respectively; 7, 17, 27 are judgment circuits; 8, 18, 28 are similarity counters; 29 and 30 are character type detectors; M is a 1-bit memory; and a, b, c, d, e, f, g, h, j and k are the reference symbols attached to the respective signal outputs.
Claims (1)
[Scope of Claims]
1) A character string comparison device comprising: first and second shift registers that store, character by character, a comparison character string and a character string to be compared; a comparator that compares characters at the same position among the characters held in the first and second shift registers; a comparator that compares characters whose positions are offset by one; a judgment circuit that judges match, mismatch or similarity for each character on the basis of these comparison results; a similarity counter that accumulates the number of similarity judgments in the judgment results; and shift means for the shift registers; wherein the characters are compared while being shifted in sequence and, as a result of comparing the character strings, strings that agree in the same order and with the same string length are judged to be matching character strings and those whose accumulated similarity is at or below a predetermined value are judged to be near-matching character strings.
2) The character string comparison device according to claim 1, further comprising, in addition to the above comparators, a pair of character type detectors that detect the character types of the characters at the same position, wherein, when the comparison result of the same-position comparator is a mismatch and the mismatched character is of a specific character type, a mismatch count larger than the above similarity judgment count is added to the similarity counter.
3) The character string comparison device according to claim 1 or 2, wherein the matching character strings and near-matching character strings so judged are stored in a form in which they can at least be distinguished.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP58136502A JPS6027938A (en) | 1983-07-26 | 1983-07-26 | Character string comparing device |
Publications (2)
Publication Number | Publication Date |
---|---|
JPS6027938A (en) | 1985-02-13 |
JPH0335702B2 (en) | 1991-05-29 |
Family
ID=15176657
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
JP58136502A Granted JPS6027938A (en) | Character string comparing device | 1983-07-26 | 1983-07-26 |
Country Status (1)
Country | Link |
---|---|
JP (1) | JPS6027938A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0268663A (en) * | 1988-09-05 | 1990-03-08 | Ricoh Co Ltd | Character string retrieving device |
JPH02108157A (en) * | 1988-10-17 | 1990-04-20 | Sanyo Electric Co Ltd | Information retrieving method |
JPH04364577A (en) * | 1991-06-11 | 1992-12-16 | Oki Electric Ind Co Ltd | Specific data pattern detection system |
JPH10105574A (en) * | 1996-09-27 | 1998-04-24 | Hitachi Software Eng Co Ltd | Arrangement data similarity degree arithmetic unit |
JP2008102641A (en) * | 2006-10-18 | 2008-05-01 | Ns Solutions Corp | Retrieving device, retrieving method, and program |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS54104747A (en) * | 1978-02-03 | 1979-08-17 | Canon Inc | Small sized electronic unit |
JPS56101248A (en) * | 1979-12-28 | 1981-08-13 | Ibm | Word selection method |
JPS57113176A (en) * | 1980-12-19 | 1982-07-14 | Ibm | Conversation type data searching device |
Also Published As
Publication number | Publication date |
---|---|
JPH0335702B2 (en) | 1991-05-29 |
Similar Documents
Publication | Title |
---|---|
US4241402A (en) | Finite state automaton with multiple state types |
US5357431A (en) | Character string retrieval system using index and unit for making the index |
US5655129A (en) | Character-string retrieval system and method |
JPS63311530A (en) | Method and device for retrieving character string |
JPH07182465A (en) | Character recognition method |
KR950003984A (en) | How to use associative memory and associative memory |
GB1280487A (en) | Multilevel compressed index searching |
JPS57137976A (en) | Zip code discriminating device |
JPS6027938A (en) | Character string comparing device |
Mittendorf et al. | Applying probabilistic term weighting to OCR text in the case of a large alphabetic library catalogue |
JPS6033665A (en) | Automatic extracting system of keyword |
US4332014A (en) | Data retrieval system |
KR930000593B1 (en) | Voice information service system and method utilizing approximate matching |
EP0148008B1 (en) | Word spelling correlatively-storing method and its circuit |
JPS63198123A (en) | Character string collating method |
JPH03127176A (en) | Keyword extractor |
JP2773657B2 (en) | String search device |
JPH08249427A (en) | Method and device for character recognition |
JP2918380B2 (en) | Post-processing method of character recognition result |
JP3127869B2 (en) | Similar data extraction system and method |
JPH04279973A (en) | Character string comparison system |
JPH06161995A (en) | Method and device for shaping name data |
JPH0462630A (en) | Character code converting device |
JP2839515B2 (en) | Character reading system |
JPS60138688A (en) | Character recognizing method |