JPS6027938A - Character string comparing device

Info

Publication number: JPS6027938A
Application number: JP58136502A
Authority: JP
Inventors: Kiyoshi Oi (大井 清); Toshihiro Uchi (内 利広)
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1983-07-26
Filing date: 1983-07-26
Publication date: 1985-02-13
Also published as: JPH0335702B2

Abstract

PURPOSE: To prevent omission of data that should be retrieved, by classifying each character string as matching, semi-matching (similar), or non-matching while a character comparing device compares it with a reference character string character by character. CONSTITUTION: The reference character string and the character string to be compared are stored in shift registers 5 and 6 and compared with each other, character by character, by comparators 1-4. Comparison ends when all target characters have been compared or when the cumulative value of a similarity counter 8 exceeds a preset value; the counter value then indicates complete coincidence when it is 0 and semi-coincidence when it is 1. For example, an evaluation bit 0 is appended to a completely coincident character string and an evaluation bit 1 to a semi-coincident one, and the character string held in the register on the compared side is stored in a memory for retrieved character strings.

Description

[Detailed Description of the Invention]

<Field> The present invention relates to a character string comparison device. In particular, it relates to a method of comparing character strings that can extract similar character strings, which is useful in systems that retrieve documents and the like, organized by attached keywords, by way of those keywords.

<Prior art and background> Retrieval systems that rely on match determination, rather than on sophisticated techniques such as semantic judgment, are often used in, for example, literature searches and access to the conversion dictionary in kana-kanji conversion. In such a system, a search character string is entered into a designated register and compared with the character strings read out one after another from the data already filed in a storage device, and the strings that match are retrieved.

In such systems, however, viewed in terms of the man-machine interface, even if some unnecessary items are retrieved, a human can judge them and easily discard them. The thinking is therefore that failing to retrieve similar items is a more serious disadvantage than having unnecessary items mixed in.

Conventional methods of detecting similar character strings along these lines fall into two groups. The first specifies the part of the reference character string that must always match and compares on that part: (1) comparison by prefix match, (2) comparison by suffix match, and (3) comparison by middle match. The second is based on the idea of similar groups: variations of an expression are stored in advance in dictionary form, and when a match search is performed with the reference character string, a match search over its similar group is performed as well.

With the former, a large amount of unnecessary data is produced unless the user anticipates to some extent what kinds of similar character strings will appear and specifies the matching part accordingly. With the latter, a dictionary for looking up secondary reference character strings must be prepared separately from the data source, storing in advance the conceivable variants of each character string to be searched: differences in notation, synonyms, conjugations, euphonic changes, and so on. This has two drawbacks: the dictionary grows too large to be ignored in a device, and for each search the secondary reference character strings must be looked up one by one and each used in turn as the reference character string for comparison.

In view of these problems, the present invention aims to provide a relatively simple way of performing such retrieval that effectively prevents the omission of data that should be retrieved. It takes into account that, in retrieval, differences in the notation of a word written in kana are mostly one-character omissions or substitutions. The invention provides a character string comparison device that compares the character string under comparison with the reference character string one character at a time and applies the following categories: strings that fall within these small differences are judged semi-matching, while strings that differ in length by two or more characters, strings of the same length with two or more mismatched characters, and strings with two or more pairs of transposed consecutive characters are judged mismatched.

Some data noise in the strict sense will of course still get in, but valid data can be retrieved quite efficiently and without omission. At the same time, the means of realizing the determination based on these categories is far simpler and cheaper to implement than the conventional methods described above.
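The categories above (one omitted, substituted, or transposed character is tolerated; two or more are not) can be illustrated in software. The sketch below is not the patented circuit: it swaps in the optimal string alignment distance, a standard edit distance that counts insertions, deletions, substitutions, and adjacent transpositions as one step each, and maps a distance of 0, 1, or more to the three verdicts. The function names `osa_distance` and `classify` are hypothetical.

```python
def osa_distance(s, t):
    """Optimal string alignment distance (not the patent's hardware method).

    Insertions, deletions, substitutions and adjacent transpositions
    each cost 1.
    """
    rows, cols = len(s) + 1, len(t) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "AB" versus "BA".
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


def classify(reference, candidate):
    """Map the distance onto the three categories used in the text."""
    n = osa_distance(reference, candidate)
    if n == 0:
        return "match"
    if n == 1:
        return "semi-match"
    return "mismatch"
```

A dynamic-programming table of this kind needs memory proportional to the product of the two lengths, which is the sort of cost the shift-register scheme is designed to avoid; the sketch is only meant to make the categories concrete.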

Accordingly, the object of the present invention is to provide a practical character string comparison device that judges whether character strings match or semi-match according to the above categories. To achieve this object, the invention is characterized by comprising: first and second shift registers that store the comparison character string and the character string to be compared, character by character; a comparator that compares the characters held in the first and second shift registers in the same order; a comparator that compares characters offset from each other by one position; a determination circuit that determines, from these comparison results, a match, mismatch, or similarity for each character; a similarity counter that accumulates the number of similarity determinations; and shift means for the shift registers. The character strings are compared by comparing characters while shifting them sequentially; strings that match in the same order and with the same length are determined to be matching character strings, and strings whose accumulated similarity is at or below a predetermined value are determined to be semi-matching character strings.

Further features are that, in addition to the above comparators, a pair of character type detectors is provided to detect the character types of the same-order characters, and that, when the same-order comparison fails on characters of a specific character type, a mismatch count larger than the similarity determination count is registered in the similarity counter; and that the matching character strings and semi-matching character strings so determined are stored in a form in which they can at least be distinguished.

<Embodiments> FIG. 1 is an explanatory diagram of one embodiment of the present invention. In the figure, 1, 2, 3, and 4 are comparators, and 5 and 6 are the first and second shift registers. Each division within a shift register denotes a block that holds the code of one character. The comparison character string on the reference side and the character string to be compared are stored in their respective shift registers one character per block, with the head of each string at the left of the figure, and each string can be shifted leftward one character at a time by shift pulses from a control circuit (not shown).

Reference numeral 7 is a determination circuit that makes an evaluation based on the match and mismatch results from comparators 1 to 4. Reference numeral 8 is a similarity counter that counts the mismatch evaluations issued by determination circuit 7: if the mismatch evaluations accumulated during the comparison of one character string total 0, the string is judged an exact match; if 1, a semi-match; and if 2 or more, a mismatch.

In this configuration, suppose first that the codes of the two character strings to be compared are held in their respective shift registers with their heads aligned (at the left of the figure). Comparator 1 compares the characters held at the heads of the two registers; comparator 2 compares the characters held in the second positions; comparator 3 compares the character at the head of shift register 5 with the character in the second position of shift register 6; and comparator 4 compares the character in the second position of shift register 5 with the character at the head of shift register 6. Each comparator outputs, for example, [1] on a match and [0] on a mismatch, and the four outputs are denoted a, b, c, and d respectively.

The shift pulses with which determination circuit 7 shifts the character strings in the shift registers leftward, in units of characters, to compare the next characters are denoted e and f, and the output of the evaluation made for each comparison with regard to similarity is denoted g. Based on these definitions, the operation of determination circuit 7 is given as a truth table below.
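As a concrete reading of the four comparator definitions, the following sketch computes a, b, c, and d from the first two characters of each register, with [1] for a match and [0] for a mismatch. It is an illustration only: the function name `comparator_outputs` is hypothetical, and treating a missing character (a string shorter than two characters) as a mismatch is an assumption the text does not address.

```python
def comparator_outputs(reg5, reg6):
    """Return (a, b, c, d) for the two leading characters of each register.

    1 means match, 0 means mismatch. A position past the end of either
    string is treated as a mismatch (an assumption, not from the source).
    """
    def same(x, i, y, j):
        return int(i < len(x) and j < len(y) and x[i] == y[j])

    a = same(reg5, 0, reg6, 0)  # comparator 1: head vs head
    b = same(reg5, 1, reg6, 1)  # comparator 2: second vs second
    c = same(reg5, 0, reg6, 1)  # comparator 3: head of 5 vs second of 6
    d = same(reg5, 1, reg6, 0)  # comparator 4: second of 5 vs head of 6
    return a, b, c, d
```

For example, an adjacent transposition such as `"AB"` against `"BA"` yields a = b = 0 and c = d = 1, which is the pattern the description treats as a swap of one pair of consecutive characters.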

Table 1

That is, when a and b both indicate a match ([1]), two characters clearly match exactly, character for character and in the same order, so the character strings in the upper and lower shift registers 5 and 6 are each advanced by two characters, as indicated by e and f, and the next characters are compared.

When a, b, c, and d all indicate a mismatch (all [0]), the case falls outside the similarity categories. The character strings are judged mismatched at that point, the mismatch determination is raised as the final decision, and the comparison for that character string is discontinued.

When only one of a and b is [1], and when a and b are both [0] but one of c and d is [1], the characters being compared on the registers are taken to be offset from each other by one position. If c is [1], e is set to 1, and if d is [1], f is set to 1, so that one of the shift registers is advanced by one character, and [1] is sent as signal output g to similarity counter 8. If the comparison after this shift gives [1] for both a and b, the registers are again shifted by two characters each and the following characters are compared; this procedure continues to the end of the character string.

When one of a and b is [1] and c and d are both [0], the case is judged to be a one-character difference: 1 is added to similarity counter 8, e and f are both set to 1, the characters in shift registers 5 and 6 are each advanced by one character, and the next comparison is performed.

When a and b are both [0] and c and d are both [1], the case is judged to be a swap in the order of one pair of consecutive characters: 1 is added to similarity counter 8, e and f are both set to 1, shift registers 5 and 6 are each advanced by one character, and the next comparison is performed.

This character-by-character comparison is repeated until the character strings end or the count accumulated in the similarity counter exceeds a preset number (in this embodiment, an accumulated value of 2 or more means a mismatch). When the comparison of one character string has ended (by completion or by being cut off), a count of 0 in similarity counter 8 is judged an exact match and a count of 1 a semi-match. For example, evaluation bit [0] is appended to an exact-match character string and evaluation bit [1] to a semi-matching one, and the character string in the register that held the string under comparison is stored in a memory (not shown) that holds retrieved character strings. The next character string to be compared is then loaded into the shift register and the next comparison is performed.

Although the cutoff limit of the similarity counter is set to 2 here, it may be set to 3 or more depending on the purpose. Semi-matches can then be subdivided into matches of similarity 1, matches of similarity 2, and so on, and if identification bits are appended in the same way, the basis of each determination can be traced afterwards.
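The final step, turning the counter value into a verdict and tagging retrieved strings with an evaluation bit, can be sketched as follows. The evaluation bits (0 for an exact match, 1 for a semi-match) and the default cutoff of 2 come from the text; the names `verdict` and `store_if_retrieved`, and the use of a Python list to stand in for the result memory, are assumptions made for illustration.

```python
def verdict(count, cutoff=2):
    """Classify a finished comparison from the similarity counter value.

    With the default cutoff of 2: 0 is an exact match, 1 a semi-match,
    and 2 or more a mismatch, as in the first embodiment.
    """
    if count == 0:
        return "match"
    if count < cutoff:
        return "semi-match"
    return "mismatch"


def store_if_retrieved(candidate, count, results, cutoff=2):
    """Append (string, evaluation bit) to results unless it is a mismatch.

    Evaluation bit 0 marks an exact match and 1 a semi-match.
    """
    outcome = verdict(count, cutoff)
    if outcome == "mismatch":
        return
    results.append((candidate, 0 if outcome == "match" else 1))
```

Raising `cutoff` to 3 or more admits counts of 1 and 2 as semi-matches; to keep them distinguishable, as the text suggests, the count itself could be stored in place of the single evaluation bit.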

To shorten the evaluation procedure, this example uses the similarity counter to decide whether the outcome is already settled and ends the comparison immediately; the comparison may of course instead be continued to the end of the character string.

FIG. 2 is an explanatory diagram of another embodiment of the present invention. Reference numerals 11 and 14 are comparators, 15 and 16 are shift registers, 17 is a determination circuit, M is a 1-bit memory provided in determination circuit 17, and 18 is a similarity counter. Two of the comparators of FIG. 1 are omitted, so determination circuit 17 receives only two comparison results, a and d. Apart from the criteria for issuing the evaluation output g or h and the criteria for controlling the shifting of shift registers 15 and 16 according to the result, the embodiment serves basically the same purpose as FIG. 1, and the reference numerals are those of FIG. 1 increased by 10.

The truth table of the determination circuit is shown below as Table 2.

The notation, with a match written as [1] and a mismatch as [0], and the notation indicating the shift amounts follow those of Table 1.

Table 2

As in the embodiment of FIG. 1, shift registers 15 and 16 hold the comparison character string and the character string to be compared with their heads aligned at the left of the figure, and comparators 11 and 14 begin the comparison from this state.

Starting from this state, if output a of comparator 11 is [1] (match), the characters in the same position match regardless of whether output d of comparator 14 is [1] or [0]. Determination circuit 17 therefore judges a match and sets e and f both to 1, that is, advances both shift registers by one character, and the next characters are compared.

If output signal a is [0] and d is [1], the order may be shifted or swapped; either way a similar pattern is possible. The circuit therefore sets e = 1 and f = 2, that is, shift register 15 is shifted once and shift register 16 twice, the next comparison is performed, and output g is set to 1 so that 1 is added to the similarity counter.

If output signals a and d are both [0], a mismatch is possible as well as a similar pattern. The circuit therefore sets e = f = 1, that is, advances the upper and lower shift registers by one character each, and performs the next comparison; in this case it also sets the 1-bit memory M in determination circuit 17 to 1 and adds 1 to similarity counter 18.

The 1-bit memory M is reset if a = [1] in the next comparison.

If the state a = d = [0] has appeared once and set the 1-bit memory M to 1, and a = d = [0] then appears again in the very next comparison within the same character string, the string is taken to contain two mismatches. The 1-bit memory M is returned to 0, and output signal h adds to the similarity counter a value larger than the largest count still judged similar, which forces a mismatch determination; the comparison procedure for that character string is then cut off and ended.

A similarity tolerance limit is preset for the value counted by similarity counter 18, and a mismatch is determined if the count exceeds this limit during the comparison procedure. With this procedure, an exact match at least always gives a = [1], so the value of similarity counter 18 is zero; one extra or one missing character gives a similarity of 1 or 2; and a one-character difference between strings of the same length in the same order gives a similarity of 1 or 2. The classification ability is therefore not identical to that of the first embodiment, but as far as matching and similar character strings are concerned, the same function is obtained with simpler comparison means.
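The second embodiment's rules can be traced in software with a small simulation. This is a sketch under several assumptions that the surviving text does not settle: `d` is taken to compare the head of register 15 with the second character of register 16 (as the third embodiment describes for its own `d`), the tolerance defaults to 2, signal h is modelled as adding `tolerance + 1`, the memory M is left unchanged by the `d`-match case, and each character left over when one string runs out adds 1 to the counter. The function name `compare_embodiment2` is hypothetical.

```python
def compare_embodiment2(reg15, reg16, tolerance=2):
    """Simulate the a/d decision rules; return (verdict, counter)."""
    i = j = 0            # read positions in registers 15 and 16
    counter = 0          # similarity counter 18
    m = 0                # 1-bit memory M
    h = tolerance + 1    # assumed weight of the mismatch signal h
    while i < len(reg15) and j < len(reg16):
        a = reg15[i] == reg16[j]
        d = j + 1 < len(reg16) and reg15[i] == reg16[j + 1]
        if a:                      # same-position match: e = f = 1
            m = 0
            i, j = i + 1, j + 1
        elif d:                    # offset match: e = 1, f = 2, g = 1
            counter += 1
            i, j = i + 1, j + 2
        elif m:                    # second consecutive a = d = 0: signal h
            m = 0
            counter += h
        else:                      # first a = d = 0: e = f = 1, set M
            m = 1
            counter += 1
            i, j = i + 1, j + 1
        if counter > tolerance:    # tolerance exceeded: cut off
            break
    else:
        # Assumption: every unconsumed trailing character counts as 1.
        counter += (len(reg15) - i) + (len(reg16) - j)
    if counter == 0:
        return "match", counter
    if counter <= tolerance:
        return "semi-match", counter
    return "mismatch", counter
```

Under these assumptions the simulation reproduces the behaviour the description states: an exact match leaves the counter at zero, while one extra or missing character, or one differing character in strings of equal length, yields a count of 1 or 2.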

As noted earlier, trying to retrieve similar character strings in this kind of comparison is known to bring in some unnecessary data (data noise). It is also true that there are character strings whose meaning changes greatly when only one character differs or only the order differs. Although such similar groups can be tolerated to some extent for purposes such as retrieval, the most blatant cases are better excluded in advance if they can be removed easily.

One typical kind of character string that gives rise to such cases, in which the meaning differs greatly, is numerals: in positional notation, not even a shift in order can be allowed.

Another typical case is one in which a negating prefix intervenes, as in pairs such as 「アンテイ」 and 「フアンテイ」, 「イツチ」 and 「フイツチ」, or 「トウキ」 and 「ヒトウキ」.

If even some of the unnecessary items can be removed from such noise-prone character strings by a relatively simple determination means, that method can be called a practically effective one.

FIG. 3 is an explanatory diagram of a third embodiment, which, for the purpose just described, adds some attribute determination to the embodiment of FIG. 2. The configuration of FIG. 3 is basically similar to that of FIG. 2, and corresponding parts are labelled with numbers in the 20s, that is, the FIG. 2 numerals increased by 10. It differs from FIG. 2 in two respects. First, character type detectors 29 and 30 are provided for the characters at the heads of shift registers 25 and 26, which hold the character strings with their heads aligned at the left for comparison by comparator 21; the character type outputs of the two detectors are denoted j and k respectively. Second, the determination logic of determination circuit 27 is somewhat different.

The operation truth table for FIG. 3 is shown as Table 3.

Table 3

The notation in the table is basically the same as in Tables 1 and 2. For output j of character type detector 29 and output k of detector 30, the explanation here takes [1] to mean that a numeral was detected and [0] to mean any other case.

Under these settings, the comparison in FIG. 3 proceeds as follows. When output a of comparator 21 is [1], there is a match regardless of output d, so, as in FIG. 2, the shift registers are each advanced by one character and the next comparison is performed. The character type results j and k are irrelevant in this case.
Regardless, it is a match, so as in the case of FIG. 2, one /7 register is sent as a match and the next comparison is made. Note that in this case, the result J of character type discrimination is also irrelevant.

Even when output a is [0], if output d is [1] the head character of shift register 25 matches the next character of shift register 26. If signal k (the character type of the head character of shift register 26) is [0] (not a numeral), the circuit therefore sets e = 1 and f = 2, advancing shift register 26 by one extra character, so that comparator 21 next compares the second character of shift register 25 with the third character of shift register 26. The character types are determined at the same time, and 1 is added to similarity counter 28.

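When signal a is [0] and signals j and k are both [1] (numerals), a mismatch has occurred between two characters of the numeral type. Regardless of the other states, mismatch signal h is sent to the similarity counter, further comparison is cut off, the mismatch evaluation is taken as settled, and the comparison between those character strings ends.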
In the case of (number), since there was a discrepancy between the character types called numbers, a discrepancy signal was sent to the similarity counter regardless of other conditions, and subsequent comparisons were eliminated and the discrepancy evaluation was confirmed. The comparison between the strings ends.

When signal a is [0], signal k is [1], and signal j is [0], the second character of shift register 25 may match the head character of shift register 26. The circuit therefore sets e = 1 and f = 0, that is, advances only shift register 25 by one character, performs the next comparison, adds 1 to similarity counter 28, and sets the 1-bit memory M to [1].

If signal a is again [0] in the next comparison, two mismatches have occurred in a row, so the mismatch evaluation is taken as settled and further comparison is cut off, just as in the example of FIG. 2. Likewise, as in FIG. 2, the 1-bit memory M is reset if signal a is [1] in the next comparison.

When signals a, d, j, and k are all [0] (the characters are not numerals and do not match), the possibility of a one-character difference is likewise taken into account: the circuit sets e = 1 and f = 0, that is, advances only shift register 25 by one character, performs the next comparison, sets the 1-bit memory M to [1], and adds 1 to similarity counter 28.

When signals a and d are both [0] and signal j is [1] (a numeral), the characters clearly mismatch as numeric data, so in this case too mismatch signal h is sent to similarity counter 28 and the mismatch determination is settled.

To summarize, whenever signal a is other than [1], similarity counter 28 receives an added value through signal g or h. If the mismatch signal h, used when the outcome is settled, is set so that by itself it exceeds the evaluation level applied to the accumulated value of the similarity counter, then matching character strings can be identified as those with similarity zero and semi-matching character strings as those with similarity at or below the evaluation level. In addition, by adding a character type element to the comparison categories and varying the weight according to the character type of the characters being compared, overly blatant data noise can be prevented.
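Because Table 3 itself is not reproduced here, the decision rules described in the prose for the third embodiment can be collected into one function as a cross-check. The sketch below transcribes those rules as read from the text; it is not the original truth table. The names `Step` and `decide`, the default tolerance of 2, and modelling signal h as `tolerance + 1` are assumptions, and the handling of a second consecutive mismatch through memory M is left to the caller.

```python
from typing import NamedTuple


class Step(NamedTuple):
    e: int        # shift applied to register 25
    f: int        # shift applied to register 26
    add: int      # value added to similarity counter 28 (g = 1, h > tolerance)
    set_m: bool   # whether the 1-bit memory M is set
    abort: bool   # whether the comparison is cut off as a mismatch


def decide(a, d, j, k, tolerance=2):
    """One decision of circuit 27 from comparator and character type outputs.

    a, d: comparator outputs (1 = match). j, k: 1 if the head character of
    register 25 / register 26 is a numeral.
    """
    h = tolerance + 1                        # assumed weight of signal h
    if a:
        return Step(1, 1, 0, False, False)   # match; j and k are irrelevant
    if j and k:
        return Step(0, 0, h, False, True)    # numeral vs numeral mismatch
    if d and not k:
        return Step(1, 2, 1, False, False)   # offset match, head of 26 not a numeral
    if k and not j:
        return Step(1, 0, 1, True, False)    # advance register 25 only, set M
    if not j:
        return Step(1, 0, 1, True, False)    # a = d = j = k = 0
    return Step(0, 0, h, False, True)        # a = d = 0 with a numeral in register 25
```

Enumerating all eight combinations of d, j, and k with a = 0 shows that each falls under exactly one of the cases described in the text, so the prose is at least internally complete even without the table.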

In the embodiment of FIG. 3, the character type categories distinguish numerals from non-numerals by assigning them [1] and [0] respectively, but negating prefixes and the like, which form an antonym with a single character, can clearly be placed in the same group as numerals.

<Effects> As explained above, the present invention provides a character string comparison device that compares matches and mismatches one character at a time, without complex evaluation of grammatical relationships or the like. Despite its very simple category structure, the device can extract with relative flexibility the similar word strings that arise from differences in the kana characters chosen when writing, without letting in much data noise.

[Brief explanation of drawings]

FIG. 1 is an explanatory diagram of one embodiment of the present invention, FIG. 2 is an explanatory diagram of another embodiment of the present invention, and FIG. 3 is an explanatory diagram of the third embodiment of the present invention. In the figures, 5, 15, 25 and 6, 16, 26 are the upper and lower shift registers, respectively; 7, 17, and 27 are determination circuits; 8, 18, and 28 are similarity counters; 29 and 30 are character type detectors; M is a 1-bit memory; and a, b, c, d, e, f, g, h, j, and k are the symbols attached to the respective signal outputs.

Claims

[Scope of Claims]

1) A character string comparison device comprising: first and second shift registers that store a comparison character string and a character string to be compared, character by character; a comparator that compares the characters held in the first and second shift registers in the same order; a comparator that compares characters offset from each other by one position; a determination circuit that determines, from these comparison results, a match, mismatch, or similarity for each character; a similarity counter that accumulates the number of similarity determinations; and shift means for the shift registers; wherein, as a result of comparing the character strings by comparing characters while shifting them sequentially, character strings that match in the same order and with the same length are determined to be matching character strings, and character strings whose accumulated similarity is at or below a predetermined value are determined to be semi-matching character strings.

2) The character string comparison device according to claim 1, further comprising, in addition to the comparators, a pair of character type detectors that detect the character types of the same-order characters, wherein, when the comparison result of the same-order comparator is a mismatch and the mismatched characters are of a specific character type, a mismatch count larger than the number of similarity determinations is registered in the similarity counter.

3) The character string comparison device according to claim 1 or 2, wherein the determined matching character strings and semi-matching character strings are stored in a form in which they can at least be distinguished.