JPS6289134A

JPS6289134A - Character string difference extracting method and its device

Info

Publication number: JPS6289134A
Application number: JP60228742A
Authority: JP
Inventors: Meikai Nakamura; 中村　明海
Original assignee: Nippon Steel Corp
Current assignee: Nippon Steel Corp
Priority date: 1985-10-16
Filing date: 1985-10-16
Publication date: 1987-04-23
Also published as: JPH0574858B2

Abstract

PURPOSE: To make the work of processing and maintaining documents and computer programs more efficient by searching a first and a second character string for common substrings, successively deleting the common substrings that meet a criterion, and extracting the remaining non-matching parts as the differences. CONSTITUTION: When the length of a found common substring is equal to or greater than a reference length specified beforehand, the area of comparison pairs covered by the elements of that common substring no longer needs to be scanned, so the scanning start position is updated; when the length is below the reference length, the next scanning position is obtained from a scanning position generation unit 301. Using the common substrings obtained by the search, a search result determination unit 303 divides the two given character strings into three parts, determines whether each substring is a match or a difference, and stores the result in a determination result storage unit 202. This procedure is repeated, and processing is judged complete once every divided substring has been determined. Finally, a difference extraction unit 305 takes the difference substrings stored in the determination result storage unit 202 and outputs the differences to a CRT/printer drive unit 400 in three forms (addition, deletion, and replacement), which completes the whole process.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a method and apparatus for extracting differences between character strings, for use in information processing with electronic computers. The method and apparatus concern the comparison of similar character strings and the extraction of their differences, which forms part of the work of processing and maintaining character strings, such as documents written in everyday language and computer programs, stored on information storage media.

[Prior art and problems to be solved by the invention]

The conventional method of comparing character strings takes the elements subject to addition, deletion, and correction (characters, lines, fields, and so on) that make up the strings, compares them while shifting the pairing as needed, treats elements that match as common elements, repeats this, and takes whatever finally remains unmatched as the difference.

For example, when the first character string "アイウエ" and the second character string "アイエ" are compared character by character, the method finds that "ア" equals "ア", "イ" equals "イ", and "エ" equals "エ", and so concludes that "the second character string is the first character string with 'ウ' removed."

This is how the differences are extracted.

However, the conventional method has the following problem.

It gives appropriate results for character strings designed so that the same character never recurs. Ordinary written documents, computer programs, and the like, however, almost always consist of elements that occur many times, whether the unit is the character or the line, so the conventional method tends to capture the differences in a roundabout and inappropriate way.

For example, when the first character string "イヌガイル" ("there is a dog") and the second character string "ネコガイル" ("there is a cat") are compared character by character, the first character "イ" of the first string is paired with the fourth character "イ" of the second string as a common element. The method therefore concludes that "the second character string is the first character string with 'ネコガ' added at the front and 'ヌガイ' removed."

The proper, concise expression of the difference is "'イヌ' was corrected to 'ネコ'", so the above result is not appropriate.

Thus, because the conventional method cannot give appropriate results when comparing character strings that contain many identical elements, such as ordinary documents and computer programs, mechanizing the work of processing and maintaining such strings has been considered impossible, and the work places a heavy burden on people.

In view of the above problem, one object of the present invention is to obtain by machine, efficiently, a concise expression of the differences between a character string made up of many identical elements and a similar character string, and thereby to greatly improve the efficiency of the work of processing and maintaining documents and computer programs, which until now could only be done by hand and with low efficiency.

Another object of the present invention is to carry out, with high efficiency, comparisons of very large character strings that would take a great deal of labor by hand, and to extract the character string differences quickly and accurately.

[Means for Solving the Problems, and Operation]

In its basic form, the present invention provides a character string difference extraction method characterized by: searching for common substrings by scanning a first character string made up of individual elements and a second character string obtained by modifying the first, from the ends toward the center and from elements at nearby positions to elements at distant positions; successively deleting from the two character strings those found common substrings whose length meets a predetermined criterion; and extracting the non-matching parts that remain after this deletion as the differences.

In another form, the present invention provides a character string difference extraction apparatus comprising: document storage means for storing a first character string made up of individual elements and a second character string obtained by modifying the first; common substring search and determination means for searching for common substrings by scanning the first and second character strings read from the document storage means from the ends toward the center and from elements at nearby positions to elements at distant positions, and for determining the search results; means for successively deleting common substrings whose length meets a predetermined criterion; and difference extraction means for extracting the non-matching parts that remain after this successive deletion as the differences between the first and second character strings.

In the present invention, the following means are used to extract concisely and quickly the parts of an original character string that have been changed by addition, deletion, correction, or similar work, that is, the differences.

The present invention searches for the longest substring common to the two given character strings, or one that is sufficiently long, namely at least a reference length (hereinafter, a common substring), and divides each string at the common substring found. Each given character string is thereby divided into three parts, giving three pairs: the pair of matched common substrings is determined to be a "match", and the two remaining pairs, before and after the common substring, are put on hold. The procedure is applied recursively to the pairs on hold, dividing each at its longest common substring, and is repeated, with each pass determining at least one element to be a "match" or a mismatch, that is, a "difference".

Because each pass of this process determines at least one element to be a "match" or a "difference", the procedure eventually stops with every element determined, and deleting the matched substrings leaves only the substrings that are differences.

By pairing the longest or sufficiently long substrings first in this way, appropriate differences can be extracted even from character strings that contain many identical elements.

[Example]

Before the embodiment of the present invention is described in detail, the principle of the invention is explained with reference to Figs. 2 and 3.

Fig. 2 is an explanatory diagram of the overall procedure, and Fig. 3 is an explanatory diagram of the procedure for searching for common substrings.

Fig. 2 shows an example, with characters as the elements, in which the first character string "アイウエウエオイオイ", which is built from recurring elements, is compared with a similar second character string "アイエウエオアオイエイ".

The overall procedure searches for the common substrings present in both character strings and adopts the longest of them, "エウエオ" (P3), as the first common substring.

This divides both character strings, so the next strings to be compared form two pairs: "アイウ" versus "アイ" at the front, and "イオイ" versus "アオイエイ" at the rear.

Next, in the front pair, the longest common substring "アイ" (P1) is matched; no further common substring remains, so processing of the front pair ends.

The same procedure is applied to the rear pair: the longest common substring "オイ" (P6) is matched and the pair is divided, and since neither the resulting front pair nor the resulting rear pair contains a common substring any longer, the whole process stops.

As a result, the second character string is found to be the first character string with "ウ" (P2) removed, "イ" (P4) replaced by "ア" (P5), and "エイ" (P7) added.
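The overall procedure of Fig. 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it always splits at the true longest common substring (no reference length is used), it finds that substring with a simple dynamic-programming table instead of the scan of Fig. 3, and the function names and the result layout `(kind, i, first_part, j, second_part)` are hypothetical. Positions `i` and `j` are 1-based, as in the text.

```python
def longest_common_substring(a, b):
    """Return 0-based (start_in_a, start_in_b, length) of the longest common run.

    Ties are resolved in favour of the run found first; (0, 0, 0) means none.
    """
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the run ending at (i-1, j-1)
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def extract_differences(first, second, i0=1, j0=1):
    """Split at the longest common substring, then recurse on front and rear.

    i0 and j0 are the 1-based positions of first[0] and second[0] in the
    original strings. Returns a list of (kind, i, first_part, j, second_part).
    """
    if not first and not second:
        return []  # both sides empty: nothing left to decide
    i, j, n = longest_common_substring(first, second)
    if n == 0:
        # No common substring remains, so this pair is a difference.
        kind = "delete" if not second else "add" if not first else "replace"
        return [(kind, i0, first, j0, second)]
    front = extract_differences(first[:i], second[:j], i0, j0)
    rear = extract_differences(first[i + n:], second[j + n:],
                               i0 + i + n, j0 + j + n)
    return front + rear
```

For the two strings of Fig. 2, this sketch reports the deletion of "ウ" at i = 3, the replacement of "イ" (i = 8) by "ア" (j = 7), and the addition of "エイ" starting at j = 10.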

Next, the common substring search within the overall procedure is explained in detail with reference to Fig. 3. In Fig. 3, the vertical axis is the sequence of the first character string, the horizontal axis is the sequence of the second character string, and each cell where they cross represents a pair of elements.

In Fig. 3, the solid arrows show the scanning order, which runs from the ends toward the center and from nearby pairs to distant pairs, and which only supplies the starting positions for the common substring search. The dashed arrows show the search order for a common substring: starting from a position supplied by the scan, the sequence of comparison pairs obtained by advancing one step at a time along both the vertical and the horizontal axis.

A ○ mark indicates that the search found the elements of a pair to be equal, and an × mark indicates that they were not equal. In this example, the reference length for a sufficiently long common substring was set in advance to three characters or more.

Scanning starts from element pair 10 at the upper-left corner, which is an end. The common substring search then starts from this point: since the elements of pair 10 are equal, both positions advance by one to element pair 11, which is also found to be equal. No further common element follows, so the common substring at this scanning position is recognized as two characters long.

In the example of Fig. 3 the reference length is set to three characters or more, so the common substring "アイ" (P1) found here falls short of the reference length and is put on hold. The scanning position goes back and moves on, in the order from the ends toward the center and from nearby pairs to distant pairs, to the next element comparison pair 12.

Common element 13 is found next, but the common substring that follows it also falls short of the reference length, so its matching is put on hold and the scan moves on to the next element comparison pair 14.

The common substring that follows common element 15, which is found next, is four characters long, which meets the reference length, so "エウエオ" (P3) is recognized as a common substring.

As a result, comparisons that straddle this common substring are no longer needed, so the hatched area 16 no longer has to be scanned, and the next scanning start position advances to position 17, one past the end of the common substring "エウエオ" (P3) in each string.

Using a reference length in this way greatly reduces the scanning range and so provides a means of obtaining results quickly.

The remaining lower-right area is then scanned in the same way. Common element 18 is found, and when the following common element 19 is found the first character string is exhausted, so the common substring search ends.

In this way, the four common substrings present in both the first and the second character string, whose first common elements are 10, 13, 15, and 18, are found quickly.
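The scan of Fig. 3 can be approximated by the Python sketch below. Since the figure itself is not reproduced in the text, several details are assumptions made for illustration: start pairs are visited by anti-diagonal from the corner of the current region, pairs nearest the main diagonal first (ties broken by the smaller first-string position); a pair that lies in the middle of a run already measured in the region is skipped; and the search stops as soon as a run reaches the end of either string, modelled on the walkthrough above. All names (`run_length`, `scan_order`, `find_common_runs`, `ref_len`) are hypothetical. Runs are reported as 1-based `(i, j, length)`.

```python
def run_length(first, second, i, j):
    """Length of the common run that starts at the 1-based pair (i, j)."""
    n = 0
    while (i + n <= len(first) and j + n <= len(second)
           and first[i + n - 1] == second[j + n - 1]):
        n += 1
    return n


def scan_order(i0, j0, m, n):
    """Yield the pairs of region (i0..m, j0..n) in the assumed visiting order."""
    for d in range((m - i0) + (n - j0) + 1):
        pairs = [(i0 + k, j0 + d - k) for k in range(d + 1)
                 if i0 + k <= m and j0 + d - k <= n]
        # Nearest to the region's main diagonal first, then smaller i.
        yield from sorted(
            pairs, key=lambda p: (abs((p[0] - i0) - (p[1] - j0)), p[0]))


def find_common_runs(first, second, ref_len):
    """Collect common runs; jump past any run of at least ref_len elements."""
    runs = []
    i0, j0 = 1, 1
    m, n = len(first), len(second)
    while i0 <= m and j0 <= n:
        jumped = False
        for i, j in scan_order(i0, j0, m, n):
            if first[i - 1] != second[j - 1]:
                continue
            # Skip a pair in the middle of a run already measured in this region.
            if i > i0 and j > j0 and first[i - 2] == second[j - 2]:
                continue
            length = run_length(first, second, i, j)
            runs.append((i, j, length))
            if i + length > m or j + length > n:
                return runs  # a string is exhausted: stop (assumed rule)
            if length >= ref_len:
                # The area spanned by this run need not be scanned any more.
                i0, j0 = i + length, j + length
                jumped = True
                break
        if not jumped:
            break  # region fully scanned without a sufficiently long run
    return runs
```

With the strings of Fig. 3 and a reference length of 3, this sketch returns four runs, starting at (1, 1), (3, 4), (4, 3), and (9, 8) with lengths 2, 2, 4, and 2. Whether the second run coincides with the figure's common element 13 cannot be confirmed from the text.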

In the example of Fig. 3, the principle of recognizing and deleting common substrings with priority given to the longest is retained, but to carry it out more efficiently, a common substring that extends over at least a certain specified length (the reference length) is judged to be the longest and processing proceeds on that basis. The specified length set in advance, that is, the reference length, is determined as follows.

The reference length only needs to be at least a length at which a match can be judged genuine, but if it is too long the efficiency of obtaining results suffers. It is therefore normally chosen, as appropriate, between about 10 times the element that makes up the target character strings and about 1/2 of the range of the strings being compared. In the inventor's experiments, good results were obtained for comparing computer programs when the reference length was set to about 20 to 100 lines.

The following points can be drawn from the examples of Figs. 2 and 3.

(1) Matching, dividing, and deleting on the basis of the longest common substring present in the two given character strings gives appropriate results even when each of the elements making up the strings occurs many times.

(2) Deleting common substrings in order from the longest settles that range, so the task reduces to the two comparison problems that remain, and the unresolved range shrinks with each step.

(3) Scanning for common elements from the ends toward the center and from nearby pairs to distant pairs finds common elements early. This is because similar character strings arise mainly from an accumulation of partial additions, deletions, and corrections to an original, so there is a high probability that a common substring lies near an end or among nearby pairs.

(4) In addition to the means described in (2) and (3), a common substring is also judged sufficiently long when it reaches the predetermined reference length, which greatly reduces the scanning range, that is, the area of comparisons needed to obtain the result. If a substring were judged sufficiently long only by being the longest, guaranteeing that it is the longest would in the end require comparing elements for as many combinations as the product of the numbers of elements in the two character strings, and the benefits of (2) and (3) would be lost.

Fig. 1 shows an apparatus that carries out the character string difference extraction method as one embodiment of the present invention. The apparatus of Fig. 1 takes as its input medium a floppy disk 100a on which documents created with a word processor are recorded, and outputs the resulting differences to a CRT 500 or a printer 600. The apparatus comprises a floppy input unit 100, a document storage unit 200, a common substring storage unit 201, a determination result storage unit 202, a scanning position generation unit 301, a common substring search unit 302, a search result determination unit 303, a common substring deletion unit 304, a difference extraction unit 305, a CRT/printer drive unit 400, the CRT 500, and the printer 600.

The operating procedure of the apparatus of Fig. 1 is explained with reference to the flowchart of Fig. 4.

Step 1 (S0 to S2): First, the floppy input unit reads the first and second documents whose differences are to be extracted (S1) and stores them in the document storage unit (S2).

Step 2 (S3 to S9): Next, with the characters of the two documents as the comparison elements, and until the scan is determined to be complete (S3), the common substring search unit searches for a common substring starting from the element pair position supplied by the scanning position generation unit (S4). If the substring length is determined to be greater than zero (S5), the substring is stored in the common substring search result storage unit (S6). If the substring length is then determined to be equal to or greater than the predetermined reference length (S7), the whole area of comparison pairs covered by the elements of this common substring no longer needs to be scanned, so the scanning start position is updated to one past the end of the common substring in each string (S8).

Otherwise, the next scanning position is obtained from the scanning position generation unit (S9).

Step 3 (S10 to S12): Using the common substrings obtained by the search, the search result determination unit divides the two given character strings into three parts, determines whether each substring is a match or a difference (S10), and stores the result in the determination result storage unit (S11). This procedure is repeated, and processing is judged complete once every divided substring has been determined (S12).

Step 4 (S13 to S15): Finally, the difference extraction unit takes the difference substrings stored in the determination result storage unit, expresses the differences in three forms, addition, deletion, and replacement (S13), outputs them to the CRT/printer drive unit (S14), and ends the whole process (S15).

A concrete example of processing the strings of Fig. 3 with the above steps follows.

First, following Step 1, the first document is stored in the document storage unit as the first character string and the second document as the second character string. As a result, the first character string is stored as shown in Table 1 and the second character string as shown in Table 2.

Next, in Step 2, the substrings common to the two character strings are searched for and stored in the common substring storage unit. The result is shown in Table 3.

In Step 3, the two given character strings are divided at the common substrings obtained in Step 2, the match-or-difference determination is repeated on the divided substrings, and the divided substrings and their determination results are stored in the determination result storage unit. This division process is shown step by step in Tables 4 to 9.

Initially, the whole of both character strings, from beginning to end, is stored as substring A with the determination result "on hold", as shown in Table 4.

Table 4

Next, item 3 of Table 3, the longest of the common substrings found in Step 2, is adopted to divide substring A. Substring A is thereby divided into three substrings B, C, and D; C is determined to be a match, and the pairs before and after it, B and D, are put on hold. The result of the division is shown in Table 5.
Subsequence A is divided into three subsequences B, C, and D. C is determined to be a match, and the preceding and succeeding pairs B and D are determined to be reserved. The division results are shown in Table 5.

Next, attention turns to substring B, which is on hold. Searching Table 3 for common substrings that lie within the range of this substring shows that item 1 is the one that satisfies the condition, so it is adopted and substring B is divided into the three pairs E, F, and G shown in Table 6. Substring C has been determined to be a match and has therefore been deleted from Table 6.

Table 6

Here, substring E is deleted because it is the empty set (0) in both character strings.

Substring G cannot be divided any further because its second character string is the empty set (0). Since "ウ" (P2) at element position i = 3 of the first character string is absent from the second character string, it is determined to be a deleted difference. As a result, substrings G and D, shown in Table 7, remain in the determination result storage unit.

Table 7

The division at a common substring is applied again to the substrings whose determination result is on hold; in this case only one held substring, D, remains in the determination result storage unit. As before, the common substring storage unit (Table 3) is searched for common substrings contained within the range of substring D. Here item 4 is adopted and D is divided into the three pairs H, I, and J shown in Table 8.

Table 8

Here, substring J cannot be divided any further because its first character string is the empty set (0). Since "エイ" (P7) at element positions j = 10 to 11 of the second character string is absent from the first character string, it is determined to be an added difference. Next, substring I is a match and is therefore deleted from the determination result storage unit. The search finds no common substring within the range of substring H, so substring H is determined to be a difference. The result is shown in Table 9.

Table 9

At this point no substring with an on-hold determination result remains, so the differences of substrings G, H, and J are extracted and the difference extraction process ends.
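The division process of Tables 4 to 9 can be sketched as a work list in Python. This is an illustration under assumptions, not the apparatus itself: the stored common substrings are passed in as 1-based `(i, j, length)` triples (the four values used in the usage note are inferred from the walkthrough, since Table 3 is not reproduced in the text), each pending pair is an inclusive range `(i_lo, i_hi, j_lo, j_hi)` that is empty when the upper bound is below the lower bound, and the name `split_with_runs` is hypothetical.

```python
def split_with_runs(runs, len_first, len_second):
    """Divide the pending ranges at stored common runs until none is pending.

    Returns the ranges judged to be differences, sorted by position.
    """
    pending = [(1, len_first, 1, len_second)]  # substring A: everything on hold
    differences = []
    while pending:
        i_lo, i_hi, j_lo, j_hi = pending.pop(0)
        if i_lo > i_hi and j_lo > j_hi:
            continue  # empty on both sides (like substring E): drop it
        # Stored runs that lie entirely inside this pending range.
        inside = [r for r in runs
                  if i_lo <= r[0] and r[0] + r[2] - 1 <= i_hi
                  and j_lo <= r[1] and r[1] + r[2] - 1 <= j_hi]
        if not inside:
            differences.append((i_lo, i_hi, j_lo, j_hi))  # judged a difference
            continue
        i, j, n = max(inside, key=lambda r: r[2])  # longest run in range
        # The run itself is a match; the parts before and after stay on hold.
        pending.append((i_lo, i - 1, j_lo, j - 1))
        pending.append((i + n, i_hi, j + n, j_hi))
    return sorted(differences)
```

Called with the runs `(1, 1, 2)`, `(3, 4, 2)`, `(4, 3, 4)`, `(9, 8, 2)` and string lengths 10 and 11, the sketch leaves three difference ranges that correspond to substrings G, H, and J: `(3, 3, 3, 2)`, `(8, 8, 7, 7)`, and `(11, 10, 10, 11)`.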

Finally, in Step 4, the difference extraction unit takes each of the difference substrings G, H, and J held in the determination result storage unit and outputs the differences to the CRT screen or the printer through the CRT/printer drive unit.

For substring G, the second character string is the empty set (0) and the character "ウ" (P2) is present at element position i = 3 of the first character string, so "ウ" is displayed as deleted.

For substring H, characters are present in both character strings, so the character "イ" (P4) at element position i = 8 of the first character string is displayed as replaced by the character "ア" (P5) at element position j = 7 of the second character string.

For substring J, the first character string is the empty set (0) and the characters "エイ" (P7) are present at element positions j = 10 to 11 of the second character string, so "エイ" is displayed as added.
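The three output forms of Step 4 follow directly from which side of a difference substring is empty. A minimal Python sketch, assuming each difference is given as an inclusive 1-based range `(i_lo, i_hi, j_lo, j_hi)` that is empty when the upper bound is below the lower bound; the function name `describe` and the message wording are hypothetical:

```python
def describe(first, second, rng):
    """Express one difference range as an addition, deletion, or replacement."""
    i_lo, i_hi, j_lo, j_hi = rng
    old = first[i_lo - 1:i_hi]    # part of the first string (may be empty)
    new = second[j_lo - 1:j_hi]   # part of the second string (may be empty)
    if old and new:
        return (f"replace: first[{i_lo}..{i_hi}] '{old}' "
                f"-> second[{j_lo}..{j_hi}] '{new}'")
    if old:
        return f"delete: first[{i_lo}..{i_hi}] '{old}'"
    return f"add: second[{j_lo}..{j_hi}] '{new}'"
```

Applied to the ranges of substrings G, H, and J in the example, this yields one deletion ("ウ"), one replacement ("イ" by "ア"), and one addition ("エイ").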

[Effects of the Invention]

According to the present invention, the parts of a document written in everyday language or of a computer program that have been added, deleted, or corrected can be identified mechanically by comparing the text after the changes, rather than by following the course of the changes. The invention greatly improves the efficiency of the work of processing and maintaining documents and computer programs, which until now could only be done by hand and with low efficiency. It also allows comparisons of very large character strings, which take a great deal of labor by hand, to be carried out with high efficiency, so that character string differences can be extracted quickly and accurately.

[Brief Description of the Drawings]

Fig. 1 is a block diagram of an apparatus that carries out the character string difference extraction method as one embodiment of the present invention. Fig. 2 is an explanatory diagram for explaining the present invention. Fig. 3 is an explanatory diagram of the common substring search procedure. Fig. 4 is a flowchart for explaining the operation of the apparatus of Fig. 1.

100: floppy input unit; 100a: floppy disk; 200: document storage unit; 201: common substring search result storage unit; 202: determination result storage unit; 301: scanning position generation unit; 302: common substring search unit; 303: search result determination unit; 304: common substring deletion unit; 305: difference extraction unit; 400: CRT/printer drive unit; 500: CRT; 600: printer.

Claims

[Claims]

1. A character string difference extraction method characterized by: searching for common substrings by scanning a first character string made up of individual elements and a second character string obtained by modifying the first character string, from the ends toward the center and from elements at nearby positions to elements at distant positions; successively deleting from the two character strings those found common substrings whose length meets a predetermined criterion; and extracting the non-matching parts that remain after the deletion as the differences.

2. The method according to claim 1, wherein the common substring meeting the predetermined criterion is the common substring with the largest number of characters among the common substrings, or a common substring that has reached a predetermined number of characters.

3. A character string difference extraction apparatus comprising: document storage means for storing a first character string made up of individual elements and a second character string obtained by modifying the first character string; common substring search and determination means for searching for common substrings by scanning the first and second character strings read from the document storage means from the ends toward the center and from elements at nearby positions to elements at distant positions, and for determining the search results; means for successively deleting common substrings whose length meets a predetermined criterion; and difference extraction means for extracting the non-matching parts that remain after the successive deletion as the differences between the first and second character strings.