JP2022069194A

JP2022069194A - Similar character string detection apparatus, method, program, and system

Info

Publication number: JP2022069194A
Application number: JP2020178242A
Authority: JP
Inventors: 直登青沼; Naoto Aonuma
Original assignee: Showa Denko KK
Current assignee: Resonac Holdings Corp
Priority date: 2020-10-23
Filing date: 2020-10-23
Publication date: 2022-05-11

Abstract

To provide a similar character string detection apparatus, method, program, and system capable of improving the accuracy of detection of similar character strings. SOLUTION: In a similar character string detection system, a similar character string detection apparatus 10 comprises: a first score calculation unit 104 for calculating a first score by dividing the sum of the lengths of the common character parts between a first character string and a second character string by the longer of the first character string length and the second character string length; a second score calculation unit 105 for calculating a second score by dividing the number of characters of the longest common part between the first character string and the second character string by the shorter of the first character string length and the second character string length; and an output unit 107 for outputting a combination of similar character strings based on the first score or the second score. SELECTED DRAWING: Figure 2

Description

The present invention relates to a similar character string detection apparatus, method, program, and system.

Conventionally, methods of detecting, from among a plurality of character strings, a character string similar to a given character string have been known (Patent Document 1). Examples of the similarity measures used when detecting similar character strings include the Levenshtein distance and the Jaro-Winkler distance.

The Levenshtein distance expresses similarity as the number of edits (insertion, deletion, or replacement of one character) required to turn one character string into the other. The Jaro-Winkler distance expresses similarity by the number of matching characters between the character strings and whether replacement is necessary. The Levenshtein distance and Jaro-Winkler distance methods are therefore suitable for detecting character strings that differ only slightly (for example, by a single character) because of a typo or the like.

Japanese Unexamined Patent Publication No. 2018-36744

However, with the Levenshtein distance and Jaro-Winkler distance methods, it is difficult to detect character strings in which the order of multiple words has been interchanged. For example, suppose that there is a character string "５０％溶液" ("50% solution") and a character string "溶液（５０％）" ("solution (50%)"). Both are intended to mean the same thing, so they should be judged to be similar. With the Levenshtein distance and Jaro-Winkler distance methods, however, character strings in which the word order is interchanged in this way are judged to be completely different character strings.
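The word-order problem described above can be illustrated with a minimal sketch. The `levenshtein` function below is a textbook dynamic-programming implementation written for this illustration (it is not taken from the patent), applied to the two example strings from the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: single-character insert, delete, replace."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace or match
        prev = curr
    return prev[-1]


# The two strings from the text share every word but in swapped order.
distance = levenshtein("５０％溶液", "溶液（５０％）")
print(distance)
```

For strings of 5 and 7 characters this distance is 5, so an edit-distance measure treats the pair as largely unrelated even though they share "５０％" and "溶液".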

Therefore, an object of the present invention is to improve the accuracy of detecting similar character strings.

[1] A similar character string detection device comprising:
a first score calculation unit that calculates a first score by dividing the sum of the lengths of the character strings common to a first character string and a second character string by the longer of the length of the first character string and the length of the second character string; and
an output unit that outputs a combination of similar character strings based on the first score.
[2] The similar character string detection device according to [1], further comprising a second score calculation unit that calculates a second score by dividing the largest number of characters in a portion common to the first character string and the second character string by the shorter of the length of the first character string and the length of the second character string,
wherein the output unit outputs a combination of similar character strings based on the second score.
[3] The similar character string detection device according to [1], wherein the length of each of the character strings common to the first character string and the second character string is 2 or more.
[4] The similar character string detection device according to [2], wherein the largest number of characters in the portion common to the first character string and the second character string is the length of a contiguous character string.
[5] A method executed by a computer, the method including:
a step of calculating a first score by dividing the sum of the lengths of the character strings common to a first character string and a second character string by the longer of the length of the first character string and the length of the second character string; and
a step of outputting a combination of similar character strings based on the first score.
[6] A program for causing a computer to function as:
a first score calculation unit that calculates a first score by dividing the sum of the lengths of the character strings common to a first character string and a second character string by the longer of the length of the first character string and the length of the second character string; and
an output unit that outputs a combination of similar character strings based on the first score.
[7] A system including a similar character string detection device and a user terminal, wherein
the similar character string detection device comprises:
a first score calculation unit that calculates a first score by dividing the sum of the lengths of the character strings common to a first character string and a second character string by the longer of the length of the first character string and the length of the second character string; and
an output unit that outputs a combination of similar character strings based on the first score, and
the user terminal displays the combination of similar character strings.

According to the present invention, the accuracy of detecting similar character strings can be improved.

FIG. 1 is a diagram showing the overall system configuration including the similar character string detection device according to an embodiment of the present invention.
FIG. 2 is a diagram showing the functional blocks of the similar character string detection device according to an embodiment of the present invention.
FIG. 3 is a flowchart showing the flow of the similar character string detection process according to an embodiment of the present invention.
FIG. 4 is a flowchart showing the flow of the first score calculation process according to an embodiment of the present invention.
FIG. 5 is a flowchart showing the flow of the second score calculation process according to an embodiment of the present invention.
FIG. 6 is a flowchart showing the flow of the common substring length calculation process according to an embodiment of the present invention.
FIGS. 7 to 12 are diagrams for explaining the calculation of the common substring length according to an embodiment of the present invention.
Three further diagrams compare the accuracy of similar character string detection.
A final diagram shows the hardware configuration of the similar character string detection device and the user terminal according to an embodiment of the present invention.

Hereinafter, each embodiment will be described with reference to the attached drawings. In the present specification and the drawings, components having substantially the same functional configuration are given the same reference numerals, and duplicate description is omitted.

<Explanation of terms>
- A "character string" is a sequence of one or more characters.
- A "character string group" is a set consisting of a plurality of character strings.

This embodiment describes the case of comparing the character strings included in each of two character string groups (hereinafter referred to as character string group 1 and character string group 2), that is, comparing a character string included in character string group 1 with a character string included in character string group 2. Character string group 1 and character string group 2 may be different character string groups or the same character string group. This embodiment can also be applied to the case of comparing a character string specified by a user or the like with the character strings included in a certain character string group.

<System configuration>
FIG. 1 is a diagram showing the overall system configuration including a similar character string detection device 10 according to an embodiment of the present invention. As shown in FIG. 1, the similar character string detection system 1 includes the similar character string detection device 10 and a user terminal 20. The similar character string detection device 10 can send data to and receive data from the user terminal 20 via an arbitrary network. Each is described below.

The similar character string detection device 10 is a device (for example, a server) that performs processing for detecting similar character strings (hereinafter also referred to as "similar character strings"). Specifically, the similar character string detection device 10 detects, from a character string group, a character string similar to a given character string.

The user terminal 20 is a terminal used by a user who causes the similar character string detection device 10 to detect similar character strings. Specifically, the user terminal 20 accepts the designation of character string groups input by the user, and notifies the similar character string detection device 10 of the accepted designation. The user terminal 20 is, for example, a personal computer.

The similar character string detection device 10 may have some or all of the functions of the user terminal 20.

<Functional block>
The similar character string detection device 10 includes a first score calculation unit 104 and an output unit 107. The similar character string detection device 10 is described in detail below with reference to FIG. 2.

FIG. 2 is a diagram showing the functional blocks of the similar character string detection device 10 according to an embodiment of the present invention. As shown in FIG. 2, the similar character string detection device 10 can include a character string group acquisition unit 101, a character string selection unit 102, a common substring length calculation unit 103, a first score calculation unit 104, a second score calculation unit 105, a similar character string detection unit 106, and an output unit 107. Each of these units is realized by processing that a program installed in the similar character string detection device 10 causes the CPU of the similar character string detection device 10 to execute. Each is described below.

The character string group acquisition unit 101 acquires the two character string groups to be compared (character string group 1 and character string group 2). For example, the character string group acquisition unit 101 acquires two character string groups specified by the user on the user terminal 20. Each character string group may be stored in the similar character string detection device 10 or somewhere other than the similar character string detection device 10.

The character string selection unit 102 selects one character string from each of the character string groups acquired by the character string group acquisition unit 101 (that is, one from character string group 1 and one from character string group 2).

The common substring length calculation unit 103 calculates the lengths of the character strings common to the two character strings selected by the character string selection unit 102 (hereinafter also referred to as "common substring lengths").

The first score calculation unit 104 calculates the first score using the result calculated by the common substring length calculation unit 103.

The second score calculation unit 105 calculates the second score using the result calculated by the common substring length calculation unit 103.

The similar character string detection unit 106 detects similar character strings based on the score calculated by the first score calculation unit 104 or the second score calculation unit 105. Only the first score may be used, or both the first score and the second score may be used. For example, the similar character string detection unit 106 detects combinations of character strings (that is, combinations of similar character strings) in descending order of score.

The output unit 107 outputs the similar character strings detected by the similar character string detection unit 106 (for example, by displaying them on the user terminal 20 or storing them in the similar character string detection device 10). For example, the output unit 107 outputs combinations of character strings (that is, combinations of similar character strings) in descending order of each score.

<Processing method>
FIG. 3 is a flowchart showing the flow of the similar character string detection process according to an embodiment of the present invention.

In step 1 (S1), the character string group acquisition unit 101 acquires the two character string groups to be compared (character string group 1 and character string group 2).

In step 2-1 (S2-1), the character string selection unit 102 selects an unprocessed character string from character string group 1 acquired in S1. Let S1i be the i-th character string of character string group 1.

In step 2-2 (S2-2), the character string selection unit 102 selects an unprocessed character string from character string group 2 acquired in S1. Let S2j be the j-th character string of character string group 2.

In step 3 (S3), the common substring length calculation unit 103 and the first score calculation unit 104 (or the second score calculation unit 105) calculate a score. Specifically, they calculate the score of S1i and S2j.

In step 4 (S4), the first score calculation unit 104 or the second score calculation unit 105 determines whether the scores of all the character strings included in character string group 2 have been calculated. If so, the process proceeds to step 5. If not, the process returns to step 2-2, and the character string selection unit 102 adds 1 to j and selects the (j+1)-th character string of character string group 2.

In step 5 (S5), the similar character string detection unit 106 saves, as the character string(s) corresponding to S1i, one or more high-scoring character strings (for example, the character string with the highest score) among the character strings included in character string group 2.

In step 6 (S6), the first score calculation unit 104 or the second score calculation unit 105 determines whether the scores of all the character strings included in character string group 1 have been calculated. If so, the process proceeds to step 7. If not, the process returns to step 2-1, and the character string selection unit 102 adds 1 to i and selects the (i+1)-th character string of character string group 1.

In step 7 (S7), the similar character string detection unit 106 detects similar character strings based on the scores. Only the first score may be used, or both the first score and the second score may be used. For example, the similar character string detection unit 106 detects combinations of character strings (that is, combinations of similar character strings) in descending order of score.

In step 8 (S8), the output unit 107 outputs the similar character strings detected in S5 and S7.
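The flow of S1 to S8 can be sketched as a nested loop over the two character string groups. This is only an illustration under assumptions: the function name `detect_similar`, the tuple layout of the result, and the stand-in scoring function `char_overlap` are all hypothetical and are not defined in the specification; any score where a larger value means "more similar" could be passed in instead.

```python
from typing import Callable, List, Tuple


def char_overlap(a: str, b: str) -> float:
    """Stand-in score (assumption): shared distinct characters / all distinct characters."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def detect_similar(group1: List[str], group2: List[str],
                   score: Callable[[str, str], float]) -> List[Tuple[str, str, float]]:
    results = []
    for s1 in group1:                                     # S2-1, repeated via S6
        scored = [(score(s1, s2), s2) for s2 in group2]   # S2-2 to S4
        if not scored:
            continue
        best_score, best = max(scored, key=lambda p: p[0])  # S5: keep the top match
        results.append((s1, best, best_score))
    results.sort(key=lambda r: r[2], reverse=True)        # S7: descending score
    return results                                        # S8: caller outputs these


pairs = detect_similar(["５０％溶液", "ethanol"],
                       ["溶液（５０％）", "Ethanol", "water"],
                       char_overlap)
for s1, s2, sc in pairs:
    print(f"{s1!r} ~ {s2!r}: {sc:.3f}")
```

Passing the score as a parameter keeps the loop independent of whether the first score, the second score, or a combination is used.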

Next, the details of S3 in FIG. 3 are described. FIG. 4 is a flowchart showing the flow of the first score calculation process according to an embodiment of the present invention.

In step 11 (S11), the common substring length calculation unit 103 calculates the common substring lengths of S1i (the i-th character string of character string group 1) and S2j (the j-th character string of character string group 2). If S1i and S2j have a plurality of character strings in common, the length of each of those character strings is calculated as a common substring length. A plurality of common substring lengths can therefore be calculated.

In step 12 (S12), the first score calculation unit 104 divides the sum of the common substring lengths calculated in S11 by the longer of the length of S1i and the length of S2j, and takes the resulting value as the first score. A larger first score indicates more similar character strings.
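A minimal sketch of S11 and S12 follows, assuming the table-based definition of common substring lengths given in this specification. The function names and the `min_len` parameter are hypothetical; `min_len=2` corresponds to the variant in item [3], where only common character strings of length 2 or more are counted.

```python
from typing import List


def common_substring_lengths(s1: str, s2: str) -> List[int]:
    """Lengths of the maximal common runs found via the comparison table."""
    rows, cols = len(s2) + 1, len(s1) + 1
    table = [[0] * cols for _ in range(rows)]
    for j, c2 in enumerate(s2):
        for i, c1 in enumerate(s1):
            if c1 == c2:
                table[j + 1][i + 1] = table[j][i] + 1
    lengths = []
    for r in range(1, rows):
        for c in range(1, cols):
            value = table[r][c]
            if value == 0:
                continue
            at_edge = r + 1 == rows or c + 1 == cols  # no lower-right cell
            if at_edge or table[r + 1][c + 1] == 0:   # the run ends here
                lengths.append(value)
    return lengths


def first_score(s1: str, s2: str, min_len: int = 1) -> float:
    longer = max(len(s1), len(s2))
    if longer == 0:
        return 0.0  # assumption: two empty strings score 0
    total = sum(n for n in common_substring_lengths(s1, s2) if n >= min_len)
    return total / longer


print(first_score("５０％溶液", "溶液（５０％）"))
```

For the word-order example, the common runs "溶液" (2) and "５０％" (3) sum to 5, and the longer string has 7 characters, so the sketch yields 5/7.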

FIG. 5 is a flowchart showing the flow of the second score calculation process according to an embodiment of the present invention.

In step 21 (S21), the common substring length calculation unit 103 calculates the common substring lengths of S1i (the i-th character string of character string group 1) and S2j (the j-th character string of character string group 2). As in S11, a plurality of common substring lengths can be calculated.

In step 22 (S22), the second score calculation unit 105 divides the longest of the common substring lengths calculated in S21 (hereinafter also referred to as the "longest common substring length") by the shorter of the length of S1i and the length of S2j, and takes the resulting value as the second score. A larger second score indicates more similar character strings.
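S21 and S22 can be sketched as follows. The function names are hypothetical, and the sketch keeps only two rows of the comparison table at a time, which is an implementation shortcut chosen for this illustration rather than something the specification prescribes.

```python
def longest_common_substring_length(s1: str, s2: str) -> int:
    """Largest value that would appear in the comparison table."""
    prev = [0] * (len(s1) + 1)  # previous table row
    best = 0
    for c2 in s2:
        curr = [0] * (len(s1) + 1)
        for i, c1 in enumerate(s1):
            if c1 == c2:
                curr[i + 1] = prev[i] + 1  # extend the run from the upper-left cell
                best = max(best, curr[i + 1])
        prev = curr
    return best


def second_score(s1: str, s2: str) -> float:
    shorter = min(len(s1), len(s2))
    if shorter == 0:
        return 0.0  # assumption: an empty string scores 0
    return longest_common_substring_length(s1, s2) / shorter


print(second_score("ABCDEFG", "AbcDE"))
```

Dividing by the shorter length means a short string that is fully contained in a longer one reaches a score of 1.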

The longest common subsequence length may be used instead of the longest common substring length described above. The longest common subsequence length is the number of characters common to the two character strings, where these characters do not need to be contiguous.
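For comparison, the longest common subsequence length can be computed with the standard dynamic-programming recurrence. This is a generic textbook sketch, not code from the specification, and `lcs_length` is a hypothetical name.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence (characters need not be contiguous)."""
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        curr = [0]
        for j, c2 in enumerate(s2):
            if c1 == c2:
                curr.append(prev[j] + 1)                # extend a match
            else:
                curr.append(max(prev[j + 1], curr[j]))  # skip a character in s1 or s2
        prev = curr
    return prev[-1]


print(lcs_length("ABCDEFG", "AbcDE"))
```

For "ABCDEFG" and "AbcDE" the common subsequence is "ADE" (length 3), whereas the longest common substring is "DE" (length 2), so this variant gives a higher score when matching characters are separated by differing ones.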

Next, the details of S11 in FIG. 4 and S21 in FIG. 5 are described. FIG. 6 is a flowchart showing the flow of the common substring length calculation process according to an embodiment of the present invention. As shown in FIG. 7, a table for storing data is used in which the number of columns is the length of character string S1 plus 1 and the number of rows is the length of character string S2 plus 1. The common substring length calculation unit 103 assigns the characters of character string S1 to the columns in order, starting from the second column from the left. Likewise, it assigns the characters of character string S2 to the rows in order, starting from the second row from the top.

In step 101 (S101), the common substring length calculation unit 103 initializes the longest substring length to 0. Specifically, it sets the value of every column in the first (top) row and every row in the first (leftmost) column to 0. At this point, the longest substring length is 0.

The common substring length calculation unit 103 starts the search from the first characters of character string S1 and character string S2. Here, i denotes the i-th character counted from the first character of character string S1, and j denotes the j-th character counted from the first character of character string S2. The search starts from i = 0 and j = 0 (that is, the first characters).

In step 102 (S102), the common substring length calculation unit 103 determines whether the i-th character of character string S1 and the j-th character of character string S2 are equal. If they are equal, the process proceeds to step 103; if not, it proceeds to step 104. In step 104 (S104), the common substring length calculation unit 103 enters 0 in the cell at row j+1, column i+1 of the table, and proceeds to step 105.

In step 103 (S103), the common substring length calculation unit 103 enters, in the cell at row j+1, column i+1 of the table, the value of the cell at row j, column i plus 1. If this value is larger than the longest substring length at that point, the longest substring length is updated to this value.

In step 105 (S105), the common substring length calculation unit 103 determines whether the search has reached the last character of character string S1. If so, the process proceeds to step 106; if not, 1 is added to i and the process returns to S102.

In step 106 (S106), the common substring length calculation unit 103 determines whether the search has reached the last character of character string S2. If so, the process ends; if not, 1 is added to j and the process returns to S102.

The processing of S101 to S106 completes the comparison of all characters. In the table, the common substring lengths are the values of the cells that contain a value other than 0 and whose lower-right cell either contains 0 or does not exist. The longest common substring length is the largest of the values entered in the table.
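The table procedure of S101 to S106, together with the rule for reading the common substring lengths off the finished table, can be sketched as below. The function names `build_table` and `read_lengths` are hypothetical, and Python's 0-based indices are used, so `table[j + 1][i + 1]` is the cell where character `s1[i]` meets character `s2[j]`.

```python
from typing import List


def build_table(s1: str, s2: str) -> List[List[int]]:
    """Fill the comparison table: columns follow s1, rows follow s2."""
    # S101: the top row and leftmost column start at 0 (as do all other cells).
    table = [[0] * (len(s1) + 1) for _ in range(len(s2) + 1)]
    for j, c2 in enumerate(s2):          # S106: advance through s2
        for i, c1 in enumerate(s1):      # S105: advance through s1
            if c1 == c2:                 # S102: compare the two characters
                table[j + 1][i + 1] = table[j][i] + 1  # S103: upper-left value + 1
            else:
                table[j + 1][i + 1] = 0  # S104
    return table


def read_lengths(table: List[List[int]]) -> List[int]:
    """Non-zero cells whose lower-right cell is 0 or does not exist."""
    rows, cols = len(table), len(table[0])
    lengths = []
    for r in range(1, rows):
        for c in range(1, cols):
            if table[r][c] == 0:
                continue
            no_lower_right = r + 1 == rows or c + 1 == cols
            if no_lower_right or table[r + 1][c + 1] == 0:
                lengths.append(table[r][c])
    return lengths


table = build_table("ABCDEFG", "AbcDE")
print(read_lengths(table))                # common substring lengths
print(max(max(row) for row in table))     # longest common substring length
```

For "ABCDEFG" and "AbcDE" the comparison is case-sensitive, so only "A" and "DE" are found as common runs.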

＜共通部分文字列長の演算＞
図７～図１２は、本発明の一実施形態に係る共通部分文字列長の演算について説明するための図である。文字列Ｓ１「ＡＢＣＤＥＦＧ」と文字列Ｓ２「ＡｂｃＤＥ」の２つの文字列についての比較を行うとする。 <Calculation of common substring length>
7 to 12 are diagrams for explaining the calculation of the common substring length according to the embodiment of the present invention. It is assumed that two character strings of the character string S1 "ABCDEFG" and the character string S2 "AbcDE" are compared.

図７～図１２に示されるように、列数が、文字列Ｓ１の文字列の長さ＋１（例では、「ＡＢＣＤＥＦＧ」の長さ：７に１を加えた８）であり、行数が、文字列Ｓ２の文字列の長さ＋１（本例では、「ＡｂｃＤＥ」の長さ：５に１を加えた６）である、データ保存用の表（テーブル）が用いられる。左端から２番目の列から順に、文字列Ｓ１の各文字（本例では、「Ａ」、「Ｂ」、「Ｃ」、「Ｄ」、「Ｅ」、「Ｆ」、「Ｇ」）が割り当てられる。また、上端から２番目の行から順に、文字列Ｓ２の各文字（本例では、「Ａ」、「ｂ」、「ｃ」、「Ｄ」、「Ｅ」）が割り当てられる。 As shown in FIGS. 7 to 12, the number of columns is the length of the character string of the character string S1 + 1 (in the example, the length of "ABCDEFG": 7 plus 1 is 8), and the number of rows is. , The length of the character string of the character string S2 + 1 (in this example, the length of "AbcDE": 5 plus 1 6), a table for data storage is used. Each character of the character string S1 (in this example, "A", "B", "C", "D", "E", "F", "G") is assigned in order from the second column from the left end. Be done. Further, each character of the character string S2 (in this example, "A", "b", "c", "D", "E") is assigned in order from the second line from the upper end.

First, the longest common substring length is initialized to 0. Specifically, every cell in the first row (the top row) and every cell in the first column (the leftmost column) is set to 0. That is, the longest common substring length at this point is 0.

From here on, the value resulting from comparing each pair of characters from the two character strings is entered in the cell where the compared characters intersect. Specifically, if the two characters are equal, the value of the cell to the upper left of the intersecting cell plus 1 is entered in the intersecting cell. If the two characters are not equal, 0 is entered in the intersecting cell. Whenever a value larger than the current longest common substring length is entered, the longest common substring length is updated to that value.

First, the first characters of the two character strings are compared (in this example, "A" of the character string S1 and "A" of the character string S2). Since they are equal, the value of the cell to the upper left of the intersecting cell plus 1 (here, 0 + 1 = 1) is entered in the intersecting cell (see FIG. 7).

Next, the first character of one character string is compared with the second and subsequent characters of the other. In this example, the second character "B" of the character string S1 is first compared with the first character "A" of the character string S2. Since they are not equal, 0 is entered in the cell where the compared characters intersect (see FIG. 8). Repeating this in the same way completes the comparison of each character of the character string S1 with the first character of the character string S2 (see FIG. 9).

Next, the second character of one character string is compared with each character of the other. In this example, the first character "A" of the character string S1 is first compared with the second character "b" of the character string S2. Since they are not equal, 0 is entered in the cell where the compared characters intersect (see FIG. 10). Repeating this in the same way completes the comparison of each character of the character string S1 with the second character of the character string S2 (see FIG. 11).

In this way, the value resulting from comparing each pair of characters from the two character strings is entered in the cell where those characters intersect. In FIG. 12, the comparison of all characters is complete. A common substring length is the value of a cell that holds a value other than 0 and whose lower-right cell is 0 or does not exist (in the example of FIG. 12, the "1" in the cell where A and A intersect and the "2" in the cell where E and E intersect). That is, in the example of FIG. 12, the sum of the common substring lengths is 3 (the sum of "1" and "2"). The longest common substring length is the largest value entered in the table (2 in the example of FIG. 12).
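The table procedure of FIGS. 7 to 12 can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation; the function names `build_table` and `common_substring_lengths` are hypothetical and chosen only for this example.

```python
def build_table(s1: str, s2: str) -> list[list[int]]:
    """Fill the (len(s2) + 1) x (len(s1) + 1) table described for FIGS. 7 to 12."""
    # Row 0 and column 0 stay 0, which is the initialization step.
    table = [[0] * (len(s1) + 1) for _ in range(len(s2) + 1)]
    for j, c2 in enumerate(s2, start=1):        # rows: characters of S2
        for i, c1 in enumerate(s1, start=1):    # columns: characters of S1
            if c1 == c2:
                # Equal characters: value of the upper-left cell plus 1.
                table[j][i] = table[j - 1][i - 1] + 1
    return table


def common_substring_lengths(table: list[list[int]]) -> list[int]:
    """Collect non-zero cells whose lower-right cell is 0 or does not exist."""
    rows, cols = len(table), len(table[0])
    lengths = []
    for j in range(1, rows):
        for i in range(1, cols):
            value = table[j][i]
            if value == 0:
                continue
            has_lower_right = j + 1 < rows and i + 1 < cols
            if not has_lower_right or table[j + 1][i + 1] == 0:
                lengths.append(value)
    return lengths


table = build_table("ABCDEFG", "AbcDE")
lengths = common_substring_lengths(table)
total = sum(lengths)                      # sum of the common substring lengths
longest = max(max(row) for row in table)  # largest value entered in the table
print(lengths, total, longest)            # [1, 2] 3 2
```

For the example strings, the sketch reproduces the values read off FIG. 12: common substring lengths 1 and 2, a sum of 3, and a longest common substring length of 2.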

By using the method described with reference to FIGS. 7 to 12, the common substring lengths can be calculated at high speed. Suppose that two character strings, the character string S1 and the character string S2, are compared, and let the lengths of the respective character strings be

[expression not reproduced]

If a brute-force algorithm is used to obtain the character strings common to the two character strings, the number of ways to cut a substring out of the character string S2 is

[expression not reproduced]

For each cut-out substring, the number of combinations to check for whether it is contained in the character string S1 is

[expression not reproduced]

Therefore, the number of combinations needed to obtain the character strings common to the two character strings is

[expression not reproduced]

Since comparing N characters requires N operations, the total number of comparison operations required is

[expression not reproduced]

That is, the computational cost is proportional to the number of operations given by

[expression not reproduced]

In contrast, the computational cost of the method described with reference to FIGS. 7 to 12 is proportional to the number of operations given by

[expression not reproduced]

so the calculation principle of the present invention can be said to be faster. Note that this computational cost is proportional to the number of operations for the product of

[expression not reproduced]

and

[expression not reproduced]
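As a rough sketch of the cost comparison above, and not a reproduction of the patent's own expressions, write $n_1$ and $n_2$ for the lengths of S1 and S2 (assumed symbols) and assume $n_2 \le n_1$. One way to count the brute-force work is to sum over the length $k$ of the substring cut out of S2:

```latex
\begin{align*}
\#\{\text{substrings of } S_2\} &= \sum_{k=1}^{n_2} (n_2 - k + 1) = \frac{n_2 (n_2 + 1)}{2}, \\
T_{\text{brute}} &= \sum_{k=1}^{n_2} (n_2 - k + 1)\,(n_1 - k + 1)\,k \;\le\; n_1 n_2^{3}, \\
T_{\text{table}} &= n_1 n_2 .
\end{align*}
```

Here $(n_2 - k + 1)$ counts the substrings of length $k$ in S2, $(n_1 - k + 1)$ counts the positions in S1 at which such a substring could start, and $k$ is the number of character comparisons per position. Under these assumptions the brute-force count grows on the order of $n_1 n_2^{3}$, while the table method compares each pair of characters once, for $n_1 n_2$ comparisons.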

<Other embodiments>
When the first score is calculated, a constraint may be placed on the common substring length. Specifically, common substrings of length 1 are excluded. For example, in the table described with reference to FIGS. 7 to 12, common substrings of length 1 can be excluded by taking as common substring lengths only the values of cells whose upper-left cell has a value of 1 or more (in addition to holding a value other than 0 and having a lower-right cell that is 0 or does not exist). If common substrings of length 1 are used, the similarity between character strings S1 and S2 that merely share a single character becomes high. Excluding common substrings of length 1 yields a score specialized for detecting swapped words. Note that when many of the character strings consist of only a few characters, about 2 to 5, it is more useful to include common substrings of length 1 as well.
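The length-1 exclusion described above might be sketched as a minimum-length filter. The function name and the `min_len` parameter are hypothetical; requiring the upper-left cell to be 1 or more is equivalent to requiring the cell's own value to be 2 or more, which is what `min_len=2` expresses here.

```python
def common_substring_lengths(s1: str, s2: str, min_len: int = 1) -> list[int]:
    """Lengths of common substrings, optionally dropping those shorter than min_len."""
    rows, cols = len(s2) + 1, len(s1) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(1, rows):
        for i in range(1, cols):
            if s1[i - 1] == s2[j - 1]:
                table[j][i] = table[j - 1][i - 1] + 1

    lengths = []
    for j in range(1, rows):
        for i in range(1, cols):
            value = table[j][i]
            # A run ends here if the lower-right cell is missing or 0.
            ends_here = j + 1 == rows or i + 1 == cols or table[j + 1][i + 1] == 0
            # max(1, min_len) keeps empty (0) cells out even if min_len <= 0.
            if ends_here and value >= max(1, min_len):
                lengths.append(value)
    return lengths


print(common_substring_lengths("ABCDEFG", "AbcDE"))             # [1, 2]
print(common_substring_lengths("ABCDEFG", "AbcDE", min_len=2))  # [2]
```

With `min_len=2`, the single shared character "A" no longer contributes, so only the run "DE" is counted.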

[Example]
<Accuracy of similar character string detection>
The accuracy of similar character string detection based on this embodiment is described below as an example. FIGS. 13 to 15 are diagrams for comparing the accuracy of similar character string detection. Four measures were compared: the Levenshtein distance and the Jaro-Winkler distance, which are prior art, and the first score and the second score of the present invention.

The smaller the Levenshtein distance, the more similar the character strings. In this comparison, however, the Levenshtein distance was normalized by dividing it by the length of the longer of the two character strings.
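The normalization described above can be sketched as follows, using the standard dynamic-programming form of the Levenshtein distance. The function names are hypothetical, and returning 0.0 for two empty strings is an assumption made only to avoid division by zero.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer string."""
    longer = max(len(a), len(b))
    # Assumption: two empty strings are treated as distance 0.0.
    return levenshtein(a, b) / longer if longer else 0.0


print(levenshtein("ABCDEFG", "AbcDE"))             # 4
print(normalized_levenshtein("ABCDEFG", "AbcDE"))  # 4 / 7
```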

The larger the Jaro-Winkler distance, the more similar the character strings.

As described above, the larger the first score and the second score, the more similar the character strings.
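The two scores can be sketched together, following the definitions in the abstract: the first score divides the sum of the common substring lengths by the longer string length, and the second score divides the longest common substring length by the shorter string length. The function name `similarity_scores` is hypothetical, and returning (0.0, 0.0) for an empty input is an assumption, since the source does not cover that case.

```python
def similarity_scores(s1: str, s2: str) -> tuple[float, float]:
    """Return (first score, second score) for two strings."""
    if not s1 or not s2:
        return 0.0, 0.0  # assumption: an empty string is scored as dissimilar

    rows, cols = len(s2) + 1, len(s1) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(1, rows):
        for i in range(1, cols):
            if s1[i - 1] == s2[j - 1]:
                table[j][i] = table[j - 1][i - 1] + 1

    total = 0    # sum of the common substring lengths
    longest = 0  # longest common substring length
    for j in range(1, rows):
        for i in range(1, cols):
            value = table[j][i]
            longest = max(longest, value)
            run_ends = j + 1 == rows or i + 1 == cols or table[j + 1][i + 1] == 0
            if value and run_ends:
                total += value

    first = total / max(len(s1), len(s2))
    second = longest / min(len(s1), len(s2))
    return first, second


print(similarity_scores("ABCDEFG", "AbcDE"))  # first = 3/7, second = 2/5
print(similarity_scores("AB CD", "CD AB"))    # first = 1.0, second = 0.4
```

In the second call the two words are swapped, and the first score reaches 1.0 because the runs "AB", " ", and "CD" together cover the whole string, while the second score stays at 0.4. Because overlapping runs are each counted, the first score is not bounded by 1 in this sketch.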

<<Comparison 1>>
FIG. 13 shows the result of Comparison 1 of the accuracy of similar character string detection. The upper row shows the similarity between character string 1 "製品１評価レシピ" ("Product 1 evaluation recipe"; the space is a full-width space) and character string 2 "評価レシピ製品１" ("Evaluation recipe product 1"; the space is a half-width space). The lower row shows the similarity between character string 1 and character string 3 "レシピ製品２" ("Recipe product 2"; the space is a half-width space). For character strings in which the order of multiple words has been swapped to be judged similar, character string 1 should be judged more similar to character string 2 than to character string 3. The result of Comparison 1 shows that the first score is suitable for this purpose.

<<Comparison 2>>
FIG. 14 shows the result of Comparison 2 of the accuracy of similar character string detection. The upper row shows the similarity between character string 1 "評価レシピ製品１" ("Evaluation recipe product 1"; the space is a half-width space) and character string 2 "製品１評価レシピ" ("Product 1 evaluation recipe"; the space is a full-width space). The lower row shows the similarity between character string 1 and character string 3 "レシピ製品２" ("Recipe product 2"; the space is a half-width space). For character strings in which the order of multiple words has been swapped to be judged similar, character string 1 should be judged more similar to character string 2 than to character string 3. The result of Comparison 2 shows that the first score is suitable for this purpose.

<<Comparison 3>>
FIG. 15 shows the result of Comparison 3 of the accuracy of similar character string detection. The upper row shows the similarity between character string 1 "製品１評価レシピ" ("Product 1 evaluation recipe"; the space is a half-width space) and character string 2 "品１評価" (a shortened form of character string 1; the space is a half-width space). The lower row shows the similarity between character string 1 and character string 3 "製品２評価レシピ" ("Product 2 evaluation recipe"; the space is a half-width space). For a character string with no words omitted and a character string with words omitted to be judged similar, character string 1 should be judged more similar to character string 2 than to character string 3. The result of Comparison 3 shows that the second score is suitable for this purpose.

<Effect>
As described above, in one embodiment of the present invention, character strings in which the order of multiple words has been swapped can be judged to be similar, so a character string with swapped word order can be detected from the original character string. In addition, in one embodiment of the present invention, a character string with no words omitted and a character string with words omitted can be judged to be similar, so a character string with words omitted (or a character string with no words omitted) can be detected from the original character string.

<Hardware configuration>
FIG. 16 is a hardware configuration diagram of the similar character string detection device 10 and the user terminal 20 according to an embodiment of the present invention. The similar character string detection device 10 and the user terminal 20 each have a CPU (Central Processing Unit) 1001, a ROM (Read Only Memory) 1002, and a RAM (Random Access Memory) 1003. The CPU 1001, the ROM 1002, and the RAM 1003 form a so-called computer.

The similar character string detection device 10 and the user terminal 20 may also have an auxiliary storage device 1004, a display device 1005, an operation device 1006, an I/F (Interface) device 1007, and a drive device 1008.

The hardware components of the similar character string detection device 10 and the user terminal 20 are connected to one another via a bus B.

The CPU 1001 is an arithmetic device that executes the various programs installed in the auxiliary storage device 1004.

The ROM 1002 is a non-volatile memory. The ROM 1002 functions as a main storage device that stores the programs, data, and the like that the CPU 1001 needs in order to execute the various programs installed in the auxiliary storage device 1004. Specifically, the ROM 1002 stores boot programs such as a BIOS (Basic Input/Output System) or EFI (Extensible Firmware Interface).

The RAM 1003 is a volatile memory such as a DRAM (Dynamic Random Access Memory) or an SRAM (Static Random Access Memory). The RAM 1003 functions as a main storage device that provides the work area into which the various programs installed in the auxiliary storage device 1004 are loaded when they are executed by the CPU 1001.

The auxiliary storage device 1004 is an auxiliary storage device that stores the various programs and the information used when those programs are executed.

The display device 1005 is a display device that displays the internal state and the like of the similar character string detection device 10 and the user terminal 20.

The operation device 1006 is an input device with which a person operating the similar character string detection device 10 or the user terminal 20 enters various instructions to that device.

The I/F device 1007 is a communication device for connecting to a network and communicating with other devices.

The drive device 1008 is a device into which a storage medium 1009 is set. The storage medium 1009 here includes media that record information optically, electrically, or magnetically, such as a CD-ROM, a flexible disk, or a magneto-optical disk. The storage medium 1009 may also include semiconductor memory that records information electrically, such as an EPROM (Erasable Programmable Read Only Memory) or a flash memory.

The various programs are installed in the auxiliary storage device 1004 by, for example, setting a distributed storage medium 1009 in the drive device 1008 and having the drive device 1008 read the programs recorded on that storage medium 1009. Alternatively, the various programs may be installed in the auxiliary storage device 1004 by being downloaded from a network via the I/F device 1007.

Although examples of the present invention have been described in detail above, the present invention is not limited to the specific embodiments described above, and various modifications and changes are possible within the scope of the gist of the present invention described in the claims.

1 Similar character string detection system
10 Similar character string detection device
20 User terminal
101 Character string group acquisition unit
102 Character string selection unit
103 Common substring length calculation unit
104 First score calculation unit
105 Second score calculation unit
106 Similar character string detection unit
107 Output unit
1001 CPU
1002 ROM
1003 RAM
1004 Auxiliary storage device
1005 Display device
1006 Operation device
1007 I/F device
1008 Drive device
1009 Storage medium

Claims

A similar character string detection device comprising: a first score calculation unit that calculates a first score by dividing the sum of the lengths of the character strings common to a first character string and a second character string by the longer of the length of the first character string and the length of the second character string; and
an output unit that outputs a combination of similar character strings based on the first score.

The similar character string detection device according to claim 1, further comprising a second score calculation unit that calculates a second score by dividing the number of characters of the longest portion common to the first character string and the second character string by the shorter of the length of the first character string and the length of the second character string,
wherein the output unit outputs a combination of similar character strings based on the second score.

The similar character string detection device according to claim 1, wherein the length of each of the character strings common between the first character string and the second character string is 2 or more.

The similar character string detection device according to claim 2, wherein the number of characters of the longest portion common to the first character string and the second character string is the length of a contiguous character string.

A method executed by a computer, the method comprising:
a step of calculating a first score by dividing the sum of the lengths of the character strings common to a first character string and a second character string by the longer of the length of the first character string and the length of the second character string; and
a step of outputting a combination of similar character strings based on the first score.

A program for causing a computer to function as:
a first score calculation unit that calculates a first score by dividing the sum of the lengths of the character strings common to a first character string and a second character string by the longer of the length of the first character string and the length of the second character string; and
an output unit that outputs a combination of similar character strings based on the first score.

A system including a similar character string detection device and a user terminal, wherein
the similar character string detection device comprises:
a first score calculation unit that calculates a first score by dividing the sum of the lengths of the character strings common to a first character string and a second character string by the longer of the length of the first character string and the length of the second character string; and
an output unit that outputs a combination of similar character strings based on the first score, and
the user terminal displays the combination of similar character strings.