WO2015040793A1 - Character string retrieval device - Google Patents

Info

Publication number
WO2015040793A1
Authority
WO
WIPO (PCT)
Prior art keywords
character string
rule
edit distance
similar
search
Prior art date
Application number
PCT/JP2014/004285
Other languages
French (fr)
Japanese (ja)
Inventor
相川 勇之
悠介 小路
Original Assignee
三菱電機株式会社
Priority date
Filing date
Publication date
Application filed by 三菱電機株式会社
Priority to JP2015537549A (patent JP5846340B2)
Publication of WO2015040793A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G06F16/90335: Query processing
    • G06F16/90344: Query processing by using string matching techniques

Definitions

  • The present invention relates to a character string search device capable of obtaining, as a search result, character strings similar to an input search character string.
  • A similar character string search, which returns character strings that are similar to the input search character string even though they do not match it, is required in cases such as the following: when a user searches for an address or facility name using a common name, an abbreviation, or a misremembered name; when the search character string contains an error caused by a user's operating mistake; or when the search character string contains an error caused by misrecognition of the voice signal in a search by voice input.
  • Patent Document 1 discloses a method of using a weighted edit distance in order to rank candidate sentence examples and narrow them down.
  • Non-Patent Document 1 discloses a technique for weighting a replacement rule for converting a character string into another character string.
  • The present invention has been made to solve the above-described problems, and its object is to reduce the calculation processing amount of the weighted edit distance when a character string search device performs a similar character string search.
  • The character string search device of the present invention includes a weight rule extraction unit that extracts, from among a plurality of similar character string weight rules each defining the similarity between a first character string and a second character string of one or more characters, the similar character string weight rules whose first character string is contained in the input search character string; and an edit distance calculation unit that uses the similar character string weight rules extracted by the weight rule extraction unit to calculate the weighted edit distance between the search character string and a heading character string acquired from the dictionary in which the search character string is searched.
  • According to the present invention, the similar character string weight rules that may be applied to the input search character string are extracted, and in calculating the weighted edit distance during a similar character string search, applicability is determined with reference only to these extracted rules. This reduces the number of applicability determinations and therefore the calculation processing amount of the weighted edit distance.
  • FIG. 2 is a table showing an example of similar character string weight rules stored in the rule storage unit of the character string search device according to the first embodiment.
  • FIG. 3 is a table showing an example of the dictionary data stored in the dictionary.
  • FIG. 5 is a table showing an example of the extracted similar character string weight rules stored in the application rule storage unit.
  • FIG. 6 is a processing flow of the weighted edit distance calculation process of the character string search device according to the first embodiment.
  • FIG. 7 is a processing flow of the edit distance calculation when a similar character string weight rule is applied in the character string search device according to the first embodiment.
  • FIG. 8 is an explanatory diagram illustrating an example of weighted edit distance calculation in the character string search device according to the first embodiment.
  • FIG. 9 is a configuration diagram of the character string search device according to Embodiment 2.
  • FIG. 10 is a processing flow of the weighted edit distance calculation process of the character string search device according to Embodiment 2.
  • FIG. 11 is a processing flow of the abort determination process according to the second embodiment.
  • FIG. 1 is a block diagram of a character string search apparatus according to Embodiment 1 of the present invention.
  • The character string search apparatus according to the first embodiment of the present invention includes a weight rule extraction unit 101, a rule storage unit 102, an application rule storage unit 103, an edit distance calculation unit 104, a dictionary 105, and a distance order alignment unit 106.
  • Based on the input character string 110 entered by the user, the weight rule extraction unit 101 extracts the similar character string weight rules related to the input character string 110 from the rule storage unit 102 and stores them in the application rule storage unit 103.
  • The input character string 110 is the search character string: a character string entered using an input device such as a keyboard or a touch panel, or recognized by a voice recognition mechanism (not shown) from a voice signal captured by a voice input device.
  • The present invention does not limit the input method of the search character string; it may be a character string input by various other methods.
  • A similar character string weight rule is a rule that defines the similarity between two character strings (the first character string and the second character string) and is used when calculating the weighted edit distance between the search character string and another character string.
  • The edit distance calculation unit 104 refers to the dictionary 105 and calculates the weighted edit distance between the input character string 110 and each heading character string stored in the dictionary 105, using the similar character string weight rules stored in the application rule storage unit 103.
  • The distance order alignment unit 106 outputs the heading character strings in the dictionary 105 as a similar character string list 111, in ascending order of the weighted edit distance calculated by the edit distance calculation unit 104.
  • The similar character string list 111 is used in subsequent processing, for example by being presented to the user on a display (not shown).
  • The weight rule extraction unit 101, the edit distance calculation unit 104, and the distance order alignment unit 106 can be implemented in hardware such as an ASIC (Application Specific Integrated Circuit), as dedicated hardware using a processor together with software executed on that processor, as software running on a general-purpose computer, or as a combination of these.
  • The rule storage unit 102, the application rule storage unit 103, and the dictionary 105 may be implemented using a volatile storage medium such as a RAM (Random Access Memory) or a non-volatile storage medium such as an HDD (Hard Disc Drive). Alternatively, they may be configured to be read and written remotely via a communication line, or a detachable device may be used.
  • FIG. 2 is a table showing an example of similar character string weight rules stored in the rule storage unit 102.
  • the rule number 201 is a number uniquely assigned to identify each similar character string weight rule.
  • The left-side character string 202 (the first character string) and the right-side character string 203 (the second character string) are the pair of character strings whose similarity each similar character string weight rule defines, and the similarity score 204 indicates the similarity between the left-side character string 202 and the right-side character string 203.
  • FIG. 3 is a table showing an example of dictionary data stored in the dictionary 105.
  • the dictionary 105 stores a heading character string and attribute information of the heading character string.
  • the attribute information is information such as the type (part of speech) and content of the heading character string.
  • FIG. 4 is a processing flow of the character string search apparatus according to the first embodiment.
  • the weight rule extraction unit 101 first performs rule extraction processing (ST201).
  • Based on the input character string 110, the weight rule extraction unit 101 extracts, from among the plurality of similar character string weight rules stored in the rule storage unit 102, the rules that may be applied to the input character string 110.
  • The similar character string weight rule that may be applied here is a similar character string weight rule in which the left-side character string 202 is included in the input character string 110.
  • Specifically, for each character position, the partial character string from the first character up to that position is referred to, and the rules are selected whose left-side character string 202 matches a portion of that partial character string that includes its last character (for example, for the partial character string up to the third character, a portion that includes the third character).
  • the extracted similar character string weight rule is stored in the application rule storage unit 103.
  • FIG. 5 is a table showing an example of similar character string weight rules stored in the application rule storage unit 103.
  • similar character string weight rules are classified and stored for each corresponding character position in the input character string 110.
  • The example of FIG. 5 shows the similar character string weight rules extracted from those stored in the rule storage unit 102 illustrated in FIG. 2 when the input character string 110 is the three characters “CHA”.
  • An input character position 205 indicates which character of the input character string 110 (“CHA”) a rule corresponds to, and the rules are classified by the character position to which they correspond.
  • For each character position from the beginning to the end of the input character string 110, the weight rule extraction unit 101 refers to the partial character string from the first character up to that position and extracts the rules whose left-side character string 202 matches a portion of that partial character string that includes its last character.
  • The rule count 206 indicates the number of rules corresponding to each input character position 205.
  • the edit distance calculation unit 104 performs weighted edit distance calculation (ST202).
  • the editing distance calculation unit 104 sequentially reads the heading character strings from the dictionary 105, and calculates the weighted editing distance between each heading character string and the input character string 110.
  • the processing content of the weighted edit distance calculation will be described later.
  • The distance order alignment unit 106 sorts the heading character strings that were read from the dictionary 105 and for which the weighted edit distance was calculated in ST202 in ascending order of weighted edit distance (closest first), and outputs them as the similar character string list 111.
  • the output similar character string list 111 is used for processing such as being presented to a user on a display (not shown) or the like.
  • FIG. 6 is a processing flow of the weighted editing distance calculation processing in ST202.
  • the editing distance from the original character string (referred to as Str1) that is the input character string 110 to the target character string (referred to as Str2) that is each heading character string is obtained.
  • the character string length of Str1
  • the character string length of Str2
  • M [0,0] is set to 0
  • ) is set to i
  • ) is set to j. Substitute and initialize. Note that whether Str1 and Str2 match may be checked before ST301; if they match, the processing from ST302 onward may be skipped.
  • step ST302 the process from step ST302 to step ST310 is repeated
  • the process from step ST303 to step ST309 is repeated
  • FIG. 7 is a processing flow of edit distance calculation processing using the similar character string weight rule of ST308.
  • The processing from ST401 to ST405 is repeated N_R[i] times (that is, as many times as the rule count 206 for the i-th character of the input character string 110).
  • Whether Rule[i][k] is applicable is determined (similar character string weight rule applicability determination) (ST402). Specifically, it is checked whether the right-side character string 203 of the similar character string weight rule stored in Rule[i][k] matches the partial character string of Str2, the target character string, that ends at its j-th character. If it matches, the rule is determined to be applicable.
  • The contents of the rule are acquired (ST403).
  • The similarity score 204 defined in Rule[i][k] is assigned to the variable rule_score, the number of characters of the left-side character string is assigned to the variable len1, and the number of characters of the right-side character string is assigned to the variable len2.
  • M[i, j] in the edit distance calculation table is updated (ST404). In this process, the two values M[i, j] and M[i-len1, j-len2] + rule_score are compared, and M[i, j] is updated with the smaller value.
  • FIG. 8A shows the edit distance calculation table initialized in ST301 in this operation example.
  • ST308 ie ST401 in FIG. 7
  • k 1
  • CQ right-hand side character string 203
  • M [i-len1, j-len2] + rule_score is compared with the value of M [i, j] before update, and the value of M [i, j] is updated with the smaller value.
  • In this embodiment, the similar character string weight rules that may be applied to the input character string are extracted from those stored in the rule storage unit 102, based on the input character string that is the search character string.
  • As a result, the number of applicability determination processes can be reduced.
  • This has the effect of speeding up the similar character string search process of the character string search apparatus.
  • The number of similar character string weight rules is very large. For example, in languages such as French, which have many spellings with similar sounds, hundreds of similar character string weight rules are required to handle users' misrememberings, keystroke errors, and voice recognition errors.
  • With the weight rule extraction unit, the number of similar character string weight rules subject to applicability determination during weighted edit distance calculation can be reduced to about several tens or fewer, and the similar character string search processing can be greatly sped up.
  • A preliminary narrowing means, such as narrowing by tf-idf (term frequency-inverse document frequency) weights as shown in, for example, Patent Document 1, may also be provided before the weighted edit distance is calculated.
  • In this embodiment, the rules extracted from the similar character string weight rules stored in the rule storage unit 102 are stored in the application rule storage unit 103.
  • Alternatively, only the rule numbers may be stored in the application rule storage unit.
  • In that case, the similar character string weight rules may be acquired from the rule storage unit by rule number when the weighted edit distance is calculated.
  • The character string may also be a word string.
  • Although this embodiment has been described using an alphabetic input character string, the present invention is not limited to alphabetic input and may be applied to character strings in other scripts (for example, hiragana or kanji).
  • FIG. 9 is a block diagram of a character string search device according to Embodiment 2 of the present invention.
  • The difference from the character string search device of the first embodiment is that the edit distance calculation unit 104a has a function of aborting the edit distance calculation partway, based on the distance upper limit value 112 set from the outside.
  • The weight rule extraction unit 101, rule storage unit 102, application rule storage unit 103, dictionary 105, distance order alignment unit 106, input character string 110, and similar character string list 111 are the same as those in the first embodiment.
  • FIG. 10 is a processing flow of a weighted edit distance calculation process of the character string search apparatus according to the second embodiment.
  • Each process from ST301 to ST310 is the same as that in the first embodiment.
  • An abort determination for the weighted edit distance calculation is performed between ST309 and ST310 (ST311).
  • FIG. 11 is a processing flow of ST311 abort determination.
  • the processing of ST311 will be described in detail with reference to FIG.
  • the value of the variable i is Z.
  • MinL is calculated (ST501). Specifically, MinL can be obtained as follows.
  • the distance upper limit 112 is compared with M [x, y] (MinL ⁇ x ⁇ Z, 0 ⁇ y ⁇
  • Setting a distance upper limit value so that only highly similar character strings are searched is effective when, for example, the number of heading character strings stored in a dictionary, such as facility names and addresses, reaches hundreds of thousands to millions.
  • the character string search device of the present invention realizes an increase in the processing speed of the device and a reduction in the processing capability required for the device by reducing the processing amount of the weighted edit distance calculation in the character string search. Therefore, it is useful in an apparatus for performing a character string search such as a car navigation system.
  • 101 weight rule extraction unit, 102 rule storage unit, 103 application rule storage unit, 104, 104a edit distance calculation unit, 105 dictionary, 106 distance order alignment unit, 110 input character string, 111 similar character string list, 112 distance upper limit value, 201 rule number, 202 left-side character string, 203 right-side character string, 204 similarity score, 205 input character position, 206 rule count.
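The rule-applying table update of ST401 to ST404 described in the list above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `weighted_edit_distance`, the `rules_at` mapping, and the unit costs for plain insertion, deletion, and substitution are hypothetical, since the extracted text does not give the base operation costs. Only the applicability check on the right-side string and the update `M[i, j] = min(M[i, j], M[i-len1, j-len2] + rule_score)` follow the text.

```python
def weighted_edit_distance(str1, str2, rules_at):
    """Weighted edit distance from str1 (query) to str2 (heading string).

    rules_at[i] holds (lhs, rhs, score) rules whose lhs ends at the
    i-th character of str1 (1-based), i.e. the pre-extracted rules.
    """
    n, m = len(str1), len(str2)
    inf = float("inf")
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            # Assumed unit-cost base operations (not specified in the text).
            if i > 0:
                M[i][j] = min(M[i][j], M[i - 1][j] + 1.0)       # deletion
            if j > 0:
                M[i][j] = min(M[i][j], M[i][j - 1] + 1.0)       # insertion
            if i > 0 and j > 0:
                cost = 0.0 if str1[i - 1] == str2[j - 1] else 1.0
                M[i][j] = min(M[i][j], M[i - 1][j - 1] + cost)  # match or substitution
            # Rule application: only the rules extracted for position i are tested.
            for lhs, rhs, rule_score in rules_at.get(i, []):
                len1, len2 = len(lhs), len(rhs)
                # Applicable when rhs matches the part of str2 ending at j.
                applicable = i >= len1 and j >= len2 and str2[j - len2:j] == rhs
                if applicable:
                    M[i][j] = min(M[i][j], M[i - len1][j - len2] + rule_score)
    return M[n][m]


# Hypothetical example: the score 0.3 for "C" -> "K" is a placeholder.
plain = weighted_edit_distance("CHA", "KHA", {})
with_rule = weighted_edit_distance("CHA", "KHA", {1: [("C", "K", 0.3)]})
```

Under these assumed costs, the single rule lowers the distance between “CHA” and “KHA” from 1.0 to 0.3, and the inner loop tests only the rules listed for position `i` rather than the full rule set.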

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

 The present invention is provided with: a weight rule extraction unit (101) for extracting, from among a plurality of similar character string weight rules in which similarity of a first character string and a second character string consisting of one or more characters is defined, a similar character string weight rule containing the first character string in an inputted retrieval character string; and an editing distance calculation unit (104) for calculating, by using the similar character string weight rule extracted by the weight rule extraction unit (101), a weighted editing distance between the retrieval character string and an entry character string acquired from a dictionary from which the retrieval character string is retrieved.

Description

Character string search device
The present invention relates to a character string search device capable of obtaining, as a search result, character strings similar to an input search character string.
A similar character string search, which returns character strings that are similar to the input search character string even though they do not match it, is required in cases such as the following: when a user searches for an address or facility name using a common name, an abbreviation, or a misremembered name; when the search character string contains an error caused by a user's operating mistake; or when the search character string contains an error caused by misrecognition of the voice signal in a search by voice input on a car navigation system, smartphone, or the like.
In a similar character string search, it is necessary to evaluate the similarity between the search character string and each character string obtained as a search result. The weighted edit distance is used as a measure of the similarity between character strings. For example, Patent Document 1 discloses a method that uses a weighted edit distance to rank candidate example sentences and narrow them down.
Non-Patent Document 1 discloses a technique for assigning weights to replacement rules that convert one character string into another.
JP 2004-62893 A (FIG. 3)
When, as described in Patent Document 1, the weighted edit distance is used as the similarity measure to search a dictionary for the correct character string similar to a search character string containing an error, the weighted edit distance between the erroneous search character string and each correct character string must be calculated.
When the similarity between two character strings is defined in similar character string weight rules (replacement rules) as described in Non-Patent Document 1 and the weighted edit distance is calculated based on that similarity, a large number of similar character string weight rules is required because error patterns are numerous. In the weighted edit distance calculation, whether each of these many rules applies to the character string being processed must be determined. Consequently, the processing amount for determining the applicability of similar character string weight rules in the weighted edit distance calculation is very large.
The present invention has been made to solve the above problem, and its object is to reduce the calculation processing amount of the weighted edit distance when a character string search device performs a similar character string search.
The character string search device of the present invention includes a weight rule extraction unit that extracts, from among a plurality of similar character string weight rules each defining the similarity between a first character string and a second character string of one or more characters, the similar character string weight rules whose first character string is contained in the input search character string; and an edit distance calculation unit that uses the similar character string weight rules extracted by the weight rule extraction unit to calculate the weighted edit distance between the search character string and a heading character string acquired from the dictionary in which the search character string is searched.
According to the present invention, the similar character string weight rules that may be applied to the input search character string are extracted, and in calculating the weighted edit distance during a similar character string search, applicability is determined with reference only to these extracted rules. This reduces the number of applicability determinations and therefore the calculation processing amount of the weighted edit distance.
FIG. 1 is a configuration diagram of the character string search device according to Embodiment 1.
FIG. 2 is a table showing an example of the similar character string weight rules stored in the rule storage unit of the character string search device according to Embodiment 1.
FIG. 3 is a table showing an example of the dictionary data stored in the dictionary.
FIG. 4 is a processing flow of the character string search device according to Embodiment 1.
FIG. 5 is a table showing an example of the extracted similar character string weight rules stored in the application rule storage unit.
FIG. 6 is a processing flow of the weighted edit distance calculation process of the character string search device according to Embodiment 1.
FIG. 7 is a processing flow of the edit distance calculation when a similar character string weight rule is applied in the character string search device according to Embodiment 1.
FIG. 8 is an explanatory diagram illustrating an example of weighted edit distance calculation in the character string search device according to Embodiment 1.
FIG. 9 is a configuration diagram of the character string search device according to Embodiment 2.
FIG. 10 is a processing flow of the weighted edit distance calculation process of the character string search device according to Embodiment 2.
FIG. 11 is a processing flow of the abort determination process according to Embodiment 2.
Embodiments of the present invention are described below with reference to the drawings. In the drawings referred to, identical or corresponding parts are denoted by the same reference numerals.
Embodiment 1
FIG. 1 is a configuration diagram of the character string search device according to Embodiment 1 of the present invention. The device comprises a weight rule extraction unit 101, a rule storage unit 102, an application rule storage unit 103, an edit distance calculation unit 104, a dictionary 105, and a distance order alignment unit 106. Based on the input character string 110 entered by the user, the weight rule extraction unit 101 extracts the similar character string weight rules related to the input character string 110 from the rule storage unit 102 and stores them in the application rule storage unit 103.
Here, the input character string 110 is the search character string: a character string entered using an input device such as a keyboard or a touch panel, or recognized by a voice recognition mechanism (not shown) from a voice signal captured by a voice input device. The present invention does not limit the input method of the search character string; it may be a character string input by various other methods.
A similar character string weight rule is a rule that defines the similarity between two character strings (the first character string and the second character string) and is used when calculating the weighted edit distance between the search character string and another character string.
The edit distance calculation unit 104 refers to the dictionary 105 and calculates the weighted edit distance between the input character string 110 and each heading character string stored in the dictionary 105, using the similar character string weight rules stored in the application rule storage unit 103. The distance order alignment unit 106 outputs the heading character strings of the dictionary 105 as the similar character string list 111, in ascending order of the weighted edit distance calculated by the edit distance calculation unit 104. The similar character string list 111 is used in subsequent processing, for example by being presented to the user on a display (not shown).
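The division of work between the edit distance calculation unit 104 and the distance order alignment unit 106 can be illustrated with a short Python sketch. The names `rank_by_distance` and `levenshtein` and the sample heading strings are hypothetical, and a plain unweighted Levenshtein distance stands in for the weighted edit distance purely to keep the example self-contained.

```python
def levenshtein(a, b):
    """Plain (unweighted) edit distance, used here only as a stand-in."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def rank_by_distance(query, headings, distance):
    """Score every heading string, then list them closest first."""
    scored = [(distance(query, h), h) for h in headings]
    scored.sort(key=lambda pair: pair[0])  # ascending distance
    return [h for _, h in scored]


# Hypothetical dictionary entries, not taken from FIG. 3.
similar_list = rank_by_distance("CHA", ["KHAN", "CHA", "CHAT"], levenshtein)
```

Passing the distance function as a parameter mirrors the separation of the two units: the ranking step does not need to know whether rules were applied when the distance was computed.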
The weight rule extraction unit 101, the edit distance calculation unit 104, and the distance order alignment unit 106 can be implemented in hardware such as an ASIC (Application Specific Integrated Circuit), as dedicated hardware using a processor together with software executed on that processor, as software running on a general-purpose computer, or as a combination of these.
The rule storage unit 102, the application rule storage unit 103, and the dictionary 105 may be implemented using a volatile storage medium such as a RAM (Random Access Memory) or a non-volatile storage medium such as an HDD (Hard Disc Drive). Alternatively, they may be configured to be read and written remotely via a communication line, or a detachable device may be used.
FIG. 2 is a table showing an example of the similar character string weight rules stored in the rule storage unit 102. The rule number 201 is a number uniquely assigned to identify each similar character string weight rule. The left-side character string 202 (the first character string) and the right-side character string 203 (the second character string) are the pair of character strings whose similarity each rule defines, and the similarity score 204 indicates the similarity between the left-side character string 202 and the right-side character string 203. Here, a smaller similarity score 204 indicates a higher similarity.
The example of FIG. 2 includes pairs that are pronounced alike and are therefore often misspelled, such as "PH" and "F" (rule number 201 = 2) and "C" and "K" (rule number 201 = 20). It also includes pairs such as "A" and "AA" (rule number 201 = 52), which handle errors such as dropping one of two consecutive identical characters when typing on a keyboard or the like.
These similar character string weight rules may be defined manually in advance, or may be built by collecting many examples of erroneous input and applying a statistical technique such as machine learning.
FIG. 3 is a table showing an example of the dictionary data stored in the dictionary 105. The dictionary 105 records heading character strings and attribute information for each heading character string. Here, attribute information means information such as the type (part of speech) and content of the heading character string.
Next, the operation of the character string search device according to Embodiment 1 of the present invention will be described. FIG. 4 shows the processing flow of the character string search device of Embodiment 1. When the input character string 110 is input to the character string search device, the weight rule extraction unit 101 first performs rule extraction processing (ST201). In ST201, the weight rule extraction unit 101 extracts, from the plurality of similar character string weight rules stored in the rule storage unit 102, those that may be applied to the input character string 110.
A similar character string weight rule that may be applied is one whose left-side character string 202 is contained in the input character string 110. For each character position from the beginning to the end of the input character string 110, the unit looks at the partial character string running from the first character to that position, and selects the rules whose left-side character string 202 matches a portion of that partial character string that includes its last character (for example, the third character for the partial character string up to the third character). In other words, it selects the rules whose left side is a suffix of that partial character string. Extracting rules in this way classifies them by the character position they correspond to. The extracted similar character string weight rules are stored in the applied rule storage unit 103.
FIG. 5 is a table showing an example of the similar character string weight rules stored in the applied rule storage unit 103. In the applied rule storage unit 103, the rules are stored classified by the corresponding character position of the input character string 110.
The example in FIG. 5 shows the rules extracted from the rules in the rule storage unit 102 illustrated in FIG. 2 when the input character string 110 is the three characters "CHA".
In FIG. 5, the input character position 205 indicates which character of the input character string 110 ("CHA") a rule corresponds to, and the rules are classified by the character at each position.
As described above, for each character position from the beginning to the end of the input character string 110, the weight rule extraction unit 101 refers to the partial character string from the first character to that position and extracts the rules whose left-side character string 202 matches a portion of that partial character string that includes its last character.
That is, when the input character string 110 is "CHA", for the character at position 1 ("C"), rules whose left-side character string 202 is the single character "C" are extracted. Likewise, for the character at position 2 ("H"), rules whose left-side character string 202 is the single character "H" or the two characters "CH" are extracted. For the character at position 3 ("A"), rules whose left-side character string 202 is the single character "A", the two characters "HA", or the three characters "CHA" are extracted.
As a result, FIG. 5 shows the extracted similar character string weight rules classified by character position.
The rule count 206 indicates the number of rules corresponding to each input character position 205. The example of FIG. 5 shows a case in which 10 rules corresponding to the first character ("C"), 6 rules corresponding to the second character ("H"), and 9 rules corresponding to the third character ("A") were extracted from the rule storage unit 102.
If the input character string 110 had five characters, rows for input character positions 4 and 5 would be added to the table of FIG. 5.
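The position-by-position extraction of ST201 can be sketched as follows. This is a minimal illustration under stated assumptions, not the device's implementation: the function name `extract_rules` and the tuple layout are hypothetical, the table holds only a handful of rules, and every score except the 0.4 of rule 41 is invented for the example.

```python
from collections import defaultdict

# Hypothetical excerpt of a rule table: (rule_number, left, right, score).
# Only the pairings and the 0.4 score of rule 41 come from the description;
# the other scores are placeholders.
RULES = [
    (2, "PH", "F", 0.2),
    (20, "C", "K", 0.3),
    (41, "CH", "CQ", 0.4),
    (52, "A", "AA", 0.2),
]


def extract_rules(query, rules):
    """Group rules by the 1-based position of `query` at which their left side ends."""
    by_position = defaultdict(list)
    for i in range(1, len(query) + 1):
        prefix = query[:i]  # partial string from the first character to position i
        for rule in rules:
            left = rule[1]
            # The left side must match a part of the prefix that includes its last character.
            if prefix.endswith(left):
                by_position[i].append(rule)
    return dict(by_position)


applied = extract_rules("CHA", RULES)
rule_counts = {i: len(found) for i, found in applied.items()}
```

Here `applied` plays the role of the array Rule and `rule_counts` the role of the rule count 206; the "PH" rule is dropped because "PH" never ends at any position of "CHA".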
To make the following description of the processing flow easier to follow, assume that the similar character string weight rules stored in the applied rule storage unit 103 are held, in order of rule number 201, in an array Rule[i][k] (1 ≦ i ≦ the number of characters in the input character string 110, 1 ≦ k ≦ the rule count 206 for the i-th character), and that the edit distance calculation unit 104 can refer to this array Rule. Assume also that the rule count 206 for the i-th character is stored in an array N_R[i], which the edit distance calculation unit 104 can likewise refer to.
For example, for the second character "H" in the example of FIG. 5, the rule with rule number 201 = 41 is stored in Rule[2][1], the rule with rule number 201 = 42 in Rule[2][2], and so on, and the rule count 206 = 6 is stored in N_R[2].
Next, the edit distance calculation unit 104 performs the weighted edit distance calculation (ST202). In ST202, the edit distance calculation unit 104 reads the heading character strings from the dictionary 105 one after another and calculates the weighted edit distance between each heading character string and the input character string 110. The details of the weighted edit distance calculation are described later.
Next, the distance order alignment unit 106 sorts the heading character strings that were read from the dictionary 105 and had their weighted edit distances calculated in ST202 in ascending order of weighted edit distance (closest first), and outputs them as the similar character string list 111. The output similar character string list 111 is used in subsequent processing, for example being presented to the user on a display (not shown).
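The sorting step can be illustrated with a short sketch. The function name `rank_by_distance`, the sample distances, and the alphabetical tie-break are all assumptions made for this example; the description only requires ascending order of weighted edit distance.

```python
def rank_by_distance(distances):
    """Return heading strings ordered from smallest to largest weighted edit distance.

    `distances` maps each heading string to its weighted edit distance.
    Ties are broken alphabetically, which is an assumption of this sketch.
    """
    return sorted(distances, key=lambda heading: (distances[heading], heading))


# Hypothetical distances, standing in for the output of ST202.
example_distances = {"CQA": 0.4, "CHAT": 1.0, "KHA": 0.3}
similar_list = rank_by_distance(example_distances)
```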
Next, the weighted edit distance calculation of ST202, which the edit distance calculation unit 104 performs for each heading character string read from the dictionary 105, will be described. Here, a DP (Dynamic Programming) technique is used to calculate the edit distance. The present invention does not limit the edit distance calculation to the DP technique; other techniques may be used. FIG. 6 shows the processing flow of the weighted edit distance calculation in ST202.
The edit distance from the source character string (Str1), which is the input character string 110, to the target character string (Str2), which is each heading character string, is obtained. First, the table used for the DP edit distance calculation is initialized (ST301). This edit distance calculation table is a two-dimensional array M[i,j] (0 ≦ i ≦ |Str1|, 0 ≦ j ≦ |Str2|), where |Str1| denotes the length of Str1 and |Str2| the length of Str2. In ST301, the table is initialized by assigning 0 to M[0,0], i to M[i,0] (1 ≦ i ≦ |Str1|), and j to M[0,j] (1 ≦ j ≦ |Str2|).
Before ST301 is performed, it may be determined whether Str1 and Str2 are identical, and if they are, the processing from ST302 onward may be skipped.
Next, with the variable i initialized to 1 and incremented by 1 each time, the processing from step ST302 to step ST310 is repeated |Str1| times.
Similarly, inside the loop over i, with the variable j initialized to 1 and incremented by 1 each time, the processing from step ST303 to step ST309 is repeated |Str2| times.
The loop processing over the variables i and j is described in detail below. The i-th character of Str1 is written Str1[i]. First, the character Str1[i] is compared with the character Str2[j] (ST304). If they are equal, 0 is assigned to the variable score (ST305); if not, 1 is assigned to score (ST306). Assigning 1 in ST306 means that one character substitution is needed to make Str1 and Str2 the same character string.
Next, the value of M[i,j] in the edit distance calculation table is updated (ST307). In this step, the three values M[i-1,j-1]+score, M[i-1,j]+1, and M[i,j-1]+1 are compared, and M[i,j] is set to the smallest of them.
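Steps ST301 through ST307, taken on their own without any weight rules, amount to the familiar unit-cost edit distance. The following sketch shows that part only; the function name `edit_distance` is hypothetical, and Python's 0-based indexing means Str1[i] in the description corresponds to `str1[i - 1]` here.

```python
def edit_distance(str1, str2):
    """Unit-cost edit distance following ST301 to ST307 (no weight rules applied)."""
    n, m = len(str1), len(str2)
    # ST301: initialize the table with M[0][0] = 0, M[i][0] = i, M[0][j] = j.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i
    for j in range(1, m + 1):
        M[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # ST304 to ST306: 0 if the characters are equal, otherwise 1.
            score = 0 if str1[i - 1] == str2[j - 1] else 1
            # ST307: take the cheapest of substitution, deletion, and insertion.
            M[i][j] = min(M[i - 1][j - 1] + score, M[i - 1][j] + 1, M[i][j - 1] + 1)
    return M[n][m]
```

Without rules, "CHA" and "CQA" are one substitution apart, so this function returns 1 for that pair.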
Next, the edit distance calculation using the similar character string weight rules is performed (ST308). FIG. 7 shows the processing flow of this calculation in ST308. The processing from ST401 to ST405 is repeated N_R[i] times (that is, once for each of the rules counted by the rule count 206 for the i-th character of the input character string 110).
In this loop, it is first determined whether Rule[i][k] is applicable (applicability determination for the similar character string weight rule) (ST402). Specifically, it is checked whether the right-side character string 203 of the rule stored in Rule[i][k] matches the partial character string ending at the j-th character of Str2, the heading character string under consideration; if it matches, the rule is determined to be applicable.
If the rule is determined to be applicable in ST402, the contents of the rule are obtained (ST403). In ST403, the similarity score 204 defined in Rule[i][k] is assigned to the variable rule_score, the number of characters in the left-side character string to the variable len1, and the number of characters in the right-side character string to the variable len2.
Next, M[i,j] in the edit distance calculation table is updated (ST404). In this step, the two values M[i,j] and M[i-len1,j-len2]+rule_score are compared, and M[i,j] is set to the smaller of them.
If the rule is not determined to be applicable in ST402, the processing proceeds to ST405.
When the processing flows shown in FIGS. 6 and 7 have finished in this way, the value finally stored in M[|Str1|,|Str2|] is the weighted edit distance between the input character string 110 and the heading character string.
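The combined flow of FIGS. 6 and 7 can be sketched as below. This is an illustrative reading of the description rather than the actual implementation: the function name `weighted_edit_distance` and the `rules_at` dictionary (which stands in for the array Rule, keyed by 1-based input position) are assumptions of this sketch.

```python
def weighted_edit_distance(str1, str2, rules_at):
    """Weighted edit distance from str1 (input) to str2 (heading string).

    rules_at[i] lists (left, right, rule_score) tuples for the rules whose
    left side ends at 1-based position i of str1.
    """
    n, m = len(str1), len(str2)
    # ST301: table initialization.
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = float(i)
    for j in range(1, m + 1):
        M[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # ST304 to ST307: ordinary unit-cost update.
            score = 0 if str1[i - 1] == str2[j - 1] else 1
            M[i][j] = min(M[i - 1][j - 1] + score, M[i - 1][j] + 1, M[i][j - 1] + 1)
            # ST308 (ST401 to ST405): try every rule extracted for position i.
            for left, right, rule_score in rules_at.get(i, []):
                # ST402: applicable if the right side ends at character j of str2.
                if str2[:j].endswith(right):
                    len1, len2 = len(left), len(right)  # ST403
                    # ST404: keep the smaller of the current and rule-based values.
                    M[i][j] = min(M[i][j], M[i - len1][j - len2] + rule_score)
    return M[n][m]


# Rule 41 from the description: "CH" resembles "CQ" with similarity score 0.4.
distance = weighted_edit_distance("CHA", "CQA", {2: [("CH", "CQ", 0.4)]})
```

Because the rules are looked up by `i` only, the inner loop never touches rules whose left side cannot end at the current input position, which is the saving the rule extraction step is meant to provide.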
A concrete example of the operation is described for the case in which the input character string 110 is "CHA", as in the example above, and the heading character string read from the dictionary 105 is "CQA" (that is, Str1 = "CHA" and Str2 = "CQA").
FIG. 8(a) shows the edit distance calculation table as initialized in ST301 for this example. When processing proceeds from this state according to the flows of FIGS. 6 and 7, the table at the point where ST307 has finished in the loop iteration with i = 2 and j = 2 is as shown in FIG. 8(b).
ST308 (that is, ST401 in FIG. 7) then starts. When k = 1, the right-side character string 203 ("CQ") stored in Rule[2][1] matches the partial character string of Str2 ending at character j = 2 ("CQ"), so the determination in ST402 is true (Y).
The similarity score 204 (0.4) is assigned to the variable rule_score, the length of the left-side character string 202 to the variable len1, and the length of the right-side character string 203 to the variable len2, giving rule_score = 0.4, len1 = 2, and len2 = 2. In ST404, M[i-len1,j-len2]+rule_score is compared with the value of M[i,j] before the update, and M[i,j] is set to the smaller value. In this example, M[2,2] is 1, whereas M[i-len1,j-len2]+rule_score is M[0,0]+0.4 = 0.4, so the edit distance calculation table becomes as shown in FIG. 8(c).
When processing then continues and the flows of FIGS. 6 and 7 finish, the edit distance calculation table is as shown in FIG. 8(d), and the weighted edit distance is calculated as M[3,3] = 0.4.
This embodiment has described obtaining the edit distance that makes the input character string 110 match the heading character string. When the input character string 110 is shorter than the heading character string, however, the weighted edit distance M[|Str1|,J] may be calculated for |Str1| ≦ J ≦ |Str2| and the smallest of these values output. Taking as the weighted edit distance of the heading character string a value that ignores the surplus characters (the characters of the heading character string after the J-th character, where J is the position giving the smallest weighted edit distance) makes it possible, when the user has entered only the first part of the intended name, to complete it and obtain the full name.
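The prefix-completion variant can be sketched as follows. To keep the example short it uses unit costs only and omits the weight rules, which is a simplification of this sketch; the function name `prefix_distance` is likewise hypothetical. The point illustrated is the final step: taking the minimum over the last row from column |Str1| onward instead of reading only M[|Str1|,|Str2|].

```python
def prefix_distance(str1, str2):
    """Smallest M[|str1|][J] for |str1| <= J <= |str2| (unit costs, no weight rules)."""
    n, m = len(str1), len(str2)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i
    for j in range(1, m + 1):
        M[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = 0 if str1[i - 1] == str2[j - 1] else 1
            M[i][j] = min(M[i - 1][j - 1] + score, M[i - 1][j] + 1, M[i][j - 1] + 1)
    # Ignore surplus characters of the heading string: scan columns |str1|..|str2|.
    # If str1 is not shorter than str2, this falls back to the single value M[n][m].
    return min(M[n][min(n, m):])
```

With this variant, an input of "MITSU" is at distance 0 from the heading "MITSUBISHI", whereas the full edit distance between the two would be 5.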
As described above, the device includes a rule extraction unit that extracts, from the similar character string weight rules stored in the rule storage unit 102, the rules that may be applied to the input character string (the search character string), and an applied rule storage unit that stores the extracted rules. Because only the extracted rules are subjected to the applicability determination when the weighted edit distance is calculated, the number of applicability determinations can be reduced. This speeds up the similar character string search processing of the character string search device.
The effect is particularly large when the number of similar character string weight rules is very large. For example, in a language such as French, which has a great many spellings with similar sounds, several hundred kinds of similar character string weight rules are needed to cope with user misunderstandings, typing errors, and speech recognition errors. The rule extraction unit can reduce the number of rules whose applicability is determined during the weighted edit distance calculation to a few dozen kinds or fewer, greatly speeding up the similar character string search processing.
When the dictionary contains a very large amount of data and there are a great many heading character strings for which the weighted edit distance must be calculated, some means of preliminary narrowing may be provided ahead of the weighted edit distance calculation, for example narrowing using the tf-idf (term frequency - inverse document frequency) weights described in Patent Document 1.
In this embodiment, the rules extracted from the similar character string weight rules stored in the rule storage unit 102 are stored in the applied rule storage unit 103. Alternatively, for example, only the rule numbers may be stored in the applied rule storage unit, and the necessary similar character string weight rules may be obtained from the rule storage unit by rule number when the weighted edit distance is calculated.
The embodiment above has described an example of searching for character strings, but the character string may be a word string.
Although this embodiment has been described using an alphabetic input character string, the present invention is not limited to alphabetic input character strings; character strings in other scripts (for example, hiragana or kanji) may also be used.
Embodiment 2.
FIG. 9 is a block diagram of a character string search device according to Embodiment 2 of the present invention. It differs from the character string search device of Embodiment 1 in that the edit distance calculation unit 104a has a function of aborting the edit distance calculation partway through, based on a distance upper limit value 112 set from outside. The remaining elements, namely the weight rule extraction unit 101, the rule storage unit 102, the applied rule storage unit 103, the dictionary 105, the distance order alignment unit 106, the input character string 110, and the similar character string list 111, are the same as in Embodiment 1.
Next, the operation of the character string search device of Embodiment 2 is described, focusing on the differences from Embodiment 1. FIG. 10 shows the processing flow of the weighted edit distance calculation in the character string search device of Embodiment 2. The steps from ST301 to ST310 are the same as in Embodiment 1. In this embodiment, an abort determination for the weighted edit distance calculation is performed between ST309 and ST310 (ST311).
FIG. 11 shows the processing flow of the abort determination in ST311. The processing of ST311 is described in detail below with reference to FIG. 11. Suppose the variable i currently has the value Z. First, MinL is calculated (ST501): the smallest character position, along the input character string 110 axis, among the entries of the edit distance calculation table that may be referred to in the weighted edit distance calculation as i goes from Z to |Str1|.
Specifically, MinL can be obtained as follows. Let MaxLen_p be the largest number of characters in the left-side character strings stored in Rule[p][r] (Z ≦ p ≦ |Str1|, 1 ≦ r ≦ the rule count 206 for the p-th character). Then M[p-MaxLen_p,q] (0 ≦ q ≦ |Str2|) may be referred to in ST307. MinL is therefore the minimum of p-MaxLen_p over the range Z ≦ p ≦ |Str1|.
For example, suppose Str1 = "CHA", Str2 = "CQA", and the array Rule is as shown in FIG. 5. From FIG. 5, the maximum left-side character string length for input character position 2, MaxLen_2, is 2, and that for input character position 3, MaxLen_3, is 2. With Z = 2, the values of p-MaxLen_p (2 ≦ p ≦ 3) are 2-MaxLen_2 = 0 and 3-MaxLen_3 = 1, so MinL = 0.
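The MinL calculation of ST501 can be reproduced with a few lines of code. In this sketch the function name `min_referenced_row` and the left-side lists are hypothetical stand-ins for FIG. 5, chosen only so that MaxLen_2 = MaxLen_3 = 2 as in the text. Treating a position with no rules as MaxLen = 1 is an assumption of the sketch, reflecting that the basic update still reads the previous row.

```python
def min_referenced_row(left_sides_at, z, n):
    """MinL: the minimum of p - MaxLen_p over z <= p <= n.

    left_sides_at[p] lists the left-side strings of the rules extracted for
    1-based input position p. Positions without rules use MaxLen = 1
    (an assumption of this sketch).
    """
    return min(
        p - max((len(left) for left in left_sides_at.get(p, [])), default=1)
        for p in range(z, n + 1)
    )


# Hypothetical left sides for the input "CHA".
left_sides = {1: ["C"], 2: ["H", "CH"], 3: ["A", "HA"]}
min_l = min_referenced_row(left_sides, z=2, n=3)
```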
Next, the distance upper limit value 112 is compared with M[x,y] (MinL ≦ x ≦ Z, 0 ≦ y ≦ |Str2|) (ST502). If even one M[x,y] is smaller than the distance upper limit value 112, the processing proceeds to ST310 and the edit distance calculation continues. If, on the other hand, every M[x,y] is larger than the distance upper limit value 112, the weighted edit distance cannot fall below the distance upper limit value 112 however far the calculation is continued, so the edit distance calculation is aborted. The minimum value at that point is then output as the weighted edit distance for the heading character string Str2.
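The abort determination can be added to the weighted calculation roughly as shown below. This is a sketch under assumptions: the function name `bounded_weighted_distance` and the `rules_at` layout are hypothetical, positions without rules are treated as MaxLen = 1, and a table value exactly equal to the limit is treated as "not larger", so the calculation continues in that case.

```python
def bounded_weighted_distance(str1, str2, rules_at, limit):
    """Weighted edit distance with the ST311 abort check after each row i."""
    n, m = len(str1), len(str2)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = float(i)
    for j in range(1, m + 1):
        M[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = 0 if str1[i - 1] == str2[j - 1] else 1
            M[i][j] = min(M[i - 1][j - 1] + score, M[i - 1][j] + 1, M[i][j - 1] + 1)
            for left, right, rule_score in rules_at.get(i, []):
                if str2[:j].endswith(right):
                    M[i][j] = min(M[i][j], M[i - len(left)][j - len(right)] + rule_score)
        # ST501: lowest row that rows i..n may still refer to.
        min_l = min(
            p - max((len(left) for left, _, _ in rules_at.get(p, [])), default=1)
            for p in range(i, n + 1)
        )
        # ST502: abort when every entry in rows min_l..i exceeds the limit.
        window = [M[x][y] for x in range(min_l, i + 1) for y in range(m + 1)]
        if all(value > limit for value in window):
            return min(window)  # minimum value at the time of the abort
    return M[n][m]
```

For "CHA" and "CQA" with rule 41 and a limit of 1.0 the calculation runs to completion and yields 0.4, while for two unrelated strings and a small limit it stops after the second row.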
As described above, performing the abort determination against a preset distance upper limit value eliminates the processing that would otherwise be spent calculating weighted edit distances exceeding the distance upper limit value, which speeds up the similar character string search.
The effect is particularly large when a distance upper limit value is set so that only highly similar strings are retrieved, for example when the number of heading character strings stored in the dictionary, such as facility names and addresses, runs from hundreds of thousands to over several million.
As described above, the character string search device of the present invention reduces the amount of processing in the weighted edit distance calculation for character string search, which makes it possible to raise the processing speed of the device and lower the processing capability the device requires. It is therefore useful in devices that perform character string searches, such as car navigation systems.
101 weight rule extraction unit, 102 rule storage unit, 103 applied rule storage unit, 104, 104a edit distance calculation unit, 105 dictionary, 106 distance order alignment unit, 110 input character string, 111 similar character string list, 112 distance upper limit value, 201 rule number, 202 left-side character string, 203 right-side character string, 204 similarity score, 205 input character position, 206 rule count.

Claims (5)

  1.  A character string search device comprising:
     a weight rule extraction unit that extracts, from a plurality of similar character string weight rules each defining the similarity between a first character string and a second character string, each consisting of one or more characters, the similar character string weight rules whose first character string is contained in an input search character string; and
     an edit distance calculation unit that calculates, using the similar character string weight rules extracted by the weight rule extraction unit, a weighted edit distance between the search character string and a heading character string obtained from a dictionary in which the search character string is searched for.
  2.  The character string search device according to claim 1, further comprising an applied rule storage unit that stores the similar character string weight rules extracted by the weight rule extraction unit.
  3.  The character string search device according to claim 2, wherein, for each character position of the search character string, the applied rule storage unit stores the extracted similar character string weight rules whose first character string matches a partial character string of the search character string ending at that character position, as the extracted similar character string weight rules corresponding to that character position.
  4.  The character string search device according to any one of claims 1 to 3, further comprising a distance order alignment unit that sorts the heading character strings for which the edit distance calculation unit has calculated the weighted edit distance, in order of closest calculated weighted edit distance first.
  5.  The character string search device according to claim 1, wherein, when the edit distance calculation unit determines during the calculation of the weighted edit distance that the weighted edit distance being calculated will be equal to or greater than a predetermined distance upper limit value, it aborts the calculation of the weighted edit distance and takes the value of the weighted edit distance calculated at the time of that determination as the calculation result.
PCT/JP2014/004285 2013-09-20 2014-08-21 Character string retrieval device WO2015040793A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2015537549A JP5846340B2 (en) 2013-09-20 2014-08-21 String search device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2013195257 2013-09-20
JP2013-195257 2013-09-20

Publications (1)

Publication Number Publication Date
WO2015040793A1 true WO2015040793A1 (en) 2015-03-26

Family

ID=52688465

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2014/004285 WO2015040793A1 (en) 2013-09-20 2014-08-21 Character string retrieval device

Country Status (2)

Country Link
JP (1) JP5846340B2 (en)
WO (1) WO2015040793A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902098A (en) * 2019-01-23 2019-06-18 平安科技(深圳)有限公司 Similar cases are searched and sort method, server and computer readable storage medium
JP2019526142A (en) * 2016-08-31 2019-09-12 北京奇▲芸▼世▲紀▼科技有限公司Beijing Qiyi Century Science & Technology Co., Ltd. Search term error correction method and apparatus
WO2021250837A1 (en) * 2020-06-11 2021-12-16 日本電気株式会社 Search device, search method, and recording medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004062893A (en) * 2002-06-28 2004-02-26 Microsoft Corp System and method for automatic retrieval of example sentence based on weighted editing distance
JP2005011078A (en) * 2003-06-19 2005-01-13 Patolis Corp Similar word retrieval device and method, its program, recording medium with its program recorded and information retreival system
JP2006039866A (en) * 2004-07-26 2006-02-09 Patolis Corp Similar word retrieval device, method, and program, and storage medium recording the program, and information retrieval device
JP2011197716A (en) * 2010-03-17 2011-10-06 Fuji Xerox Co Ltd Pattern matching device, translation device, translation system and translation program

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019526142A (en) * 2016-08-31 2019-09-12 北京奇▲芸▼世▲紀▼科技有限公司Beijing Qiyi Century Science & Technology Co., Ltd. Search term error correction method and apparatus
JP6997781B2 (en) 2016-08-31 2022-01-18 北京奇▲芸▼世▲紀▼科技有限公司 Error correction method and device for search terms
US11574012B2 (en) 2016-08-31 2023-02-07 Beijing Qiyi Century Science & Technology Co., Ltd. Error correction method and device for search term
CN109902098A (en) * 2019-01-23 2019-06-18 平安科技(深圳)有限公司 Similar cases are searched and sort method, server and computer readable storage medium
WO2021250837A1 (en) * 2020-06-11 2021-12-16 日本電気株式会社 Search device, search method, and recording medium
JP7485030B2 (en) 2020-06-11 2024-05-16 日本電気株式会社 Search device, search method, and program

Also Published As

Publication number Publication date
JP5846340B2 (en) 2016-01-20
JPWO2015040793A1 (en) 2017-03-02

Similar Documents

Publication Publication Date Title
CN108711422B (en) Speech recognition method, speech recognition device, computer-readable storage medium and computer equipment
US8504367B2 (en) Speech retrieval apparatus and speech retrieval method
KR20190020119A (en) Error correction methods and devices for search terms
KR20050005523A (en) Word association method and apparatus
CN104615589A (en) Named-entity recognition model training method and named-entity recognition method and device
CN102725790A (en) Recognition dictionary creation device and speech recognition device
US11531693B2 (en) Information processing apparatus, method and non-transitory computer readable medium
CN111191002A (en) Neural code searching method and device based on hierarchical embedding
CN112331206A (en) Speech recognition method and equipment
JP5846340B2 (en) String search device
US20210019476A1 (en) Methods and apparatus to improve disambiguation and interpretation in automated text analysis using transducers applied on a structured language space
KR20230009564A (en) Learning data correction method and apparatus thereof using ensemble score
KR20040004558A (en) Content conversion method and apparatus
CN112581327A (en) Knowledge graph-based law recommendation method and device and electronic equipment
CN115688779A (en) Address recognition method based on self-supervision deep learning
JP5355483B2 (en) Abbreviation Complete Word Restoration Device, Method and Program
KR102600703B1 (en) Apparatus and method for answering questions related to legal field
Tyers et al. What shall we do with an hour of data? Speech recognition for the un-and under-served languages of Common Voice
JP3983000B2 (en) Compound word segmentation device and Japanese dictionary creation device
JP2009157458A (en) Index creation device, its method, program, and recording medium
JP4915499B2 (en) Synonym dictionary generation system, synonym dictionary generation method, and synonym dictionary generation program
CN117421392B (en) Code searching method and device based on word level alignment
JP5700566B2 (en) Scoring model generation device, learning data generation device, search system, scoring model generation method, learning data generation method, search method and program thereof
US11482214B1 (en) Hypothesis generation and selection for inverse text normalization for search
KR102500106B1 (en) Apparatus and Method for construction of Acronym Dictionary

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14845683

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2015537549

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14845683

Country of ref document: EP

Kind code of ref document: A1