WO2015040793A1

WO2015040793A1 - Character string retrieval device

Info

Publication number: WO2015040793A1
Application number: PCT/JP2014/004285
Authority: WO
Inventors: 相川　勇之; 悠介小路
Original assignee: 三菱電機株式会社
Priority date: 2013-09-20
Filing date: 2014-08-21
Publication date: 2015-03-26
Also published as: JP5846340B2; JPWO2015040793A1

Abstract

　The present invention is provided with: a weight rule extraction unit (101) for extracting, from among a plurality of similar character string weight rules in which similarity of a first character string and a second character string consisting of one or more characters is defined, a similar character string weight rule containing the first character string in an inputted retrieval character string; and an editing distance calculation unit (104) for calculating, by using the similar character string weight rule extracted by the weight rule extraction unit (101), a weighted editing distance between the retrieval character string and an entry character string acquired from a dictionary from which the retrieval character string is retrieved.

Description

String search device

The present invention relates to a character string search device capable of obtaining a character string similar to an input search character string as a search result.

A similar character string search is required in cases such as the following: a user searches for an address or facility name using a common name, an abbreviation, or a misremembered name; the entered search character string contains an error caused by a user's operation mistake; or, in a search by voice input, the search character string contains an error caused by misrecognition of the input voice signal. In such cases, a character string that does not exactly match the input search character string but is similar to it must be obtained as the search result.

In the similar character string search, it is necessary to evaluate the degree of similarity between the search character string and the character string obtained as a search result. A weighted edit distance is used as a measure for evaluating the similarity between character strings. For example, Patent Document 1 discloses a method of using a weighted edit distance in order to rank candidate sentence examples and narrow them down.
Non-Patent Document 1 discloses a technique for weighting a replacement rule for converting a character string into another character string.

Patent Document 1: JP 2004-62893 A (FIG. 3)

When a correct character string similar to a search character string containing an error is retrieved from a dictionary using the weighted edit distance as a measure of similarity, as described in Patent Document 1, it is necessary to calculate the weighted edit distance between the erroneous search character string and the correct character string.
When the similarity between two character strings is defined by similar character string weight rules (replacement rules) as described in Non-Patent Document 1, and the weighted edit distance is calculated based on this similarity, a large number of similar character string weight rules are required because a large number of error patterns exist. In the calculation of the weighted edit distance, it must be determined for each of these many rules whether the rule applies to the character string being processed. As a result, there has been a problem that the amount of processing for the applicability determination of the similar character string weight rules during the weighted edit distance calculation becomes very large.

The present invention has been made to solve the above-described problem, and its object is to reduce the processing amount of the weighted edit distance calculation when a character string search device performs a similar character string search.

The character string search device according to the present invention includes: a weight rule extraction unit that extracts, from among a plurality of similar character string weight rules each defining the similarity between a first character string and a second character string consisting of one or more characters, the similar character string weight rules whose first character string is contained in an input search character string; and an edit distance calculation unit that calculates, using the similar character string weight rules extracted by the weight rule extraction unit, a weighted edit distance between the search character string and a heading character string acquired from the dictionary to be searched.

According to the present invention, the similar character string weight rules that may be applied to the input search character string are extracted in advance, and the applicability determination in the weighted edit distance calculation during a similar character string search is made only for the extracted rules. The number of applicability determinations is therefore reduced, and the processing amount of the weighted edit distance calculation can be reduced.

FIG. 1 is a configuration diagram of a character string search device according to Embodiment 1.
FIG. 2 is a table showing an example of similar character string weight rules stored in a rule storage unit of the character string search device according to Embodiment 1.
FIG. 3 is a table showing an example of dictionary data stored in the dictionary.
FIG. 4 is a processing flow of the character string search device according to Embodiment 1.
FIG. 5 is a table showing an example of the extracted similar character string weight rules stored in the application rule storage unit.
FIG. 6 is a processing flow of the weighted edit distance calculation of the character string search device according to Embodiment 1.
FIG. 7 is a processing flow of the edit distance calculation in which the similar character string weight rules are applied in the character string search device according to Embodiment 1.
FIG. 8 is an explanatory diagram illustrating an example of weighted edit distance calculation in the character string search device according to Embodiment 1.
FIG. 9 is a configuration diagram of a character string search device according to Embodiment 2.
FIG. 10 is a processing flow of the weighted edit distance calculation of the character string search device according to Embodiment 2.
FIG. 11 is a processing flow of the abort determination according to Embodiment 2.

Embodiments of the present invention will be described below with reference to the drawings. In the drawings to be referred to, the same or corresponding parts are denoted by the same reference numerals.

Embodiment 1
FIG. 1 is a block diagram of a character string search device according to Embodiment 1 of the present invention. The character string search device according to Embodiment 1 includes a weight rule extraction unit 101, a rule storage unit 102, an application rule storage unit 103, an edit distance calculation unit 104, a dictionary 105, and a distance order alignment unit 106. Based on the input character string 110 input by the user, the weight rule extraction unit 101 extracts the similar character string weight rules related to the input character string 110 from the rule storage unit 102 and stores them in the application rule storage unit 103.

Here, the input character string 110 is the search character string. It is a character string entered with an input device such as a keyboard or a touch panel, or a character string recognized by a voice recognition mechanism (not shown) from a voice signal input through a voice input device. The present invention does not limit the input method of the search character string, and the character string may be input by various other methods.
A similar character string weight rule is a rule that defines the similarity between two character strings (the first character string and the second character string), and is used when calculating the weighted edit distance between the search character string and another character string.

The edit distance calculation unit 104 refers to the dictionary 105 and calculates the weighted edit distance between the input character string 110 and each heading character string stored in the dictionary 105, using the similar character string weight rules stored in the application rule storage unit 103. The distance order alignment unit 106 outputs the heading character strings in the dictionary 105 as a similar character string list 111 in ascending order of the weighted edit distance calculated by the edit distance calculation unit 104. The similar character string list 111 is used, for example, for presentation to the user on a display (not shown).

The weight rule extraction unit 101, the edit distance calculation unit 104, and the distance order alignment unit 106 can be realized by hardware such as an ASIC (Application Specific Integrated Circuit), by dedicated hardware consisting of a processor and software executed on it, by software running on a general-purpose computer, or by a combination of these.
The rule storage unit 102, the application rule storage unit 103, and the dictionary 105 may be configured using a volatile storage medium such as a RAM (Random Access Memory) or a non-volatile storage medium such as an HDD (Hard Disk Drive). Alternatively, they may be read and written remotely via a communication line, or a removable device may be used.

FIG. 2 is a table showing an example of similar character string weight rules stored in the rule storage unit 102. The rule number 201 is a number uniquely assigned to identify each similar character string weight rule. The left-side character string 202, which is the first character string, and the right-side character string 203, which is the second character string, form the pair of character strings whose similarity is defined by each similar character string weight rule, and the similarity score 204 indicates the similarity between the left-side character string 202 and the right-side character string 203. Here, a smaller similarity score 204 means a higher similarity.

The example of FIG. 2 includes combinations that are similar in pronunciation and frequently cause spelling errors, such as “PH” and “F” (rule number 201 = 2) and “C” and “K” (rule number 201 = 20). It also includes combinations such as “A” and “AA” (rule number 201 = 52) for handling errors such as a dropped character when the same character is entered consecutively on a keyboard or the like.
These similar character string weight rules may be defined manually in advance, or may be constructed by collecting many examples of erroneous input and applying a statistical method such as machine learning.
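As an illustration of the rule table described for FIG. 2, the following is a minimal sketch of how such rules might be represented. The class name `WeightRule`, the field names, and the score values are assumptions made for this example; only the rule numbers and character string pairs are taken from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightRule:
    number: int   # rule number 201
    left: str     # left-side character string 202 (first character string)
    right: str    # right-side character string 203 (second character string)
    score: float  # similarity score 204; a smaller score means more similar


# The pairs and rule numbers follow the description of FIG. 2.
# The scores are placeholders, since the text does not state them.
RULES = [
    WeightRule(2, "PH", "F", 0.3),
    WeightRule(20, "C", "K", 0.3),
    WeightRule(52, "A", "AA", 0.2),
]

# Lookup by rule number, as a rule storage unit might provide.
by_number = {rule.number: rule for rule in RULES}
```

A score below 1 makes applying a rule cheaper than a single ordinary character edit, which is what lets a rule lower the resulting distance.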

FIG. 3 is a table showing an example of dictionary data stored in the dictionary 105. The dictionary 105 stores a heading character string and attribute information of the heading character string. Here, the attribute information is information such as the type (part of speech) and content of the heading character string.

Next, the operation of the character string search device according to Embodiment 1 of the present invention will be described. FIG. 4 is a processing flow of the character string search device according to Embodiment 1. When the input character string 110 is input to the character string search device, the weight rule extraction unit 101 first performs rule extraction processing (ST201). In ST201, based on the input character string 110, the weight rule extraction unit 101 extracts, from among the plurality of similar character string weight rules stored in the rule storage unit 102, those that may be applied to the input character string 110.

The similar character string weight rules that may be applied are those whose left-side character string 202 is contained in the input character string 110. For each character position from the beginning to the end of the input character string 110, the weight rule extraction unit 101 refers to the partial character string from the first character up to that position and selects the rules whose left-side character string 202 matches a part of that partial character string that includes its last character (for example, for the partial character string up to the third character, a part that includes the third character). Extracting in this way allows the extracted rules to be classified by corresponding character position. The extracted similar character string weight rules are stored in the application rule storage unit 103.

FIG. 5 is a table showing an example of similar character string weight rules stored in the application rule storage unit 103. In the application rule storage unit 103, similar character string weight rules are classified and stored for each corresponding character position in the input character string 110.
The example in FIG. 5 shows the similar character string weight rules extracted from those stored in the rule storage unit 102 illustrated in FIG. 2 when the input character string 110 is the three characters “CHA”.

In FIG. 5, the input character position 205 indicates which character of the input character string 110 (“CHA”) a rule corresponds to, and the rules are classified by the character position to which they correspond.
As described above, for each character position from the beginning to the end of the input character string 110, the weight rule extraction unit 101 refers to the partial character string from the first character up to that position and extracts the rules whose left-side character string 202 matches a part of that partial character string that includes its last character.

That is, when the input character string 110 is “CHA”, for a character whose character position is 1 (that is, “C”), a rule defined by only one character whose left-side character string 202 is “C” is extracted. Similarly, for a character whose character position is 2 (that is, “H”), a rule defined by one character “H” or two characters “CH” in the left side character string 202 is extracted. For a character whose character position is 3 (that is, “A”), the rule defined by the left-side character string 202 as one character “A”, two characters “HA”, or three characters “CHA” is extracted.
As a result, the similar character string weight rules extracted for each character position are classified as shown in FIG. 5.
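The per-position extraction described above can be sketched as follows. This is a minimal illustration, not the implementation of the device: the function name `extract_rules`, the tuple layout, the rule numbers, and the score values are assumptions, and the “HA” rule is invented for the example. The “CH” to “CQ” rule with score 0.4 follows the worked example later in this description.

```python
def extract_rules(query, rules):
    """Group rules by the 1-based position at which their left side ends.

    A rule is kept for position i when its left-side string equals a
    substring of `query` that ends with the i-th character.
    """
    by_pos = {i: [] for i in range(1, len(query) + 1)}
    for i in by_pos:
        prefix = query[:i]  # partial string from the first to the i-th character
        for number, left, right, score in rules:
            if left and prefix.endswith(left):
                by_pos[i].append((number, left, right, score))
    return by_pos


# (rule number, left side, right side, similarity score); values are illustrative.
SAMPLE_RULES = [
    (1, "C", "K", 0.3),
    (2, "CH", "CQ", 0.4),
    (3, "A", "AA", 0.2),
    (4, "HA", "A", 0.5),
    (5, "PH", "F", 0.3),
]

applied = extract_rules("CHA", SAMPLE_RULES)
```

With this sample, position 1 receives the “C” rule, position 2 the “CH” rule, and position 3 the “A” and “HA” rules, while the “PH” rule is never extracted because “PH” does not occur in “CHA”. The extraction runs once per search character string, not once per dictionary entry.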

The number of rules 206 indicates the number of rules corresponding to each input character position 205. In the example of FIG. 5, ten rules corresponding to the first character (“C”), six rules corresponding to the second character (“H”), and nine rules corresponding to the third character (“A”) are extracted from the rule storage unit 102.
If the number of characters in the input character string 110 is five, lines with input character positions 4 and 5 are added in the table of FIG.

To make the following processing flow easier to follow, it is assumed that the similar character string weight rules stored in the application rule storage unit 103 are held in an array Rule[i][k] (1 ≦ i ≦ number of characters in the input character string 110, 1 ≦ k ≦ number of rules 206 for the i-th character) in order of rule number 201, and that the edit distance calculation unit 104 can refer to this array Rule. It is also assumed that the number of rules 206 for the i-th character is stored in an array N_R[i], which the edit distance calculation unit 104 can likewise refer to.
For example, in the case of FIG. 5, for the second character “H”, the rule with rule number 201 = 41 is stored in Rule[2][1], the rule with rule number 201 = 42 is stored in Rule[2][2], and the number of rules 206 = 6 is stored in N_R[2].

Next, the edit distance calculation unit 104 performs weighted edit distance calculation (ST202). In the processing of ST202, the editing distance calculation unit 104 sequentially reads the heading character strings from the dictionary 105, and calculates the weighted editing distance between each heading character string and the input character string 110. The processing content of the weighted edit distance calculation will be described later.

Next, the distance order alignment unit 106 sorts the heading character strings that were read from the dictionary 105 and for which the weighted edit distance was calculated in ST202 in ascending order of weighted edit distance (closest first), and outputs them as the similar character string list 111. The output similar character string list 111 is used, for example, for presentation to the user on a display (not shown).
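The ordering step performed after the distance calculation can be sketched as below. The function name `rank_headings` is an assumption, and the distance values are invented placeholders supplied through a lookup table so that the example stays independent of any particular distance function.

```python
def rank_headings(headings, distance):
    """Return (distance, heading) pairs in ascending order of distance."""
    scored = [(distance(h), h) for h in headings]
    scored.sort(key=lambda pair: pair[0])  # closest heading first
    return scored


# Placeholder distances for illustration only.
example_distances = {"CQA": 0.4, "CHA": 0.0, "CAB": 2.0}
similar_list = rank_headings(["CQA", "CHA", "CAB"], example_distances.__getitem__)
```

In a full pipeline, `distance` would be the weighted edit distance between the search character string and each heading character string.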

Next, the processing content of ST202 weighted edit distance calculation performed by the edit distance calculation unit 104 for each heading character string read from the dictionary 105 will be described. Here, the DP (Dynamic Programming) method is used for the edit distance calculation. In the present invention, the editing distance calculation method is not limited to the DP method, and other methods may be used. FIG. 6 is a processing flow of the weighted editing distance calculation processing in ST202.

The edit distance from the source character string (referred to as Str1), which is the input character string 110, to the target character string (referred to as Str2), which is each heading character string, is obtained. First, the table used for the edit distance calculation by the DP method is initialized (ST301). Based on the character string length of Str1 (hereinafter |Str1|) and the character string length of Str2 (hereinafter |Str2|), this edit distance calculation table is a two-dimensional array M[i, j] (0 ≦ i ≦ |Str1|, 0 ≦ j ≦ |Str2|). In ST301, the table is initialized by assigning 0 to M[0, 0], i to M[i, 0] (1 ≦ i ≦ |Str1|), and j to M[0, j] (1 ≦ j ≦ |Str2|).
Note that whether Str1 and Str2 match may be determined before ST301, and if they match, the processing from ST302 onward may be skipped.

Next, the initial value of variable i is set to 1, and the process from step ST302 to step ST310 is repeated | Str1 | times while incrementing the value of variable i by 1.
Similarly, the process from step ST303 to step ST309 is repeated | Str2 | times while the initial value of variable j is set to 1 and the value of variable j is incremented by 1 in the loop processing based on variable i described above.

Details of loop processing based on variable i and variable j will be described. Note that the i-th character of Str1 is represented as Str1 [i]. First, the characters Str1 [i] and Str2 [j] are compared (ST304). If they are equal, 0 is assigned to the variable score (ST305), and if they are not equal, 1 is assigned to the variable score (ST306). It should be noted that substituting 1 in ST306 indicates that one character replacement is required to make Str1 and Str2 the same character string.

Next, the value of M[i, j] in the edit distance calculation table is updated (ST307). In this process, the three values M[i-1, j-1] + score, M[i-1, j] + 1, and M[i, j-1] + 1 are compared, and M[i, j] is updated with the smallest of them.
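The initialization of ST301 and the per-cell update of ST304 to ST307 correspond to the ordinary (unweighted) edit distance by dynamic programming. A minimal sketch, with the function name `plain_edit_distance` chosen for this example and the rule-based step ST308 deliberately left out:

```python
def plain_edit_distance(str1, str2):
    """Edit distance by DP, covering ST301 and ST304 to ST307 only."""
    n, m = len(str1), len(str2)
    # ST301: M[0][0] = 0, M[i][0] = i, M[0][j] = j
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i
    for j in range(1, m + 1):
        M[0][j] = j
    for i in range(1, n + 1):          # loop over characters of Str1
        for j in range(1, m + 1):      # loop over characters of Str2
            # ST304 to ST306: 0 if the characters match, otherwise 1
            score = 0 if str1[i - 1] == str2[j - 1] else 1
            # ST307: take the smallest of the three candidates
            M[i][j] = min(M[i - 1][j - 1] + score,
                          M[i - 1][j] + 1,
                          M[i][j - 1] + 1)
    return M[n][m]
```

Python strings are 0-indexed, so the i-th character Str1[i] of the description is `str1[i - 1]` here.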

Next, edit distance calculation using the similar character string weight rules is performed (ST308). FIG. 7 is a processing flow of the edit distance calculation using the similar character string weight rules in ST308. The processing from ST401 to ST405 is repeated N_R[i] times (that is, once for each of the number of rules 206 for the i-th character of the input character string 110), with the variable k running from 1 to N_R[i].

In this loop, it is first determined whether Rule[i][k] is applicable (similar character string weight rule applicability determination) (ST402). Specifically, it is determined whether the right-side character string 203 of the similar character string weight rule stored in Rule[i][k] matches the partial character string of Str2 (the target character string) that ends with its j-th character; if it matches, the rule is determined to be applicable.

If it is determined in ST402 that it is applicable, the contents of the rule are acquired (ST403). In ST403, the similarity score 204 defined in Rule [i] [k] is assigned to the variable rule_score, and the number of characters of the left side character string is assigned to the variable len1, and the number of characters of the right side character string is assigned to the variable len2.
Next, M [i, j] in the edit distance calculation table is updated (ST404). In this process, the two values M [i, j] and M [i-len1, j-len2] + rule_score are compared, and M [i, j] is updated with the smaller value.
On the other hand, if it is determined in ST402 that it is not applicable, the process proceeds to ST405.
When the processing flow shown in FIGS. 6 and 7 is finished in this way, the value finally stored in M[|Str1|, |Str2|] is the weighted edit distance between the input character string 110 and the heading character string.

A specific operation example will be described for the case where the input character string 110 is “CHA”, as in the above example, and the heading character string read from the dictionary 105 is “CQA” (that is, Str1 = “CHA”, Str2 = “CQA”).

FIG. 8A shows the edit distance calculation table initialized in ST301 in this operation example. When processing proceeds from this state according to the flows shown in FIGS. 6 and 7 and ST307 in the loop iteration with i = 2 and j = 2 is completed, the edit distance calculation table is as shown in FIG. 8B.
When ST308 (that is, ST401 in FIG. 7) is started and k = 1, the right-side character string 203 (“CQ”) of the rule stored in Rule[2][1] matches the partial character string (“CQ”) of Str2 that ends with its second character (j = 2), so the determination result in ST402 is true (Y).

In ST403, the similarity score 204 (that is, 0.4) is assigned to the variable rule_score, the character string length of the left-side character string 202 is assigned to the variable len1, and the character string length of the right-side character string 203 is assigned to the variable len2, so that rule_score = 0.4, len1 = 2, and len2 = 2. In ST404, M[i-len1, j-len2] + rule_score is compared with the value of M[i, j] before the update, and M[i, j] is updated with the smaller value. In this example, M[2, 2] is 1, whereas M[i-len1, j-len2] + rule_score is M[0, 0] + 0.4 = 0.4, so M[2, 2] is updated to 0.4 and the edit distance calculation table becomes as shown in FIG. 8C.

Thereafter, when the processing is continued and the flows of FIGS. 6 and 7 are completed, the edit distance calculation table becomes as shown in FIG. 8D, and the weighted edit distance is calculated as M[3, 3] = 0.4.
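The combined flow of FIGS. 6 and 7 and the “CHA” / “CQA” example can be sketched as follows. This is an illustration under stated assumptions rather than the device's implementation: the function name `weighted_edit_distance` and the `rules_by_pos` layout (a mapping from 1-based input character position to `(left, right, score)` tuples) are chosen for this example, and only the single rule “CH” to “CQ” with score 0.4 from the text is included, so intermediate table values may differ from FIG. 8.

```python
def weighted_edit_distance(str1, str2, rules_by_pos):
    """Weighted edit distance by DP with per-position similarity rules."""
    n, m = len(str1), len(str2)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):           # ST301
        M[i][0] = float(i)
    for j in range(1, m + 1):
        M[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = 0 if str1[i - 1] == str2[j - 1] else 1   # ST304 to ST306
            M[i][j] = min(M[i - 1][j - 1] + score,            # ST307
                          M[i - 1][j] + 1,
                          M[i][j - 1] + 1)
            # ST308: try each rule extracted for input position i
            for left, right, rule_score in rules_by_pos.get(i, []):
                len1, len2 = len(left), len(right)
                # ST402: the right side must match the part of Str2 ending at j.
                # endswith() is False when len2 > j, so j - len2 never goes negative.
                if len1 <= i and str2[:j].endswith(right):
                    # ST404: keep the smaller of the two candidates
                    M[i][j] = min(M[i][j], M[i - len1][j - len2] + rule_score)
    return M[n][m]


# Rule from the worked example: "CH" -> "CQ" with similarity score 0.4,
# registered at input character position 2 (where "CH" ends in "CHA").
example_rules = {2: [("CH", "CQ", 0.4)]}
result = weighted_edit_distance("CHA", "CQA", example_rules)
```

With this rule the cell for i = 2, j = 2 drops from 1 to 0.4, and the final matching “A” adds nothing, which agrees with the value M[3, 3] = 0.4 stated in the example. Without any rules the same call yields the ordinary edit distance of 1.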

In this embodiment, the case has been described where the edit distance required to make the input character string 110 and the heading character string match is obtained. However, when the input character string 110 is shorter than the heading character string, the weighted edit distance M[|Str1|, J] may be calculated for each J in the range |Str1| ≦ J ≦ |Str2|, and the minimum value may be output as the weighted edit distance for that heading character string. This calculation ignores the surplus characters (the characters of the heading character string after the J-th character that gives the smallest weighted edit distance), so that when the user inputs only the beginning of the target name, the input can be complemented and the formal name obtained.
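The prefix variant described above can be sketched as follows. To keep the example short, the similar character string weight rules are omitted and only the ordinary DP update is used; that simplification, the function name `prefix_edit_distance`, and the sample strings are assumptions of this sketch.

```python
def prefix_edit_distance(str1, str2):
    """Return (distance, J): the smallest M[|Str1|][J] for |Str1| <= J <= |Str2|.

    Characters of str2 after position J are ignored. When str1 is not
    shorter than str2, the ordinary value M[|Str1|][|Str2|] is returned.
    """
    n, m = len(str1), len(str2)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i
    for j in range(1, m + 1):
        M[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = 0 if str1[i - 1] == str2[j - 1] else 1
            M[i][j] = min(M[i - 1][j - 1] + score,
                          M[i - 1][j] + 1,
                          M[i][j - 1] + 1)
    if n >= m:
        return M[n][m], m
    # Scan the last row from column |Str1| to |Str2| and keep the minimum.
    best_j = min(range(n, m + 1), key=lambda j: M[n][j])
    return M[n][best_j], best_j
```

For instance, an input of “TOKYO” against a heading “TOKYO STATION” gives distance 0 at J = 5, whereas the full edit distance between the two strings would be 8.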

As described above, the similar character string weight rules that may be applied to the input character string are extracted, based on the input character string serving as the search character string, from the similar character string weight rules stored in the rule storage unit 102, and only the extracted rules are subjected to the applicability determination in the weighted edit distance calculation. The number of applicability determinations can thus be reduced, which speeds up the similar character string search processing of the character string search device.

The effect is especially large when the number of similar character string weight rules is very large. For example, in a language such as French, in which many spellings have similar sounds, hundreds of similar character string weight rules are needed to handle users' misunderstandings, keystroke errors, and voice recognition errors. With the weight rule extraction unit, however, the number of rules subject to applicability determination when calculating the weighted edit distance can be reduced to about several tens or fewer, and the similar character string search processing can be greatly sped up.

If the dictionary contains a large amount of data and there are many heading character strings for which the weighted edit distance is to be calculated, some prior narrowing means may be provided before the weighted edit distance is calculated, such as narrowing using tf-idf (term frequency-inverse document frequency) weights as shown in, for example, Patent Document 1.

In this embodiment, the rules extracted from the similar character string weight rules stored in the rule storage unit 102 are stored in the application rule storage unit 103. Alternatively, for example, only the rule numbers may be stored in the application rule storage unit 103, and the similar character string weight rules may be acquired from the rule storage unit 102 based on those rule numbers when the weighted edit distance is calculated.

In the above-described embodiment, an example of searching for a character string has been described. However, this character string may be a word string.
Further, although this embodiment has been described using an alphabetic input character string, the present invention is not limited to alphabetic input character strings, and the input may be a character string in another writing system (for example, hiragana or kanji).

Embodiment 2
FIG. 9 is a block diagram of a character string search device according to Embodiment 2 of the present invention. It differs from the character string search device of Embodiment 1 in that the edit distance calculation unit 104a has a function of aborting the edit distance calculation partway, based on a distance upper limit value 112 set from outside. The weight rule extraction unit 101, rule storage unit 102, application rule storage unit 103, dictionary 105, distance order alignment unit 106, input character string 110, and similar character string list 111 are the same as in Embodiment 1.

Next, the operation of the character string search device according to Embodiment 2 will be described, focusing on the differences from the character string search device according to Embodiment 1. FIG. 10 is a processing flow of the weighted edit distance calculation of the character string search device according to Embodiment 2. Each process from ST301 to ST310 is the same as in Embodiment 1. In this embodiment, an abort determination for the weighted edit distance calculation (ST311) is performed between ST309 and ST310.

FIG. 11 is a processing flow of the abort determination in ST311. The processing of ST311 is described in detail below with reference to FIG. 11. Assume that the current value of the variable i is Z. First, the minimum value MinL of the character position (the table index corresponding to the input character string 110) in the edit distance calculation table that may be referred to in the weighted edit distance calculation while the variable i goes from Z to |Str1| is calculated (ST501).
Specifically, MinL can be obtained as follows. Let MaxLen_p be the maximum number of characters of the left-side character strings of the rules stored in Rule[p][r] (Z ≦ p ≦ |Str1|, 1 ≦ r ≦ number of rules 206 for the p-th character). Then M[p-MaxLen_p, q] (0 ≦ q ≦ |Str2|) may be referred to in the processing from ST307 onward. MinL is therefore the minimum value of p-MaxLen_p over the range Z ≦ p ≦ |Str1|.

For example, suppose that Str1 = "CHA" and Str2 = "CQA", and that the array Rule is as shown in FIG. 5. From FIG. 5, the maximum length MaxLen_2 of the left-side character strings for input character position 2 is 2, and the maximum length MaxLen_3 of the left-side character strings for input character position 3 is 2. When Z is 2, the values of p-MaxLen_p (2 ≦ p ≦ 3) are 2-MaxLen_2 = 0 and 3-MaxLen_3 = 1, so MinL = 0.
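The calculation of MinL in ST501 can be sketched as below. The function name `compute_min_l`, the `rules_by_pos` layout, and the sample rules are assumptions for this example; the sample is chosen only so that MaxLen_2 = 2 and MaxLen_3 = 2, as in the text. Treating a position with no rules as MaxLen_p = 0 is also an assumption, since the description does not cover that case.

```python
def compute_min_l(z, str1_len, rules_by_pos):
    """Minimum of p - MaxLen_p over z <= p <= str1_len (requires 1 <= z <= str1_len).

    rules_by_pos maps a 1-based input character position p to a list of
    (left, right, score) tuples.
    """
    candidates = []
    for p in range(z, str1_len + 1):
        lefts = [left for left, _right, _score in rules_by_pos.get(p, [])]
        # Assumption: a position without rules contributes MaxLen_p = 0.
        max_len_p = max((len(s) for s in lefts), default=0)
        candidates.append(p - max_len_p)
    return min(candidates)


# Illustrative rules for Str1 = "CHA": the longest left side is 2 at both
# position 2 ("CH") and position 3 ("HA").
sample_rules = {
    1: [("C", "K", 0.3)],
    2: [("CH", "CQ", 0.4)],
    3: [("A", "AA", 0.2), ("HA", "A", 0.5)],
}
min_l = compute_min_l(2, 3, sample_rules)
```

Here the candidates are 2 - 2 = 0 and 3 - 2 = 1, so the result is 0, matching the value MinL = 0 in the example above.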

Next, the distance upper limit value 112 is compared with M[x, y] (MinL ≦ x ≦ Z, 0 ≦ y ≦ |Str2|) (ST502). If at least one M[x, y] is smaller than the distance upper limit value 112, the process proceeds to ST310 and the edit distance calculation is continued. If all M[x, y] are larger than the distance upper limit value 112, the weighted edit distance cannot become smaller than the distance upper limit value 112 even if the calculation is continued, so the edit distance calculation is aborted and the minimum value at that time is output as the weighted edit distance for the heading character string Str2.
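The comparison in ST502 can be sketched as follows. The function name `should_abort` and the sample table are assumptions for this example. The description does not say what happens when a value equals the upper limit exactly; this sketch treats equality as a reason to continue, which is the conservative choice.

```python
def should_abort(M, min_l, z, upper_limit):
    """True when every M[x][y] with min_l <= x <= z exceeds upper_limit."""
    return all(M[x][y] > upper_limit
               for x in range(min_l, z + 1)
               for y in range(len(M[x])))


# Rows 0 to 2 of a table for Str1 = "XYZ" and Str2 = "CQA" (no matching characters).
partial_table = [
    [0, 1, 2, 3],
    [1, 1, 2, 3],
    [2, 2, 2, 3],
]
abort_now = should_abort(partial_table, 1, 2, 0.5)
```

Because M[0, 0] is always 0, a check whose range starts at MinL = 0 can never trigger an abort for a positive upper limit, so early termination only becomes possible once MinL has moved past row 0.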

As described above, by performing the abort determination against a preset distance upper limit value, the calculation of weighted edit distances that exceed the distance upper limit value can be omitted, which speeds up the similar character string search.

The effect is particularly large when a distance upper limit value is set so that only highly similar character strings are retrieved, for example when the number of heading character strings stored in a dictionary of facility names, addresses, and the like reaches hundreds of thousands to millions.

As described above, the character string search device of the present invention realizes an increase in the processing speed of the device and a reduction in the processing capability required for the device by reducing the processing amount of the weighted edit distance calculation in the character string search. Therefore, it is useful in an apparatus for performing a character string search such as a car navigation system.

101 weight rule extraction unit, 102 rule storage unit, 103 application rule storage unit, 104, 104a edit distance calculation unit, 105 dictionary, 106 distance order alignment unit, 110 input character string, 111 similar character string list, 112 distance upper limit value, 201 rule number, 202 left-side character string, 203 right-side character string, 204 similarity score, 205 input character position, 206 number of rules.

Claims

1. A character string search device comprising:
a weight rule extraction unit that extracts, from among a plurality of similar character string weight rules each defining the similarity between a first character string and a second character string consisting of one or more characters, the similar character string weight rules whose first character string is contained in an input search character string; and
an edit distance calculation unit that calculates, using the similar character string weight rules extracted by the weight rule extraction unit, a weighted edit distance between the search character string and a heading character string acquired from a dictionary in which the search character string is searched for.
2. The character string search device according to claim 1, further comprising an application rule storage unit that stores the similar character string weight rules extracted by the weight rule extraction unit.
3. The character string search device according to claim 2, wherein the application rule storage unit stores, for each character position of the search character string, the extracted similar character string weight rules whose first character string matches a part of the search character string that ends at that character position, as the extracted similar character string weight rules corresponding to that character position.
4. The character string search device according to any one of claims 1 to 3, further comprising a distance order alignment unit that arranges the heading character strings for which the weighted edit distance has been calculated by the edit distance calculation unit in ascending order of the calculated weighted edit distance.
5. The character string search device according to claim 1, wherein the edit distance calculation unit, when it determines during the calculation of the weighted edit distance that the weighted edit distance being calculated is equal to or greater than a predetermined distance upper limit value, aborts the calculation of the weighted edit distance and uses the calculated value of the weighted edit distance at the time of the determination as the calculation result.