WO2015014287A1

WO2015014287A1 - Method and apparatus for calculating similarity between Chinese character strings based on editing distance

Info

Publication number: WO2015014287A1
Application number: PCT/CN2014/083326
Authority: WO
Inventors: 王平; 贾西贝
Original assignee: 深圳市华傲数据技术有限公司
Priority date: 2013-07-31
Filing date: 2014-07-30
Publication date: 2015-02-05
Also published as: CN103399907A

Abstract

A method for calculating the similarity between Chinese character strings based on edit distance first calculates the similarity between the Chinese characters in the strings to be compared and then calculates the similarity between the strings themselves. The Chinese characters in the strings are converted into four-corner codes by four-corner coding, so that the similarity between Chinese characters can be calculated from the edit distance between their codes; the weight in the edit distance is then replaced with this character similarity to calculate the similarity between the strings. Converting Chinese characters into numeric strings for comparison improves the precision of Chinese character matching, and replacing the edit distance weight with the character similarity makes the edit distance algorithm practical for matching Chinese character strings and improves the accuracy of the matching results. An apparatus for calculating the similarity between Chinese character strings based on edit distance is also provided.

Description

Method and device for calculating Chinese string similarity based on edit distance

The present invention relates to the field of Chinese string similarity, and more particularly to a method and apparatus for calculating Chinese string similarity based on edit distance.

Background

Comparing the similarity of Chinese strings is a common technique in string matching, text comparison, and information extraction. Different applications use different Chinese string similarity techniques. Common techniques include the matching algorithm based on edit distance, the matching algorithm based on glyph and pronunciation, and the Smith-Waterman algorithm.

In the edit distance algorithm, the similarity of two strings is found in three steps: first, the edit distance matrix is constructed; second, the values of the matrix cells are calculated from left to right and from top to bottom; third, the value of the bottom-right cell is taken as the edit distance of the two strings. The algorithm is suited to handling spelling errors and is easy to implement and use.

In the algorithm based on glyph and pronunciation, the similarity of two strings is found as follows: first, the glyph similarity between the strings is calculated using a glyph encoding, the five-stroke encoding; second, the pronunciation rules of the initials and finals of Chinese characters are used to calculate the initial similarity and the final similarity of the strings, and the pronunciation similarity between the strings is then calculated taking into account the fuzzy sounds common in dialects or Mandarin. In many application scenarios, the edit distance algorithm and the algorithm based on Chinese character glyph and pronunciation are used together to improve the accuracy of the calculated string similarity.

The Smith-Waterman algorithm is an improved version of the edit distance algorithm. The improvement is that the matrix cell values are calculated by applying compensation to the delete, insert, and replace operations. It is currently a widely used sequence similarity comparison algorithm and is suited to finding locally similar sequence pairs.

When applied to the specific problem of matching Chinese strings in a Chinese-language environment, the classical string matching method based on edit distance becomes less practical. The algorithm based on Chinese character glyphs does take glyphs into account, but it relies only on the five-stroke encoding of Chinese characters.

Summary of the invention

To remedy at least one of the above defects, an embodiment of the present invention provides a method and apparatus for calculating Chinese string similarity based on edit distance. The invention converts the Chinese characters in a string into four-corner codes by four-corner number coding, calculates the similarity of Chinese characters from the edit distance between their codes, and then replaces the weight of the edit distance with the similarity of the Chinese characters to calculate the similarity of the strings. Converting Chinese characters into numeric strings for comparison improves the precision of Chinese character matching, and using the similarity of the Chinese characters in place of the edit distance weight makes the edit distance algorithm practical for matching Chinese strings in a Chinese-language environment and improves the accuracy of the matching results.

Accordingly, an embodiment of the present invention provides a method for calculating Chinese string similarity based on edit distance. The method first calculates the similarity of the Chinese characters in the strings to be compared, and then calculates the similarity of the Chinese strings to be compared.

Calculating the similarity of the Chinese characters comprises the following steps:

converting the Chinese characters into four-corner codes using four-corner numbers;

calculating the similarity of the Chinese characters using the edit distance.

Calculating the similarity of the Chinese strings to be compared comprises the following steps:

using the similarity of the Chinese characters in place of the weight of the edit distance;

calculating the similarity of the strings to be compared based on the improved edit distance.

Preferably, converting the Chinese characters into four-corner codes comprises: establishing in advance a rule table of the four-corner number lookup method; and converting the Chinese characters in the strings to be compared into the corresponding four-corner codes according to the rule table.

Another embodiment of the present invention provides an apparatus for calculating Chinese string similarity based on edit distance, the apparatus comprising: a Chinese character similarity calculation device, configured to obtain the similarity of Chinese characters according to the edit distance; and a Chinese string similarity calculation device, configured to obtain the similarity of the strings to be compared.

Preferably, the Chinese character similarity calculation device includes a four-corner number rule device and a four-corner code conversion device.

The four-corner number rule device is configured to establish in advance a rule table of the four-corner number lookup method; the four-corner code conversion device converts the Chinese characters in the strings to be compared into four-corner codes according to the rule information in the four-corner number rule device.

The Chinese character similarity calculation device acquires the numeric strings from the four-corner code conversion device and calculates the similarity of the Chinese characters based on the edit distance.

Preferably, the Chinese string similarity calculation device includes an edit distance weight device; the edit distance weight device acquires the similarity information of the Chinese characters from the Chinese character similarity calculation device and sets it as the weight of the edit distance.

The Chinese string similarity calculation device acquires the weight information from the edit distance weight device and calculates the similarity of the strings based on the improved edit distance.

Brief description of the drawings

FIG. 1 is a schematic flow chart of a method for calculating a Chinese string similarity based on an edit distance according to an embodiment of the present invention.

FIG. 2 is a schematic flow chart of calculating the similarity of Chinese characters by using the edit distance according to an embodiment of the present invention.

Detailed description

The present invention will be further described in detail below in conjunction with the drawings and embodiments. It is understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Embodiments of the present invention provide a method and apparatus for calculating Chinese string similarity based on edit distance. The invention converts the Chinese characters in a string into four-corner codes by four-corner number coding, calculates the similarity of Chinese characters from the edit distance between their codes, and then replaces the weight of the edit distance with the similarity of the Chinese characters to calculate the similarity of the strings. Converting Chinese characters into numeric strings for comparison improves the accuracy of Chinese character matching. Using the similarity of the Chinese characters in place of the edit distance weight makes the edit distance algorithm practical for matching Chinese strings in a Chinese-language environment and improves the accuracy of the matching results.

FIG. 1 is a schematic flowchart of the method for calculating Chinese string similarity based on edit distance according to an embodiment of the present invention. The method first calculates the similarity of the Chinese characters in the strings to be compared, and then calculates the similarity of the Chinese strings to be compared. Specifically, it includes the following steps:

Step S110: Convert the Chinese character into a four-corner code with a four-corner number.

Embodiments of the present invention rely on four-corner numbers and the edit distance algorithm. The four-corner number method is one of the character lookup methods commonly used in Chinese dictionaries; it classifies a Chinese character using up to five Arabic numerals. The method uses the digits 0 to 9 to denote ten stroke shapes of a Chinese character, and sometimes appends one supplementary digit at the end. The edit distance between two strings is the minimum number of edit operations required to transform one string into the other. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.

In this step, the rule table of the four-corner number lookup method must be established in advance according to its mnemonic: 横一垂二三点捺，叉四插五方框六，七角八八九是小，点下有横变零头 (a horizontal stroke is 1, a vertical stroke is 2, a dot or right-falling stroke is 3; a cross is 4, a skewer is 5, a box is 6; a corner is 7, the shape 八 is 8, the shape 小 is 9; a dot above a horizontal stroke becomes 0).

According to the above rules, the four-corner codes of the Chinese characters in the strings to be compared are obtained. In Example 1, the Chinese string s1 is "Huaao Data" (华傲数据) and the Chinese string s2 is "China Pride" (中华骄傲).

With four-corner number coding, a Chinese character can be converted into a form represented by its four-corner number. For example, the four-corner number of "Hua" (华) is 24401 and that of "flower" (花) is 44214. The codes of the characters in s1 and s2 are shown in Table 1 below:

Character in s1      Code     Character in s2   Code
Hua (华)             24401    Medium (中)       50006
Proud (傲)           28240    Hua (华)          24401
Number (数)          98440    Pride (骄)        72128
According to (据)    57064    Proud (傲)        28240

Table 1 Four-corner numbers of the Chinese characters
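The conversion in step S110 amounts to a table lookup. The following is a minimal sketch, assuming the rule table is held as an in-memory dictionary; the names `FOUR_CORNER` and `to_four_corner` are hypothetical, the table contains only the codes quoted in this document, and the Chinese characters are inferred from the translated glosses and their codes.

```python
# Hypothetical rule table: only the characters and codes quoted in this document.
# The characters themselves are inferred from the translated glosses.
FOUR_CORNER = {
    "华": "24401",  # "Hua"
    "傲": "28240",  # "Proud"
    "数": "98440",  # "Number"
    "据": "57064",  # "According to"
    "中": "50006",  # "Medium"
    "骄": "72128",  # "Pride"
    "花": "44214",  # "flower"
}


def to_four_corner(text):
    """Return the four-corner code of every character in text, in order."""
    codes = []
    for ch in text:
        if ch not in FOUR_CORNER:
            raise ValueError(f"no four-corner code known for {ch!r}")
        codes.append(FOUR_CORNER[ch])
    return codes


print(to_four_corner("华傲数据"))  # ['24401', '28240', '98440', '57064']
print(to_four_corner("中华骄傲"))  # ['50006', '24401', '72128', '28240']
```

A real rule table would need to cover the full character set in use; how that table is built is not specified here.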

Step S120: Calculate the similarity of the Chinese characters by using the edit distance.

To calculate the edit distance between the string s1 and the string s2, first define a function edit(i, j), which denotes the edit distance from the substring of length i of the first string s1 to the substring of length j of the second string s2. The following dynamic programming formulas then hold:

(1) if $i = 0$ and $j = 0$: $\mathrm{edit}(i, j) = 0$;

(2) if $i = 0$ and $j > 0$: $\mathrm{edit}(i, j) = j$;

(3) if $i > 0$ and $j = 0$: $\mathrm{edit}(i, j) = i$;

(4) if $i > 0$ and $j > 0$: $\mathrm{edit}(i, j) = \min\{\mathrm{edit}(i-1, j) + 1,\ \mathrm{edit}(i, j-1) + 1,\ \mathrm{edit}(i-1, j-1) + f(i, j)\}$, where $f(i, j) = 1$ when the i-th character of the first string is not equal to the j-th character of the second string, and $f(i, j) = 0$ otherwise.
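Formulas (1) to (4) translate directly into a table-filling routine. The sketch below is one straightforward way to write it; the function name `edit_distance` is an assumption, and the example input uses the two four-corner codes quoted in this document.

```python
def edit_distance(s1, s2):
    """Classic edit distance computed with formulas (1) to (4)."""
    m, n = len(s1), len(s2)
    # edit[i][j] is the distance between the first i characters of s1
    # and the first j characters of s2; edit[0][0] = 0 is formula (1).
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        edit[0][j] = j  # formula (2)
    for i in range(1, m + 1):
        edit[i][0] = i  # formula (3)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # f(i, j): 0 if the i-th and j-th characters match, else 1
            f = 0 if s1[i - 1] == s2[j - 1] else 1
            edit[i][j] = min(
                edit[i - 1][j] + 1,      # delete
                edit[i][j - 1] + 1,      # insert
                edit[i - 1][j - 1] + f,  # replace or keep; formula (4)
            )
    return edit[m][n]


print(edit_distance("24401", "50006"))  # prints 4
```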

The process of calculating the similarity of Chinese characters using the edit distance in this step is shown in Figure 2.

First, the lengths m and n of the Chinese strings s1 and s2 of Example 1 are obtained; here m = n = 4. Applying dynamic programming formulas (1), (2) and (3) of the edit distance algorithm above gives the initialization result shown in Table 2 below:

                          Medium (中)   Hua (华)   Pride (骄)   Proud (傲)
                     0    1             2          3            4
Hua (华)             1
Proud (傲)           2
Number (数)          3
According to (据)    4

Table 2 Initialization results

Next, calculate the value of edit(i, j), the edit distance between the substring of length i of s1 and the substring of length j of s2.

First calculate edit(1, 1), the edit distance between the string "Hua" (华) and the string "Medium" (中). According to dynamic programming formula (4), edit(i, j) = min{edit(i-1, j) + 1, edit(i, j-1) + 1, edit(i-1, j-1) + f(i, j)}, where f(i, j) is here the edit distance between "Hua" and "Medium".

The edit distance algorithm is applied to the four-corner codes, and the result is taken as the edit distance of the Chinese characters.

For the four-corner codes of "Hua" and "Medium", calculate edit(1, 1) as shown in Table 3 below: edit(0, 1) + 1 = 2, edit(1, 0) + 1 = 2, and edit(0, 0) + f(1, 1) = 0 + 1 = 1, so min{edit(0, 1) + 1, edit(1, 0) + 1, edit(0, 0) + f(1, 1)} = 1 and therefore edit(1, 1) = 1.

Table 3: Calculation of the edit distance between the Chinese characters "Hua" (24401) and "Medium" (50006) (1).

And so on, for edit(2, 2): edit(2, 1) + 1 = 3, edit(1, 2) + 1 = 3, and edit(1, 1) + f(2, 2) = 1 + 1 = 2, where s1[2] = '4' and s2[2] = '0' are different, so f(2, 2) = 1 and edit(2, 2) = 2. Exchanging adjacent characters is not counted as an operation when taking the minimum. The results are shown in Table 4 and Table 5 below.

           5    0    0    0    6
      0    1    2    3    4    5
 2    1    1
 4    2    2
 4    3
 0    4
 1    5

Table 4: Edit distance table for the Chinese characters "Hua" (24401) and "Medium" (50006)

Table 5: Edit distance table for the Chinese characters "Hua" (24401) and "Medium" (50006)
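The fully populated table for the two codes can be regenerated rather than read off. The following sketch, assuming the plain unit-cost formulas (1) to (4), fills the whole matrix for 24401 and 50006 and prints it row by row; `edit_matrix` is a hypothetical helper name.

```python
def edit_matrix(s1, s2):
    """Return the full (len(s1)+1) x (len(s2)+1) edit distance matrix."""
    m, n = len(s1), len(s2)
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        edit[0][j] = j  # first row: j insertions
    for i in range(m + 1):
        edit[i][0] = i  # first column: i deletions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            f = 0 if s1[i - 1] == s2[j - 1] else 1
            edit[i][j] = min(edit[i - 1][j] + 1,
                             edit[i][j - 1] + 1,
                             edit[i - 1][j - 1] + f)
    return edit


matrix = edit_matrix("24401", "50006")
# Print a header with the digits of the second code, then one row per
# digit of the first code (the first data row has no leading digit).
print("      " + "  ".join("50006"))
for label, row in zip(" 24401", matrix):
    print(label, " ".join(f"{value:2d}" for value in row))
```

The bottom-right cell of the printed matrix is the edit distance between the two codes.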

Therefore, the edit distance between "Hua" (24401) and "Medium" (50006) is 4, which normalized by the code length gives 4/5 = 0.8. In this step, the Chinese characters are converted into numeric strings for similarity comparison, which avoids the ambiguity of matching Chinese characters by glyph and pronunciation and improves the accuracy of Chinese character matching.
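The normalization above can be sketched as a small helper. This is an illustration under one assumption: the distance is divided by the length of the longer code (both codes have five digits in the example, so the divisor is 5). The names `edit_distance` and `char_weight` are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, keeping only one previous row in memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_weight(code_a, code_b):
    """Edit distance between two four-corner codes, scaled to [0, 1].

    Assumption: the divisor is the length of the longer code.
    """
    longest = max(len(code_a), len(code_b))
    if longest == 0:
        return 0.0
    return edit_distance(code_a, code_b) / longest


print(char_weight("24401", "50006"))  # prints 0.8
print(char_weight("24401", "24401"))  # prints 0.0
```

A value of 0 means the two codes are identical and a value of 1 means no digit could be kept in place.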

Step S130: The similarity of the Chinese characters is used instead of the weight of the editing distance.

Using the similarity of the Chinese characters obtained above in place of the weight of the edit distance when calculating the similarity of the strings to be compared makes the edit distance algorithm practical for matching Chinese strings in a Chinese-language environment and improves the accuracy of the matching results.

Step S140: Calculate the similarity of the string to be compared based on the improved edit distance.

From step S120, the value of f(1, 1) is 0.8; that is, the edit distance between "Hua" and "Medium" is 0.8. Then edit(1, 1) = min{edit(0, 1) + 1, edit(1, 0) + 1, edit(0, 0) + f(1, 1)} = min{1 + 1, 1 + 1, 0 + 0.8} = 0.8. And so on, calculate the value of edit(i, j) for 0 < i <= m and 0 < j <= n. The results are shown in Table 6 below:

Table 6: Edit distance calculation for the Chinese strings "Huaao Data" and "China Pride".

In this embodiment of the present invention, the edit distance between the Chinese strings "Huaao Data" and "China Pride" is 3.2, and the similarity is 1 - 3.2/4 = 0.2.
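Steps S130 and S140 can be sketched end to end as follows. This is an illustrative sketch, not the patented implementation: it assumes that insertions and deletions keep a cost of 1, that the replacement cost f(i, j) is the code edit distance divided by the code length, and that the string similarity is 1 minus the weighted distance divided by the longer string length. The table `FOUR_CORNER` holds only the codes quoted in this document, with the characters inferred from the translated glosses; all function names are hypothetical.

```python
# Hypothetical lookup table built from the codes quoted in this document.
FOUR_CORNER = {
    "华": "24401", "傲": "28240", "数": "98440",
    "据": "57064", "中": "50006", "骄": "72128",
}


def edit_distance(a, b):
    """Unit-cost edit distance between two code strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_weight(c1, c2):
    """Replacement weight f for two characters, from their four-corner codes."""
    code1, code2 = FOUR_CORNER[c1], FOUR_CORNER[c2]
    return edit_distance(code1, code2) / max(len(code1), len(code2))


def weighted_edit_distance(s1, s2):
    """Edit distance in which f(i, j) is the character weight, not 0 or 1."""
    m, n = len(s1), len(s2)
    edit = [[0.0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        edit[0][j] = float(j)
    for i in range(1, m + 1):
        edit[i][0] = float(i)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            f = char_weight(s1[i - 1], s2[j - 1])
            edit[i][j] = min(edit[i - 1][j] + 1,
                             edit[i][j - 1] + 1,
                             edit[i - 1][j - 1] + f)
    return edit[m][n]


def string_similarity(s1, s2):
    """1 - distance / longer length (assumed normalization)."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1 - weighted_edit_distance(s1, s2) / longest


d = weighted_edit_distance("华傲数据", "中华骄傲")
print(round(d, 2), round(string_similarity("华傲数据", "中华骄傲"), 2))  # 3.2 0.2
```

Under these assumptions the sketch arrives at the same figures as the worked example, a distance of 3.2 and a similarity of 0.2.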

Another embodiment of the present invention provides an apparatus for calculating Chinese string similarity based on edit distance, the apparatus comprising: a Chinese character similarity calculation device, configured to obtain the similarity of Chinese characters according to the edit distance; and a Chinese string similarity calculation device, configured to obtain the similarity of the strings to be compared.

The Chinese character similarity calculation device includes a four-corner number rule device and a four-corner code conversion device. The four-corner number rule device is configured to establish in advance a rule table of the four-corner number lookup method according to its mnemonic: 横一垂二三点捺，叉四插五方框六，七角八八九是小，点下有横变零头 (a horizontal stroke is 1, a vertical stroke is 2, a dot or right-falling stroke is 3; a cross is 4, a skewer is 5, a box is 6; a corner is 7, the shape 八 is 8, the shape 小 is 9; a dot above a horizontal stroke becomes 0).

The Chinese characters in the strings to be compared are obtained according to the above rules. As in Example 1, the Chinese string s1 is "Huaao Data" (华傲数据) and the Chinese string s2 is "China Pride" (中华骄傲). The four-corner code conversion device converts the Chinese characters in the strings to be compared into four-corner codes according to the rule information in the four-corner number rule device. For example, the four-corner number of "Hua" (华) is 24401 and that of "flower" (花) is 44214.

The Chinese character similarity calculation device acquires the numeric strings from the four-corner code conversion device and calculates the similarity of the Chinese characters based on the edit distance. The lengths m and n of the Chinese strings s1 and s2 are obtained; here m = n = 4. Applying dynamic programming formulas (1), (2) and (3) of the edit distance algorithm above gives the initialization result, after which the value of edit(i, j) is calculated, that is, the edit distance between the substring of length i of s1 and the substring of length j of s2.

First calculate edit(1, 1), the edit distance between the string "Hua" (华) and the string "Medium" (中). According to dynamic programming formula (4), edit(i, j) = min{edit(i-1, j) + 1, edit(i, j-1) + 1, edit(i-1, j-1) + f(i, j)}, where f(i, j) is here the edit distance between "Hua" and "Medium".

The edit distance algorithm is applied to the four-corner codes, and the result is taken as the edit distance of the Chinese characters. For the four-corner codes of "Hua" and "Medium", calculate edit(1, 1) as shown in Table 3: edit(0, 1) + 1 = 2, edit(1, 0) + 1 = 2, and edit(0, 0) + f(1, 1) = 0 + 1 = 1, so min{edit(0, 1) + 1, edit(1, 0) + 1, edit(0, 0) + f(1, 1)} = 1 and therefore edit(1, 1) = 1.

The Chinese string similarity calculation device includes an edit distance weight device. The edit distance weight device acquires the similarity information of the Chinese characters from the Chinese character similarity calculation device and sets it as the weight of the edit distance. The Chinese string similarity calculation device acquires the weight information from the edit distance weight device and calculates the similarity of the strings based on the improved edit distance. The value of f(1, 1) is 0.8; that is, the edit distance between "Hua" and "Medium" is 0.8. Then edit(1, 1) = min{edit(0, 1) + 1, edit(1, 0) + 1, edit(0, 0) + f(1, 1)} = min{1 + 1, 1 + 1, 0 + 0.8} = 0.8. And so on, calculate the value of edit(i, j) for 0 < i <= m and 0 < j <= n. Therefore, in this embodiment of the present invention, the Chinese strings "Huaao Data" and "China Pride" have an edit distance of 3.2 and a similarity of 1 - 3.2/4 = 0.2.

Embodiments of the present invention provide a method and apparatus for calculating Chinese string similarity based on edit distance. The invention converts the Chinese characters in a string into four-corner codes by four-corner number coding, calculates the similarity of Chinese characters from the edit distance between their codes, and then replaces the weight of the edit distance with the similarity of the Chinese characters to calculate the similarity of the strings. Converting Chinese characters into numeric strings for comparison improves the precision of Chinese character matching, and using the similarity of the Chinese characters in place of the edit distance weight makes the edit distance algorithm practical for matching Chinese strings in a Chinese-language environment and improves the accuracy of the matching results.
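The division of the apparatus into cooperating devices can be pictured as a handful of small classes. The sketch below is only one possible software reading of that structure: every class and method name is hypothetical, the rule table holds only the codes quoted in this document (with characters inferred from the translated glosses), and the normalizations (divide by code length, divide by the longer string length) are assumptions.

```python
def weighted_edit_distance(a, b, sub_cost):
    """Edit distance where replacing x by y costs sub_cost(x, y)."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [float(i)]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + sub_cost(x, y)))
        prev = curr
    return prev[-1]


class FourCornerRuleDevice:
    """Holds the pre-established character-to-code rule table."""
    def __init__(self, table):
        self._table = dict(table)

    def code_of(self, char):
        return self._table[char]


class FourCornerConversionDevice:
    """Converts a string into a list of four-corner codes."""
    def __init__(self, rules):
        self._rules = rules

    def convert(self, text):
        return [self._rules.code_of(ch) for ch in text]


class CharacterSimilarityDevice:
    """Compares two characters through the edit distance of their codes."""
    def __init__(self, converter):
        self._converter = converter

    def distance(self, c1, c2):
        code1, code2 = self._converter.convert(c1 + c2)  # two single characters
        unit = lambda x, y: 0.0 if x == y else 1.0
        return weighted_edit_distance(code1, code2, unit) / max(len(code1), len(code2))


class EditDistanceWeightDevice:
    """Exposes the character result as the replacement weight."""
    def __init__(self, char_device):
        self._char_device = char_device

    def weight(self, c1, c2):
        return self._char_device.distance(c1, c2)


class StringSimilarityDevice:
    """Computes string similarity from the improved edit distance."""
    def __init__(self, weight_device):
        self._weight_device = weight_device

    def similarity(self, s1, s2):
        longest = max(len(s1), len(s2))
        if longest == 0:
            return 1.0
        d = weighted_edit_distance(s1, s2, self._weight_device.weight)
        return 1 - d / longest


rules = FourCornerRuleDevice({"华": "24401", "傲": "28240", "数": "98440",
                              "据": "57064", "中": "50006", "骄": "72128"})
chars = CharacterSimilarityDevice(FourCornerConversionDevice(rules))
strings = StringSimilarityDevice(EditDistanceWeightDevice(chars))
print(round(strings.similarity("华傲数据", "中华骄傲"), 2))  # prints 0.2
```

Keeping the weight device separate from the character device mirrors the description, in which the string-level calculation only sees a weight and does not need to know how it was derived.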

The above is a further detailed description of the present invention in connection with specific preferred embodiments, and the specific implementation of the present invention is not limited to this description. Those skilled in the art to which the present invention pertains may make a number of simple deductions or substitutions without departing from the inventive concept.

Claims

1. A method for calculating Chinese string similarity based on edit distance, comprising: calculating the similarity of the Chinese characters in the strings to be compared; and calculating the similarity of the Chinese strings to be compared, wherein calculating the similarity of the Chinese characters comprises the following steps:

converting the Chinese characters into four-corner codes using four-corner numbers;

calculating the similarity of the Chinese characters using the edit distance;

2. The method according to claim 1, wherein converting the Chinese characters into four-corner codes comprises:

establishing in advance a rule table of the four-corner number lookup method;

converting the Chinese characters in the strings to be compared into the corresponding four-corner codes according to the rule table.

3. An apparatus for calculating a Chinese string similarity based on an edit distance, wherein the apparatus comprises:

a Chinese character similarity calculation device, configured to obtain the similarity of Chinese characters according to the edit distance; and a Chinese string similarity calculation device, configured to obtain the similarity of the strings to be compared.

4. The device according to claim 3, wherein the Chinese character similarity calculation device comprises a four-corner number rule device and a four-corner code conversion device;

the four-corner number rule device is configured to establish in advance a rule table of the four-corner number lookup method; the four-corner code conversion device converts the Chinese characters in the strings to be compared into four-corner codes according to the rule information in the four-corner number rule device; the Chinese character similarity calculation device acquires the numeric strings from the four-corner code conversion device and calculates the similarity of the Chinese characters based on the edit distance.

5. The device according to claim 3, wherein the Chinese string similarity calculation device comprises an edit distance weight device;

the edit distance weight device acquires the similarity information of the Chinese characters from the Chinese character similarity calculation device and sets it as the weight of the edit distance;