WO2016119507A1

WO2016119507A1 - Object name edit distance calculating method and matching method based on information entropy

Info

Publication number: WO2016119507A1
Application number: PCT/CN2015/094370
Authority: WO
Inventors: 王明兴; 吴颖徽; 马帅; 汤南; 贾西贝
Original assignee: 深圳市华傲数据技术有限公司
Priority date: 2015-01-30
Filing date: 2015-11-12
Publication date: 2016-08-04
Also published as: CN104572627B; CN104572627A

Abstract

The present invention relates to an object name edit distance calculating method and matching method based on information entropy. The edit distance calculating method comprises: step 10, counting the number of times (freq) each character appears and the total number (totalNum) of object names, wherein a character that appears multiple times in one object name is counted only once; step 20, calculating the information entropy of each character according to the ratio between the total number (totalNum) of object names and the number of times (freq) the character appears, to obtain the edit cost of the character; and step 30, when calculating the edit distance between object names, setting the edit cost of inserting or deleting a character equal to the edit cost of that character, and, for a substitution operation, setting the edit cost to zero when the two characters are the same and otherwise to the sum of the edit costs of the two characters. The present invention further provides a corresponding matching method. The present invention more accurately reflects the absolute difference between the character strings of two object names, effectively recognizes the similarity between two object names, and handles the matching of name-type data better.

Description

Title of Invention: Object Name Edit Distance Calculation Method and Matching Method Based on Information Entropy

[0001] The present invention relates to the field of data processing technologies, and in particular, to an object name edit distance calculation method and an object name matching method based on information entropy.

BACKGROUND OF THE INVENTION

[0003] Object recognition, also known as record matching, aims to identify records representing the same real object from various (unreliable) data sources. Object recognition plays an important role in applications such as data cleaning, data integration, and data analysis. Among the data used for object recognition, one type of data that is commonly encountered and very important is name class data, such as institution name, drug name, building name, and the like. How to effectively calculate the similarity between two names is crucial for object recognition.

[0004] The result of name matching is usually obtained by comparing string similarities. Existing string similarity calculation methods include edit distance, vector space, q-gram, and the like. The edit distance is the minimum number of edit operations required to transform one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. The edit cost of inserting or deleting a character is 1; for a replacement operation, the edit cost is 0 when the two characters are the same and 1 otherwise. Similarity is a measure of how alike two strings are. The edit distance between two strings is itself a similarity measure, and dynamic programming algorithms are commonly used to calculate it. Intuitively, the smaller the edit distance, the greater the similarity.
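The conventional unit-cost edit distance described above can be sketched in Python as follows. This is a minimal illustration of the well-known dynamic programming formulation, not code from the patent; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Conventional edit distance: insert, delete, and replace each cost 1."""
    n, m = len(a), len(b)
    # d[i][j] is the distance between the first i characters of a
    # and the first j characters of b.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i  # delete i characters
    for j in range(1, m + 1):
        d[0][j] = j  # insert j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Replacement is free when the characters already match.
            replace_cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                 # delete from a
                d[i][j - 1] + 1,                 # insert into a
                d[i - 1][j - 1] + replace_cost,  # replace or keep
            )
    return d[n][m]
```

Every character is weighted equally here, which is the property the invention later changes.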

[0005] In applications, the edit distance is often converted into a similarity by a predetermined formula. The edit distance reflects the absolute difference between two strings, whereas the similarity expresses how alike the two strings are as a value in [0, 1]; the larger the value, the higher the similarity. Commonly used formulas for calculating the similarity of two strings based on edit distance are as follows:

[0006] similarity = 1.0 - d(n)(m) / (m+n) (1) or

[0007] similarity = 1.0 - d(n)(m) /max (m, n) (2);

[0008] wherein d(n)(m), that is, d(n, m), represents the edit distance between two strings, and m and n are the lengths of the two strings; the larger the calculated similarity, the more similar the two strings are.

[0009] However, existing string similarity calculation methods cannot recognize the intrinsic similarity between two object names well. For example, when the traditional edit distance calculation method and formula (1) are used to compare "Shenzhen Huaao Data Technology Co., Ltd." and "Huaao Data Technology Co., Ltd.", the similarity is as low as 0.77, yet, contrary to this result, people can easily tell that these two names actually represent the same company; the similarity of "Tianjin Nanxun District Hongye Auto Parts Business Department" and "Tianjin Nanxun District Jiuyi Auto Parts Business Department" is 0.87, yet people know that they represent two different companies. Therefore, if the user relies on the edit distance method for name matching, some incorrect conclusions will be drawn, and the similarity between two object names cannot be effectively recognized.
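Formulas (1) and (2) can be sketched as two small Python functions. The function names and the handling of two empty strings (returning 1.0) are assumptions made for this illustration; the source does not address the empty case.

```python
def similarity_sum(distance: float, n: int, m: int) -> float:
    """Formula (1): similarity = 1.0 - d(n)(m) / (m + n)."""
    if m + n == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    return 1.0 - distance / (m + n)


def similarity_max(distance: float, n: int, m: int) -> float:
    """Formula (2): similarity = 1.0 - d(n)(m) / max(m, n)."""
    if max(m, n) == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    return 1.0 - distance / max(m, n)


# Hypothetical input: strings of length 10 and 8 with a unit-cost distance of 2.
print(similarity_sum(2, 10, 8))
print(similarity_max(2, 10, 8))
```

Both formulas normalize by string length only, so a difference in a common, uninformative character lowers the similarity exactly as much as a difference in a rare, distinguishing one.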

SUMMARY OF THE INVENTION

[0011] An object of the present invention is to provide an object name edit distance calculation method based on information entropy, which improves the calculation of the edit distance between two object names.

[0012] Another object of the present invention is to provide an object name matching method based on information entropy, which improves the recognition of similarity between two object names.

[0013] In order to achieve the above object, the present invention provides an object name edit distance calculation method based on information entropy, including:

[0014] Step 10: Collect all the names of the objects to be identified, and count the number of occurrences freq of each character and the total number totalNum of object names; if a character appears multiple times in one object name, it is counted only once;

[0015] Step 20: For each character, calculate the information entropy of the character according to the ratio between the total number totalNum of object names and the number of occurrences freq of the character, and obtain the edit cost of the character from its information entropy;

[0016] Step 30: When calculating the edit distance between object names, the edit cost of inserting or deleting a character is equal to the edit cost of that character; for a replacement operation, the edit cost is 0 when the two characters are the same, and otherwise is the sum of the edit costs of the two characters.

[0017] wherein, the editing cost of the character = the information entropy of the character = log (totalNum / freq).

[0018] wherein, the dynamic programming method is used to calculate the edit distance between the object names.

[0019] wherein the object name is an institution name, a drug name, or a building name.

[0020] wherein the object name includes a Chinese character or an English character.

[0021] The present invention further provides an object name matching method based on information entropy, including:

[0022] Step 1: Collect all the names of the objects to be identified, and count the number of occurrences freq of each character and the total number totalNum of object names; if a character appears multiple times in one object name, it is counted only once;

[0023] Step 2: For each character, calculate the information entropy of the character according to the ratio between the total number totalNum of object names and the number of occurrences freq of the character, and obtain the edit cost of the character from its information entropy;

[0024] Step 3: When calculating the edit distance between object names, the edit cost of inserting or deleting a character is equal to the edit cost of that character; for a replacement operation, the edit cost is 0 when the two characters are the same, and otherwise is the sum of the edit costs of the two characters;

[0025] Step 4. Calculate the similarity between the object names according to the edit distance between the object names.

[0026] wherein d(n)(m) represents the edit distance between two object names of string length n and string length m, and the similarity between the two object names is similarity = 1.0 - d(n)(m) / (d(n)(0) + d(0)(m)).

[0027] wherein d(n)(m) represents the edit distance between two object names of string length n and string length m, and the similarity between the two object names is similarity = 1.0 - d(n)(m) / max(d(n)(0), d(0)(m)).

[0028] wherein, the editing cost of the character = the information entropy of the character = log (totalNum / freq).

[0029] wherein the dynamic programming method is used to calculate the edit distance between the object names.

[0030] In summary, the information entropy-based object name edit distance calculation method of the present invention improves the calculation of the edit distance and more accurately reflects the absolute difference between two object name strings; the information entropy-based object name matching method of the present invention effectively identifies the similarity between two object names and handles the matching of name-type data better.

BRIEF DESCRIPTION OF THE DRAWINGS

[0032] FIG. 1 is a flowchart of a method for calculating an object name edit distance based on information entropy according to a preferred embodiment of the present invention.

The technical scheme of the present invention and its advantageous effects will be apparent from the following detailed description of the embodiments of the invention.

Referring to FIG. 1, a flowchart of a preferred embodiment of the information entropy-based object name edit distance calculation method of the present invention is shown. The method mainly includes:

[0036] Step 10: Collect all the names of the objects to be identified, and count the number of occurrences freq of each character and the total number totalNum of object names; if a character appears multiple times in one object name, it is counted only once;

[0037] Step 20: For each character, calculate the information entropy of the character according to the ratio between the total number totalNum of object names and the number of occurrences freq of the character, and obtain the edit cost of the character from its information entropy;

[0038] The information entropy of a character can be calculated by the formula log(totalNum/freq), and the edit cost of a character can simply be set equal to its information entropy. The logarithm may take 2, e, or any other suitable constant as its base. In the present invention, the formula for the information entropy or the edit cost of a character may be chosen according to the following condition: the more frequently a character appears, the lower its information content and the lower its edit cost; conversely, the less frequently a character appears, the higher its information content, the higher its edit cost, and the more valuable it is for distinguishing objects.

[0039] Through steps 10 and 20, the edit cost of each character can be calculated; an edit cost table (costTable) of all characters is thus obtained, which can be used to further calculate the edit distance and/or similarity of two object names.
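Steps 10 and 20 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `build_cost_table`, the default base 2, and the sample names are all hypothetical and not taken from the patent.

```python
import math
from collections import Counter


def build_cost_table(names, base=2.0):
    """Build an edit cost table: cost(ch) = log(totalNum / freq(ch))."""
    total_num = len(names)  # totalNum: total number of object names
    freq = Counter()
    for name in names:
        # set(name) ensures a character repeated within one name counts once.
        freq.update(set(name))
    return {ch: math.log(total_num / count, base) for ch, count in freq.items()}


# Hypothetical corpus: "a" occurs in every name, so its cost is 0.
cost_table = build_cost_table(["aab", "a"])
print(cost_table)
```

A character that occurs in every name receives a cost of zero, so inserting or deleting it does not change the distance; a character that occurs in only one name receives the highest cost.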

[0040] Step 30: When calculating the edit distance between object names, the edit cost of inserting or deleting a character is equal to the edit cost of that character; for a replacement operation, the edit cost is 0 when the two characters are the same, and otherwise is the sum of the edit costs of the two characters.

[0041] The information entropy-based object name edit distance of the present invention can be calculated with an existing algorithm; for example, the dynamic programming method can be used to calculate the edit distance between object names. The invention recognizes that each character of an object name carries a different weight within the whole name, and improves the edit distance calculation by linking it to the information entropy of the characters. Compared with the original edit distance calculation, the edit cost of inserting or deleting a character changes from 1 to the edit cost of that character, and the edit cost of a replacement operation on two different characters changes from 1 to the sum of the edit costs of the two characters, so that the final edit distance more accurately reflects the absolute difference between the two object name strings. The final edit distance can be used to further calculate the similarity of object names, and can also replace the original edit distance in suitable application fields.
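The cost rules of step 30 can be sketched as two small Python functions. The function names and the sample cost values are hypothetical; the sketch also assumes every character is present in the cost table, which holds when the table is built from all names to be matched.

```python
def insert_or_delete_cost(ch, cost_table):
    """Inserting or deleting a character costs that character's edit cost."""
    return cost_table[ch]


def substitution_cost(ch1, ch2, cost_table):
    """Replacement costs 0 for equal characters, else the sum of both costs."""
    if ch1 == ch2:
        return 0.0
    return cost_table[ch1] + cost_table[ch2]


# Hypothetical costs: "x" is common (cheap), "y" and "z" are rare (expensive).
cost_table = {"x": 0.1, "y": 2.5, "z": 3.0}
print(insert_or_delete_cost("z", cost_table))
print(substitution_cost("x", "y", cost_table))
```

Because replacing one character with a different one costs the sum of both edit costs, a replacement is never cheaper than deleting the first character and inserting the second.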

[0042] The present invention further provides an information entropy-based object name matching method, which calculates the similarity between object names according to the information entropy-based object name edit distance. The present invention recognizes that each character of an object name carries a different weight within the whole name: some characters are critical, while others can usually be ignored in certain situations. For example, in the institution name "Shenzhen Huaao Data Technology Co., Ltd.", the three characters of "Shenzhen City" indicate the region in which the company is located and are usually irrelevant when calculating the similarity among institution names within a specific region, for example when identifying all enterprises in Guangdong Province; "Huaao" is the most important part of the name; "Data Technology" indicates the category of the enterprise and has some reference value; and "Co., Ltd." indicates the nature of the enterprise and is usually irrelevant. Comparing names therefore requires distinguishing the weight of each character. The solution of the present invention builds on the edit distance-based similarity calculation and uses the information entropy of each character: the larger the information entropy of a character, the larger its edit cost and thus its contribution to the edit distance.

[0043] With reference to the process of calculating the original edit distance by a dynamic programming algorithm, the information entropy-based object name edit distance calculation method and object name matching method of the present invention are described below in pseudocode. The edit distance and similarity of two object names are calculated as follows:

[0044] 1. Let the length of object name 1 (name1) be n and the length of object name 2 (name2) be m. Initialize a matrix with n+1 rows and m+1 columns, d[n+1][m+1], and set d[0][0] to 0;

[0045] The present invention uses d(n)(m), that is, d(n, m), to represent the edit distance between two object names of string length n and string length m;

[0046] 2. The values of the first column of the matrix are calculated as:

[0047] for (i <- 1 to n) {

[0048] d(i)(0) = d(i - 1)(0) + insertCost(name1.charAt(i - 1))

[0049] }

[0050] where insertCost(name1.charAt(i - 1)) is the cost of inserting the (i-1)-th character of object name 1 (numbering starts from 0), that is, the edit cost of the (i-1)-th character, which can be looked up in the edit cost table. d(i)(0) is the total edit cost of inserting the first i characters of object name 1;

[0051] 3. The values of the first row of the matrix are calculated as:

[0052] for (j <- 1 to m) {

[0053] d(0)(j) = d(0)(j - 1) + insertCost(name2.charAt(j - 1))

[0054] }

[0055] wherein insertCost(name2.charAt(j - 1)) is the edit cost of the (j-1)-th character of object name 2 (numbering starts from 0);

[0056] 4. Calculate the maximum possible edit cost for the two object names: maxCost = d(n)(0) + d(0)(m);

[0057] 5. Calculate the values of the remaining rows and columns of d(i)(j). For each entry d(i)(j), the calculation is as follows:

[0058] a. Calculate the cost of deleting the character at position i-1 of name1:

[0059] delCosti = d(i - 1)(j) + delCost(name1.charAt(i - 1))

[0060] b. Calculate the cost of deleting the character at position j-1 of name2:

[0061] delCostj = d(i)(j - 1) + delCost(name2.charAt(j - 1))

[0062] c. Calculate the cost of replacing the character at position i-1 of name1 with the character at position j-1 of name2:

[0063] change = d(i - 1)(j - 1) + changeCost(name1.charAt(i - 1), name2.charAt(j - 1))

[0064] The cost of replacing one character with another is: 0 if the two characters are the same; otherwise, the sum of the edit costs of the two characters.

[0065] d. The value of d(i)(j) takes the minimum of the above three costs:

[0066] d(i)(j) = min3(delCosti, delCostj, change)

[0067] 6. After all the values of d(i)(j) have been calculated, d(n)(m) is the minimum edit cost between name1 and name2. The higher the edit cost, the less similar the two names are. The similarity between name1 and name2 can therefore be expressed as:

[0068] similarity = 1.0 - d(n)(m) / maxCost, i.e., similarity = 1.0 - d(n)(m) / (d(n)(0) + d(0)(m)).

[0069] The similarity between name1 and name2 can also be expressed as similarity = 1.0 - d(n)(m) / max(d(n)(0), d(0)(m)), or by another suitable similarity calculation formula.

[0070] When the similarity calculation formulas above are used, the base of the logarithm in log(totalNum/freq) cancels out during the calculation by the change-of-base formula, so the choice of base does not affect the calculated similarity.
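The cancellation of the base can be written out as a short derivation. The symbols c_b (edit cost of a character under base b) and d_b (edit distance under base b) are notation introduced here for illustration, with b > 1 assumed.

```latex
\begin{align*}
% Change of base: every character cost is scaled by the same factor 1/\ln b.
c_b(x) &= \log_b\frac{\mathrm{totalNum}}{\mathrm{freq}(x)}
        = \frac{1}{\ln b}\,\ln\frac{\mathrm{totalNum}}{\mathrm{freq}(x)}
        = \frac{c_e(x)}{\ln b} \\
% A common positive factor does not change which of the three options is minimal,
% so every matrix entry is scaled by the same factor.
d_b(i,j) &= \frac{d_e(i,j)}{\ln b} \\
% The factor cancels between numerator and denominator of the similarity.
1 - \frac{d_b(n,m)}{d_b(n,0) + d_b(0,m)}
  &= 1 - \frac{d_e(n,m)/\ln b}{\bigl(d_e(n,0) + d_e(0,m)\bigr)/\ln b}
   = 1 - \frac{d_e(n,m)}{d_e(n,0) + d_e(0,m)}
\end{align*}
```

The same argument applies to the max-based formula, since max(d_b(n,0), d_b(0,m)) is scaled by the same factor.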

[0071] So far, the similarity between the two object names is calculated.
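The pseudocode of steps 1 to 6 can be sketched as runnable Python. This is an illustrative sketch rather than the patent's implementation: the function names, the `use_max` flag, the sample cost table, and the choice to return 1.0 when the denominator is zero are all assumptions, and every character is assumed to be present in the cost table.

```python
def weighted_distance_matrix(name1, name2, cost_table):
    """Fill the (n+1) x (m+1) matrix d using per-character edit costs."""
    n, m = len(name1), len(name2)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):  # step 2: first column
        d[i][0] = d[i - 1][0] + cost_table[name1[i - 1]]
    for j in range(1, m + 1):  # step 3: first row
        d[0][j] = d[0][j - 1] + cost_table[name2[j - 1]]
    for i in range(1, n + 1):  # step 5: remaining entries
        for j in range(1, m + 1):
            c1, c2 = name1[i - 1], name2[j - 1]
            del_cost_i = d[i - 1][j] + cost_table[c1]
            del_cost_j = d[i][j - 1] + cost_table[c2]
            change_cost = 0.0 if c1 == c2 else cost_table[c1] + cost_table[c2]
            change = d[i - 1][j - 1] + change_cost
            d[i][j] = min(del_cost_i, del_cost_j, change)
    return d


def similarity(name1, name2, cost_table, use_max=False):
    """Step 6: similarity = 1.0 - d(n)(m) / denominator."""
    d = weighted_distance_matrix(name1, name2, cost_table)
    n, m = len(name1), len(name2)
    if use_max:
        denominator = max(d[n][0], d[0][m])
    else:
        denominator = d[n][0] + d[0][m]  # maxCost from step 4
    if denominator == 0:
        return 1.0  # assumption: names made only of zero-cost characters match
    return 1.0 - d[n][m] / denominator


# Hypothetical cost table: "a" is uninformative, "c" is the most distinctive.
costs = {"a": 0.0, "b": 1.0, "c": 2.0}
print(similarity("ab", "b", costs))   # differs only by the zero-cost "a"
print(similarity("ab", "ac", costs))  # differs in the informative characters
```

In this sketch the sum-based denominator keeps the result within [0, 1], whereas the max-based variant can fall below 0 for names that share little, because replacing a character costs as much as deleting it and inserting the other.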

[0072] The information entropy-based object name edit distance calculation method and object name matching method of the present invention are suitable for various object names, in particular institution names, drug names, or building names, and are preferably applied to matching objects of the same type, for example where the data to be identified are all institution names, all drug names, or all building names. The object names may contain Chinese characters, English characters, characters in other languages, or other symbols.

[0073] Experiments have shown that the calculation effect of the present invention is significantly improved compared to the original edit distance calculation method, for example:

[0074] 1. For "Tianjin Nanxun District Hongye Auto Parts Business Department" and "Tianjin Nanxun District Jiuyi Auto Parts Business Department", the similarity from the original edit distance is 0.867, while the value calculated by this method is 0.59; this method is better able to show that they are different companies;

[0075] 2. For "Tianjin Computer Consumables Operation Department of Nanxun District of Tianjin" and "Shunwei Computer Consumables Operation Department of Nanxun District of Tianjin", the similarity from the original edit distance is 0.875, while the value calculated by this method is 0.576, which likewise distinguishes them better;

[0076] 3. For "Nanjing District Tiancheng Medicine and Health Products Research Institute" and "Tianjin Nanxun District Tiancheng Medicine and Health Products Research Institute", the similarity from the original edit distance is 0.8125, while the value calculated by this method is 0.998; this method more clearly reveals that they represent the same company.

[0077] In summary, the information entropy-based object name edit distance calculation method of the present invention improves the calculation of the edit distance and more accurately reflects the absolute difference between two object name strings; the information entropy-based object name matching method of the present invention effectively identifies the similarity between two object names and handles the matching of name-type data better.

The above description is only the preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.


Claims

Claim

[Claim 1] A method for calculating an object name edit distance based on information entropy, comprising: step 10: collecting all the names of the objects to be identified, and counting the number of occurrences freq of each character and the total number totalNum of object names, wherein a character that appears multiple times in one object name is counted only once;

Step 20: for each character, calculating the information entropy of the character according to the ratio between the total number totalNum of object names and the number of occurrences freq of the character, and obtaining the edit cost of the character from its information entropy;

Step 30: when calculating the edit distance between object names, the edit cost of inserting or deleting a character is equal to the edit cost of that character; for a replacement operation, the edit cost is 0 when the two characters are the same, and otherwise is the sum of the edit costs of the two characters.

[Claim 2] The information entropy-based object name edit distance calculation method according to claim 1, wherein the edit cost of the character = the information entropy of the character = log (totalNum/freq).

[Claim 3] The information entropy-based object name edit distance calculation method according to claim 1, wherein the dynamic programming method is used to calculate the edit distance between object names.

[Claim 4] The information entropy-based object name edit distance calculation method according to claim 1, wherein the object name is an institution name, a drug name, or a building name.

[Claim 5] The information entropy-based object name edit distance calculation method according to claim 1, wherein the object name includes a Chinese character or an English character.

[Claim 6] An object name matching method based on information entropy, comprising: step 1: collecting all the names of the objects to be identified, and counting the number of occurrences freq of each character and the total number totalNum of object names, wherein a character that appears multiple times in one object name is counted only once;

Step 2: for each character, calculating the information entropy of the character according to the ratio between the total number totalNum of object names and the number of occurrences freq of the character, and obtaining the edit cost of the character from its information entropy;

Step 3: when calculating the edit distance between object names, the edit cost of inserting or deleting a character is equal to the edit cost of that character; for a replacement operation, the edit cost is 0 when the two characters are the same, and otherwise is the sum of the edit costs of the two characters;

Step 4. Calculate the similarity between the object names according to the edit distance between the object names.

[Claim 7] The information entropy-based object name matching method according to claim 6, wherein d(n)(m) represents the edit distance between two object names of string length n and string length m, and the similarity between the two object names is similarity = 1.0 - d(n)(m) / (d(n)(0) + d(0)(m)).

[Claim 8] The information entropy-based object name matching method according to claim 6, wherein d(n)(m) represents the edit distance between two object names of string length n and string length m, and the similarity between the two object names is similarity = 1.0 - d(n)(m) / max(d(n)(0), d(0)(m)).

[Claim 9] The information entropy-based object name matching method according to claim 6, wherein the edit cost of the character = the information entropy of the character = log (totalNum/freq).

[Claim 10] The information entropy-based object name matching method according to claim 6, wherein the dynamic programming method is used to calculate the edit distance between object names.