WO2016119507A1 - Object name edit distance calculating method and matching method based on information entropy - Google Patents
- Publication number
- WO2016119507A1 (application PCT/CN2015/094370)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- character
- edit
- name
- cost
- object name
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
Definitions
- The present invention relates to the field of data processing technologies, and in particular to an information-entropy-based object name edit distance calculation method and an object name matching method.
- Object recognition, also known as record matching, aims to identify records that represent the same real-world object across various (unreliable) data sources. It plays an important role in applications such as data cleaning, data integration, and data analysis. Among the data used for object recognition, name data, such as institution names, drug names, and building names, is both common and very important. Effectively calculating the similarity between two names is therefore crucial for object recognition.
- The result of name matching is usually obtained by comparing string similarity.
- Existing string similarity calculation methods include edit distance, vector space, q-gram, and the like.
- The edit distance is the minimum number of edit operations required to transform one string into the other.
- The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.
- Inserting or deleting a character has an edit cost of 1; for a replacement, the edit cost is 0 when the two characters are the same and 1 otherwise.
- Similarity is a measure of how alike two strings are.
- The edit distance between two strings is itself a similarity measure, and it is usually calculated with a dynamic programming algorithm. Intuitively, the smaller the edit distance, the greater the similarity.
- In the similarity formulas, d(n)(m), that is, d(n, m), denotes the edit distance between the two strings.
- m and n are the lengths of the two strings.
- The larger the calculated similarity, the more similar the two strings are.
- However, existing string similarity calculation methods cannot reliably recognize the intrinsic similarity between two object names.
- An object of the present invention is to provide an information-entropy-based object name edit distance calculation method that improves the calculation of the edit distance between two object names.
- Another object of the present invention is to provide an information-entropy-based object name matching method that improves the recognition of the similarity between two object names.
- The present invention provides an object name edit distance calculation method based on information entropy, including:
- Step 10: Collect all object names to be identified, and count the number of occurrences freq of each character and the total number of object names totalNum; a character that appears several times in one object name is counted only once for that name;
- Step 20: For each character, calculate the information entropy of the character from the ratio between the total number of object names totalNum and the number of occurrences freq of the character, and derive the edit cost of the character from its information entropy;
- Step 30: When calculating the edit distance between object names, the edit cost of inserting or deleting a character equals the edit cost of that character; for a replacement, the edit cost is 0 when the two characters are the same, and otherwise is the sum of the edit costs of the two characters.
- The edit distance between object names is calculated with a dynamic programming method.
- The object name is an institution name, a drug name, or a building name.
- The object name contains Chinese characters or English characters.
- The present invention further provides an object name matching method based on information entropy, including:
- Step 1: Collect all object names to be identified, and count the number of occurrences freq of each character and the total number of object names totalNum; a character that appears several times in one object name is counted only once for that name;
- Step 2: For each character, calculate the information entropy of the character from the ratio between the total number of object names totalNum and the number of occurrences freq of the character, and derive the edit cost of the character from its information entropy;
- Step 3: When calculating the edit distance between object names, the edit cost of inserting or deleting a character equals the edit cost of that character; for a replacement, the edit cost is 0 when the two characters are the same, and otherwise is the sum of the edit costs of the two characters;
- Step 4: Calculate the similarity between object names from the edit distance between them.
- With d(n)(m) denoting the edit distance between two object names of string lengths n and m, the similarity between the two object names is similarity = 1.0 - d(n)(m) / (d(n)(0) + d(0)(m)).
- With d(n)(m) denoting the edit distance between two object names of string lengths n and m, the similarity between the two object names is similarity = 1.0 - d(n)(m) / max(d(n)(0), d(0)(m)).
- In summary, the information-entropy-based object name edit distance calculation method of the present invention improves the way the edit distance is calculated and more accurately reflects the absolute difference between two object name strings; the information-entropy-based object name matching method of the present invention can effectively recognize the similarity between two object names and handles the matching of name data more effectively.
- FIG. 1 is a flowchart of a preferred embodiment of the information-entropy-based object name edit distance calculation method of the present invention.
- The information-entropy-based object name edit distance calculation method mainly includes:
- Step 10: Collect all object names to be identified, and count the number of occurrences freq of each character and the total number of object names totalNum; a character that appears several times in one object name is counted only once for that name;
- Step 20: For each character, calculate the information entropy of the character from the ratio between the total number of object names totalNum and the number of occurrences freq of the character, and derive the edit cost of the character from its information entropy;
- The information entropy of a character can be calculated with the formula log(totalNum/freq).
- The edit cost of a character can simply be set equal to the information entropy of the character.
- The logarithm can take 2, e, or any other suitable constant as its base.
- The formula for the information entropy or the edit cost of a character can be chosen according to the following principle: the more frequently a character appears, the lower its information content and the lower its edit cost; conversely, a rarer character carries more information, has a higher edit cost, and is more valuable for distinguishing objects.
- Step 30: When calculating the edit distance between object names, the edit cost of inserting or deleting a character equals the edit cost of that character; for a replacement, the edit cost is 0 when the two characters are the same, and otherwise is the sum of the edit costs of the two characters.
- The information-entropy-based object name edit distance of the present invention can be calculated with existing algorithms.
- For example, a dynamic programming method can be used to calculate the edit distance between object names.
- The invention takes into account that each character of an object name carries a different weight within the whole name. It introduces the information entropy of characters into the edit distance calculation: compared with the original edit distance, the edit cost of inserting or deleting a character changes from 1 to the edit cost of that character.
- For a replacement of two different characters, the edit cost changes from 1 to the sum of the edit costs of the two characters, so that the final edit distance more accurately reflects the absolute difference between two object name strings.
- The final edit distance can be used to further calculate the similarity of object names, and can also replace the original edit distance in suitable applications.
- The present invention accordingly also provides an information-entropy-based object name matching method, which calculates the similarity between object names from the information-entropy-based object name edit distance.
- The present invention takes into account that the characters of an object name carry different weights within the whole name: some characters are critical, while others are usually ignored in certain contexts, as in the institution name "Shenzhen Huaao Data Technology Co., Ltd." (深圳市华傲数据技术有限公司).
- Let the length of object name 1 (name1) be n and the length of object name 2 (name2) be m; initialize a matrix d[n+1][m+1] with n+1 rows and m+1 columns, and set d[0][0] to 0;
- The present invention uses d(n)(m), that is, d(n, m), to denote the edit distance between two object names of string lengths n and m;
- insertCost(name1.charAt(i - 1)) is the cost of inserting the (i-1)-th character of object name 1 (numbering starts from 0), that is, the edit cost of that character, which can be looked up in the edit cost table.
- d(i)(0) is the total edit cost of inserting the first i characters of object name 1;
- The values of the first row of the matrix are calculated as follows:
- insertCost(name2.charAt(j - 1)) is the edit cost of the (j-1)-th character of object name 2;
- delCostj = d(i)(j - 1) + delCost(name2.charAt(j - 1))
- The cost of replacing one character with another is 0 if the two characters are the same; otherwise it is the sum of the edit costs of the two characters.
- d(n)(m) is the minimum edit cost between name1 and name2.
- The information-entropy-based object name edit distance calculation method and object name matching method of the present invention are suitable for various object names, in particular institution names, drug names, or building names, and are preferably applied to objects of the same type; for example, the data to be identified are all institution names, all drug names, or all building names.
- The object name can contain Chinese characters, English characters, characters of other languages, or other symbols.
- In summary, the information-entropy-based object name edit distance calculation method of the present invention improves the way the edit distance is calculated and more accurately reflects the absolute difference between two object name strings; the information-entropy-based object name matching method of the present invention can effectively recognize the similarity between two object names and handles the matching of name data more effectively.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Algebra (AREA)
- Pure & Applied Mathematics (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to an object name edit distance calculating method and matching method based on information entropy. The edit distance calculating method comprises: step 10, counting the number of times (freq) each character appears and the total number (totalNum) of object names, wherein a character that appears multiple times in one object name is counted only once; step 20, calculating the information entropy of each character according to the ratio between the total number (totalNum) of object names and the number of times (freq) the character appears, to obtain an edit cost of the character; and step 30, when calculating the edit distance between object names, setting the edit cost of inserting or deleting a character equal to the edit cost of that character and, for a substitution operation, setting the edit cost to zero when the two characters are the same and otherwise to the sum of the edit costs of the two characters. The present invention further provides a corresponding matching method. The present invention can more accurately reflect the absolute difference between the character strings of two object names, can effectively recognize the similarity between two object names, and handles the matching of name-type data more effectively.
Description
Title of the invention: Object name edit distance calculation method and matching method based on information entropy
TECHNICAL FIELD
[0001] The present invention relates to the field of data processing technologies, and in particular to an information-entropy-based object name edit distance calculation method and an object name matching method.
[0002] BACKGROUND OF THE INVENTION
[0003] Object recognition, also known as record matching, aims to identify records that represent the same real-world object across various (unreliable) data sources. It plays an important role in applications such as data cleaning, data integration, and data analysis. Among the data used for object recognition, name data, such as institution names, drug names, and building names, is both common and very important. Effectively calculating the similarity between two names is therefore crucial for object recognition.
[0004] The result of name matching is usually obtained by comparing string similarity. Existing string similarity calculation methods include edit distance, vector space, q-gram, and the like. The edit distance is the minimum number of edit operations required to transform one string into the other. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. Inserting or deleting a character has an edit cost of 1; for a replacement, the edit cost is 0 when the two characters are the same and 1 otherwise. Similarity is a measure of how alike two strings are. The edit distance between two strings is itself a similarity measure, and it is usually calculated with a dynamic programming algorithm. Intuitively, the smaller the edit distance, the greater the similarity.
[0005] In applications, the edit distance is often converted into a similarity according to a predetermined formula in order to evaluate how similar two strings are. The edit distance reflects the absolute difference between two strings, whereas the similarity expresses their degree of similarity as a value in [0, 1]; the larger the value, the higher the similarity. Commonly used formulas for calculating the similarity of two strings from the edit distance include:
[0006] similarity = 1.0 - d(n)(m) / (m + n) (1) or
[0007] similarity = 1.0 - d(n)(m) / max(m, n) (2);
[0008] where d(n)(m), that is, d(n, m), denotes the edit distance between the two strings; m and n are the lengths of the two strings; the larger the calculated similarity, the more similar the two strings are.
[0009] However, existing string similarity calculation methods cannot reliably recognize the intrinsic similarity between two object names. For example, when the traditional edit distance calculation is used with formula (1) to compare "Shenzhen Huaao Data Technology Co., Ltd." (深圳市华傲数据技术有限公司) with "Huaao Data Technology Co., Ltd." (华傲数据技术有限公司), the resulting similarity is a relatively low 0.77; contrary to this result, people can easily tell that the two names actually denote the same company. The similarity between "Tianjin Nankai District Hongye Auto Parts Business Department" (天津市南开区宏业汽车配件经营部) and "Tianjin Nankai District Jiusheng Auto Parts Business Department" (天津市南开区久晟汽车配件经营部) is 0.87, yet people know that they denote two different companies. Therefore, when users match names with the edit distance method, they reach some incorrect conclusions and cannot effectively recognize the similarity between two object names.
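As a minimal sketch of the conventional approach described in [0004] to [0008], the following Python computes the classic unit-cost edit distance and both similarity formulas. The function and variable names are illustrative and do not come from the patent; the strings are the two name pairs from [0009], with the district name assumed to read 南开.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insert, delete and replace each cost 1."""
    n, m = len(a), len(b)
    prev = list(range(m + 1))          # row d(0)(j)
    for i in range(1, n + 1):
        cur = [i] + [0] * m            # d(i)(0) = i
        for j in range(1, m + 1):
            replace = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,            # delete a[i-1]
                         cur[j - 1] + 1,         # insert b[j-1]
                         prev[j - 1] + replace)  # replace or keep
        prev = cur
    return prev[m]


def similarity_sum(a: str, b: str) -> float:
    """Formula (1): 1 - d / (m + n). Empty input is treated as identical (assumption)."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else 1.0 - levenshtein(a, b) / total


def similarity_max(a: str, b: str) -> float:
    """Formula (2): 1 - d / max(m, n). Empty input is treated as identical (assumption)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


full_name = "深圳市华傲数据技术有限公司"
short_name = "华傲数据技术有限公司"
tianjin_a = "天津市南开区宏业汽车配件经营部"
tianjin_b = "天津市南开区久晟汽车配件经营部"

print(levenshtein(full_name, short_name))                    # 3
print(round(similarity_sum(full_name, short_name), 2),
      round(similarity_max(full_name, short_name), 2))       # 0.87 0.77
print(levenshtein(tianjin_a, tianjin_b))                     # 2
print(round(similarity_sum(tianjin_a, tianjin_b), 2),
      round(similarity_max(tianjin_a, tianjin_b), 2))        # 0.93 0.87
```

In this sketch the figures 0.77 and 0.87 quoted in [0009] are obtained with formula (2), which divides by max(m, n); formula (1) yields 0.87 and 0.93 for the same two pairs.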
[0010] SUMMARY OF THE INVENTION
[0011] An object of the present invention is to provide an information-entropy-based object name edit distance calculation method that improves the calculation of the edit distance between two object names.
[0012] Another object of the present invention is to provide an information-entropy-based object name matching method that improves the recognition of the similarity between two object names.
[0013] To achieve the above objects, the present invention provides an object name edit distance calculation method based on information entropy, including:
[0014] Step 10: Collect all object names to be identified, and count the number of occurrences freq of each character and the total number of object names totalNum; a character that appears several times in one object name is counted only once for that name;
[0015] Step 20: For each character, calculate the information entropy of the character from the ratio between the total number of object names totalNum and the number of occurrences freq of the character, and derive the edit cost of the character from its information entropy;
[0016] Step 30: When calculating the edit distance between object names, the edit cost of inserting or deleting a character equals the edit cost of that character; for a replacement, the edit cost is 0 when the two characters are the same, and otherwise is the sum of the edit costs of the two characters.
[0017] wherein the edit cost of a character = the information entropy of the character = log(totalNum/freq).
[0018] wherein the edit distance between object names is calculated with a dynamic programming method.
[0019] wherein the object name is an institution name, a drug name, or a building name.
[0020] wherein the object name contains Chinese characters or English characters.
[0021] The present invention further provides an object name matching method based on information entropy, including:
[0022] Step 1: Collect all object names to be identified, and count the number of occurrences freq of each character and the total number of object names totalNum; a character that appears several times in one object name is counted only once for that name;
[0023] Step 2: For each character, calculate the information entropy of the character from the ratio between the total number of object names totalNum and the number of occurrences freq of the character, and derive the edit cost of the character from its information entropy;
[0024] Step 3: When calculating the edit distance between object names, the edit cost of inserting or deleting a character equals the edit cost of that character; for a replacement, the edit cost is 0 when the two characters are the same, and otherwise is the sum of the edit costs of the two characters;
[0025] Step 4: Calculate the similarity between object names from the edit distance between them.
[0026] wherein, with d(n)(m) denoting the edit distance between two object names of string lengths n and m, the similarity between the two object names is similarity = 1.0 - d(n)(m) / (d(n)(0) + d(0)(m)).
[0027] wherein, with d(n)(m) denoting the edit distance between two object names of string lengths n and m, the similarity between the two object names is similarity = 1.0 - d(n)(m) / max(d(n)(0), d(0)(m)).
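The two normalizations in [0026] and [0027] can be sketched in Python as follows. The sketch assumes that the weighted distance d(n)(m) has already been computed and that d(n)(0) and d(0)(m) are the summed edit costs of all characters of each name, which is how the embodiment initializes the first column and first row of the matrix. The names `entropy_similarity`, `cost` and `use_max` are hypothetical, and returning 1.0 for a zero denominator is an assumption the patent does not address.

```python
def entropy_similarity(distance, name1, name2, cost, use_max=False):
    """Convert a weighted edit distance into a similarity score.

    cost    -- dict mapping each character to its edit cost (hypothetical input)
    use_max -- False: divide by d(n)(0) + d(0)(m); True: divide by their maximum
    """
    d_n0 = sum(cost[ch] for ch in name1)  # d(n)(0): total cost of name1
    d_0m = sum(cost[ch] for ch in name2)  # d(0)(m): total cost of name2
    denom = max(d_n0, d_0m) if use_max else d_n0 + d_0m
    if denom == 0:
        # Assumption: names made only of zero-cost characters count as identical.
        return 1.0
    return 1.0 - distance / denom


# Illustrative costs, not taken from the patent.
cost = {"a": 0.0, "b": 1.0, "c": 2.0}
# "ab" vs "ac": replacing b with c costs 1.0 + 2.0 = 3.0.
print(entropy_similarity(3.0, "ab", "ac", cost))                # 0.0
print(entropy_similarity(3.0, "ab", "ac", cost, use_max=True))  # -0.5
```

Because a replacement costs the sum of both characters' edit costs, the distance can exceed max(d(n)(0), d(0)(m)), so the max-normalized variant can fall below 0 in this sketch, whereas the sum-normalized variant stays within [0, 1].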
[0028] wherein the edit cost of a character = the information entropy of the character = log(totalNum/freq).
[0029] wherein the edit distance between object names is calculated with a dynamic programming method.
[0030] In summary, the information-entropy-based object name edit distance calculation method of the present invention improves the way the edit distance is calculated and more accurately reflects the absolute difference between two object name strings; the information-entropy-based object name matching method of the present invention can effectively recognize the similarity between two object names and handles the matching of name data more effectively.
[0031] BRIEF DESCRIPTION OF THE DRAWINGS
[0032] FIG. 1 is a flowchart of a preferred embodiment of the information-entropy-based object name edit distance calculation method of the present invention.
[0033] DETAILED DESCRIPTION
[0034] The technical solution of the present invention and its beneficial effects will become apparent from the following detailed description of specific embodiments, taken in conjunction with the accompanying drawing.
[0035] Refer to FIG. 1, which is a flowchart of a preferred embodiment of the information-entropy-based object name edit distance calculation method of the present invention. The method mainly includes:
[0036] Step 10: Collect all object names to be identified, and count the number of occurrences freq of each character and the total number of object names totalNum; a character that appears several times in one object name is counted only once for that name;
[0037] Step 20: For each character, calculate the information entropy of the character from the ratio between the total number of object names totalNum and the number of occurrences freq of the character, and derive the edit cost of the character from its information entropy;
[0038] The information entropy of a character can be calculated with the formula log(totalNum/freq), and the edit cost of a character can simply be set equal to its information entropy; the logarithm can take 2, e, or any other suitable constant as its base. In the present invention, the formula for the information entropy or the edit cost of a character can be chosen according to the following principle: the more frequently a character appears, the lower its information content and the lower its edit cost; conversely, a rarer character carries more information, has a higher edit cost, and is more valuable for distinguishing objects.
[0039] Through steps 10 and 20, the edit cost of every character can be calculated, which yields an edit cost table (costTable) for all characters; the table can be used to further calculate the edit distance and/or similarity of two object names.
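Steps 10 and 20 can be sketched in Python as follows. The function name `build_cost_table` and the sample names are hypothetical, and base 2 is only one of the bases the text allows.

```python
import math
from collections import Counter


def build_cost_table(names, base=2.0):
    """Return {character: log(totalNum / freq)} for a list of object names."""
    total_num = len(names)              # totalNum: number of object names
    freq = Counter()
    for name in names:
        # set(name) counts a character once per name, even if it repeats there.
        freq.update(set(name))
    return {ch: math.log(total_num / f, base) for ch, f in freq.items()}


# Hypothetical sample data: "a" occurs in every name, "c" and "d" in one each.
names = ["ab", "ac", "ad", "aab"]
cost_table = build_cost_table(names)
for ch in sorted(cost_table):
    print(ch, round(cost_table[ch], 3))
# a 0.0
# b 1.0
# c 2.0
# d 2.0
```

A character that occurs in every name gets cost log(1) = 0, so inserting or deleting it does not change the distance, while a character seen in a single name gets the maximum cost log(totalNum).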
[0040] Step 30: When calculating the edit distance between object names, the edit cost of inserting or deleting a character equals the edit cost of that character; for a replacement, the edit cost is 0 when the two characters are the same, and otherwise is the sum of the edit costs of the two characters.
[0041] The information-entropy-based object name edit distance of the present invention can be calculated with existing algorithms; for example, a dynamic programming method can be used to calculate the edit distance between object names. The present invention takes into account that each character of an object name carries a different weight within the whole name, and introduces the information entropy of characters into the edit distance calculation. Compared with the original edit distance calculation, the edit cost of inserting or deleting a character changes from 1 to the edit cost of that character, and for a replacement of two different characters the edit cost changes from 1 to the sum of the edit costs of the two characters. The final edit distance therefore more accurately reflects the absolute difference between two object name strings. Besides being used to further calculate the similarity of object names, the final edit distance can also replace the original edit distance in suitable applications.
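A minimal sketch of step 30 as a dynamic program, assuming a `cost` dictionary that stands in for the edit cost table and already contains every character of both names (which holds when all names to be identified were collected first). The function name and the cost values are illustrative, not taken from the patent.

```python
def weighted_edit_distance(name1, name2, cost):
    """Edit distance where each operation is weighted by per-character costs."""
    n, m = len(name1), len(name2)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):           # first column: all of name1[:i]
        d[i][0] = d[i - 1][0] + cost[name1[i - 1]]
    for j in range(1, m + 1):           # first row: all of name2[:j]
        d[0][j] = d[0][j - 1] + cost[name2[j - 1]]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c1, c2 = name1[i - 1], name2[j - 1]
            # Same characters cost 0; otherwise the sum of both edit costs.
            replace = 0.0 if c1 == c2 else cost[c1] + cost[c2]
            d[i][j] = min(d[i - 1][j] + cost[c1],      # drop c1 from name1
                          d[i][j - 1] + cost[c2],      # drop c2 from name2
                          d[i - 1][j - 1] + replace)   # replace or keep
    return d[n][m]


# Illustrative costs: "a" is ubiquitous (cost 0), "c" and "d" are rare.
cost = {"a": 0.0, "b": 1.0, "c": 2.0, "d": 2.0}
print(weighted_edit_distance("ab", "ac", cost))   # 3.0
print(weighted_edit_distance("ab", "aab", cost))  # 0.0
```

The second call shows the intended effect: an extra zero-cost character leaves the distance at 0, whereas the unit-cost edit distance for the same pair would be 1.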
[0042] The present invention accordingly also provides an information-entropy-based object name matching method, which calculates the similarity between object names from the information-entropy-based object name edit distance. The present invention takes into account that the characters of an object name carry different weights within the whole name: some characters are critical, while others are usually ignored in certain contexts. For example, in the institution name "Shenzhen Huaao Data Technology Co., Ltd." (深圳市华傲数据技术有限公司), the three characters of "Shenzhen" (深圳市) indicate the region where the company is located; when similarities are calculated among a batch of institution names within one particular region (for example, when identifying all companies in Guangdong Province), these three characters are usually irrelevant. "Huaao" (华傲) is the most critical part of the name; "Data Technology" (数据技术) indicates the category of the company and has some reference value; "Co., Ltd." (有限公司) indicates the legal form of the company and is usually also irrelevant in a comparison. Names therefore need to be compared with a distinct weight for each character. The solution of the present invention calculates similarity on the basis of the edit distance while using the information entropy of each character: the greater a character's information entropy, the greater its edit distance contribution (edit cost).
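To illustrate how the weights described above could arise, the following Python sketch derives per-character costs from a small corpus. Only the first name comes from the text; the other three names are hypothetical sample data chosen so that every name shares the region prefix and the legal-form suffix, and base 2 is an assumed choice of logarithm.

```python
import math
from collections import Counter

# Hypothetical corpus of institution names from one city.
names = [
    "深圳市华傲数据技术有限公司",
    "深圳市宏业汽车配件有限公司",
    "深圳市久晟数据技术有限公司",
    "深圳市南山建筑有限公司",
]

# Count each character once per name, then apply log2(totalNum / freq).
freq = Counter(ch for name in names for ch in set(name))
cost = {ch: math.log2(len(names) / f) for ch, f in freq.items()}

for part in ("深圳市", "华傲", "数据技术", "有限公司"):
    print(part, [cost[ch] for ch in part])
# 深圳市 [0.0, 0.0, 0.0]
# 华傲 [2.0, 2.0]
# 数据技术 [1.0, 1.0, 1.0, 1.0]
# 有限公司 [0.0, 0.0, 0.0, 0.0]
```

With this sample, the region and legal-form characters cost nothing to insert or delete, the category characters carry a moderate cost, and the distinctive characters carry the highest cost, which mirrors the ranking given in the paragraph above.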
[0043] 参照利用动态规划算法来计算原始编辑距离的过程, 下面通过伪代码来具体描 述本发明基于信息熵的对象名称编辑距离计算方法及对象名称匹配方法。 两个 对象名称的编辑距离及相似度计算方法如下: [0043] Referring to the process of calculating the original editing distance by using the dynamic programming algorithm, the object entropy-based object name editing distance calculation method and the object name matching method of the present invention are specifically described by pseudo code. The editing distance and similarity of two object names are calculated as follows:
[0044] 1. [0044] 1.
设对象名称 l(namel)的长度为 n,对象名称 2(name2)的长度为 m, 初始化一个 n+1行 Let the object name l(namel) be n, and the object name 2 (name2) be m in length, initialize an n+1 line
, m+1列的矩阵: d[n+l][m+l] ; 设定 d[0][0]为 0; , m+1 column matrix: d[n+l][m+l] ; set d[0][0] to 0;
[0045] Here d(n)(m), i.e. d(n, m), denotes the edit distance between two object names whose string lengths are n and m;
[0046] 2. The first column of the matrix is calculated as:
[0047] for (i <- 1 to n) {
[0048] d(i)(0) = d(i - 1)(0) + insertCost(name1.charAt(i - 1))
[0049] }
[0050] where insertCost(name1.charAt(i - 1)) is the cost of inserting the character at index i-1 of object name 1 (indices start at 0), i.e. the edit cost of that character, which can be looked up in the edit cost table. d(i)(0) is the total edit cost of inserting the first i characters (indices 0 to i-1) of object name 1;
[0051] 3. The first row of the matrix is calculated as:
[0052] for (j <- 1 to m) {
[0053] d(0)(j) = d(0)(j - 1) + insertCost(name2.charAt(j - 1))
[0054] }
[0055] where insertCost(name2.charAt(j - 1)) is the edit cost of the character at index j-1 of object name 2;
[0056] 4. Calculate the maximum possible edit cost for the two object names: maxCost = d(n)(0) + d(0)(m);
[0057] 5. Calculate the remaining entries of d. For each row i and each column j, d(i)(j) is calculated as follows:
[0058] a. Calculate the cost of deleting the character at position i-1 of name1:
[0059] delCosti = d(i - 1)(j) + delCost(name1.charAt(i - 1))
[0060] b. Calculate the cost of deleting the character at position j-1 of name2:
[0061] delCostj = d(i)(j - 1) + delCost(name2.charAt(j - 1))
[0062] c. Calculate the cost of substituting the character at position i-1 of name1 with the character at position j-1 of name2:
[0063] change = d(i - 1)(j - 1) + changeCost(name1.charAt(i - 1), name2.charAt(j - 1))
[0064] where the cost of substituting two characters is 0 if the two characters are the same, and otherwise the sum of the edit costs of the two characters.
[0065] d. d(i)(j) takes the minimum of the three costs above:
[0066] d(i)(j) = min3(delCosti, delCostj, change)
[0067] 6. Once all values of d(i)(j) have been calculated, d(n)(m) is the minimum cost of editing between name1 and name2. The higher the edit cost, the less similar the two names are. The similarity between name1 and name2 can therefore be expressed as:
[0068] similarity = 1.0 - d(n)(m) / maxCost, i.e. similarity = 1.0 - d(n)(m) / (d(n)(0) + d(0)(m)).
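Steps 1 to 6 above can be sketched in Python as follows. This is an illustrative sketch rather than the patent's reference implementation: the names `entropy_edit_distance` and `similarity` are hypothetical, `cost` is assumed to be a dict that already holds an edit cost for every character of both names, and returning 1.0 when maxCost is 0 is an assumption, since the text does not define that case.

```python
def entropy_edit_distance(name1, name2, cost):
    """Return (d(n)(m), maxCost) for two names and a per-character cost table."""
    n, m = len(name1), len(name2)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]   # step 1: d[0][0] = 0
    for i in range(1, n + 1):                     # step 2: first column
        d[i][0] = d[i - 1][0] + cost[name1[i - 1]]
    for j in range(1, m + 1):                     # step 3: first row
        d[0][j] = d[0][j - 1] + cost[name2[j - 1]]
    max_cost = d[n][0] + d[0][m]                  # step 4
    for i in range(1, n + 1):                     # step 5
        for j in range(1, m + 1):
            a, b = name1[i - 1], name2[j - 1]
            del_i = d[i - 1][j] + cost[a]
            del_j = d[i][j - 1] + cost[b]
            # Substitution costs 0 for equal characters, else the sum of both costs.
            change = d[i - 1][j - 1] + (0.0 if a == b else cost[a] + cost[b])
            d[i][j] = min(del_i, del_j, change)
    return d[n][m], max_cost                      # step 6

def similarity(name1, name2, cost):
    dist, max_cost = entropy_edit_distance(name1, name2, cost)
    # Assumption: names whose characters all cost 0 are treated as identical.
    return 1.0 - dist / max_cost if max_cost > 0 else 1.0

# Hypothetical cost table: "x" is a ubiquitous character with zero cost.
cost = {"x": 0.0, "a": 1.0, "b": 1.0}
```

With this table, `similarity("xa", "a", cost)` is 1.0 because dropping the zero-cost "x" is free, while `similarity("xa", "xb", cost)` is 0.0 because the shared "x" carries no weight and the distinguishing characters differ. Since a substitution costs the same as one deletion plus one insertion, d(n)(m) never exceeds maxCost, so this similarity stays within [0, 1].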
[0069] The similarity between name1 and name2 can also be expressed as similarity = 1.0 - d(n)(m) / max(d(n)(0), d(0)(m)), or calculated with another suitable similarity formula.
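The two normalizations can be compared side by side. The helper names and the numeric inputs below are hypothetical; `dist` stands for d(n)(m), `total1` for d(n)(0) and `total2` for d(0)(m).

```python
def similarity_sum(dist, total1, total2):
    """1.0 - d(n)(m) / (d(n)(0) + d(0)(m))"""
    return 1.0 - dist / (total1 + total2)

def similarity_max(dist, total1, total2):
    """1.0 - d(n)(m) / max(d(n)(0), d(0)(m))"""
    return 1.0 - dist / max(total1, total2)

# Hypothetical values with d(n)(0) = 3.0 and d(0)(m) = 5.0.
partial = (2.0, 3.0, 5.0)    # the names share some characters
disjoint = (8.0, 3.0, 5.0)   # no shared characters: d(n)(m) = 3.0 + 5.0
```

One consequence worth noting: when two names share no characters, d(n)(m) equals d(n)(0) + d(0)(m), so the sum form returns exactly 0 while the max form drops below 0. A caller using the max form may want to clamp the result at 0; that clamping is a suggestion here, not something the text prescribes.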
[0070] With the above similarity formulas, the base of the logarithm in log(totalNum/freq) cancels out during the calculation by the change-of-base formula, so the choice of base does not affect the calculated similarity.
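The cancellation can be written out explicitly. The subscript notation below is introduced only for this illustration: cost_b, d_b and maxCost_b denote quantities computed with logarithm base b > 1, and the subscript e denotes the natural logarithm.

$$
\mathrm{cost}_b(c) = \log_b\frac{\mathrm{totalNum}}{\mathrm{freq}(c)} = \frac{1}{\ln b}\,\ln\frac{\mathrm{totalNum}}{\mathrm{freq}(c)} = \frac{1}{\ln b}\,\mathrm{cost}_e(c)
$$

Every entry of the matrix is a minimum over sums of character costs, and scaling all costs by the same positive factor 1/ln b preserves which candidate is the minimum, so d_b(i, j) = d_e(i, j) / ln b for all i, j. Hence

$$
\frac{d_b(n,m)}{\mathrm{maxCost}_b} = \frac{\tfrac{1}{\ln b}\,d_e(n,m)}{\tfrac{1}{\ln b}\,\mathrm{maxCost}_e} = \frac{d_e(n,m)}{\mathrm{maxCost}_e}
$$

and the same argument applies when the denominator is max(d(n)(0), d(0)(m)).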
[0071] This completes the calculation of the similarity between the two object names.
[0072] The information-entropy-based object name edit distance calculation method and object name matching method of the present invention are suitable for all kinds of object names, in particular institution names, drug names and building names, and are preferably applied to matching names of a single type, for example where the data to be identified are all institution names, all drug names or all building names. Object names may contain Chinese characters, English characters, characters of other languages, or other symbols.
[0073] Experiments show that, compared with calculating similarity from the original edit distance, the present invention gives clearly better results, for example:
[0074] 1. For "天津市南开区宏业汽车配件经营部" (Hongye Auto Parts Business Department, Nankai District, Tianjin) and "天津市南开区久晟汽车配件经营部" (Jiusheng Auto Parts Business Department, Nankai District, Tianjin), the similarity based on the original edit distance is 0.867, while the value calculated by this method is 0.59; this method better distinguishes them as different enterprises;
[0075] 2. For "天津市南开区星辰计算机耗材经营部" (Xingchen Computer Consumables Business Department, Nankai District, Tianjin) and "天津市南开区顺惟计算机耗材经营部" (Shunwei Computer Consumables Business Department, Nankai District, Tianjin), the similarity based on the original edit distance is 0.875, while the value calculated by this method is 0.576, which is likewise more discriminating;
[0076] 3. For "南开区天诚医药保健品研究所" (Tiancheng Medicine and Health Products Research Institute, Nankai District) and "天津市南开区天诚医药保健品研究所" (Tiancheng Medicine and Health Products Research Institute, Nankai District, Tianjin), the similarity based on the original edit distance is 0.8125, while the value calculated by this method is 0.998; this method better reveals that they represent the same enterprise.
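The baseline figures quoted in these three examples can be reproduced with a plain unit-cost edit distance. The text does not state how the original edit distance is normalized; dividing by the longer name's length is an assumption made here because it matches the reported values. The entropy-based figures (0.59, 0.576, 0.998) depend on the authors' corpus and cannot be reproduced from the text. The function names are hypothetical.

```python
def levenshtein(s, t):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # substitute or match
        prev = curr
    return prev[-1]

def baseline_similarity(s, t):
    # Assumption: normalize by the length of the longer name.
    return 1.0 - levenshtein(s, t) / max(len(s), len(t))

pairs = [
    ("天津市南开区宏业汽车配件经营部", "天津市南开区久晟汽车配件经营部"),
    ("天津市南开区星辰计算机耗材经营部", "天津市南开区顺惟计算机耗材经营部"),
    ("南开区天诚医药保健品研究所", "天津市南开区天诚医药保健品研究所"),
]
baseline = [round(baseline_similarity(s, t), 4) for s, t in pairs]
```

Under that assumption the three pairs give 2/15, 2/16 and 3/16 as normalized distances. This shows why the unit-cost baseline cannot separate the cases: two differing key characters and three missing region characters look almost equally similar.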
[0077] In summary, the information-entropy-based object name edit distance calculation method of the present invention improves how the edit distance is calculated and reflects the absolute difference between two object name strings more accurately; the information-entropy-based object name matching method of the present invention can effectively determine the similarity between two object names and performs better on name-type data matching problems.
[0078] The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims
[Claim 1] An information-entropy-based object name edit distance calculation method, characterized by comprising:
Step 10: collecting all object names to be identified, and counting the number of occurrences freq of each character and the total number of object names totalNum, where a character that occurs several times in one object name is counted once;
Step 20: for each character, calculating the information entropy of the character from the ratio between the total number of object names totalNum and the number of occurrences freq of the character, and obtaining the edit cost of the character from its information entropy;
Step 30: when calculating the edit distance of object names, the edit cost of inserting or deleting a character equals the edit cost of that character; for a substitution, the edit cost is 0 when the two characters are the same, and otherwise the sum of the edit costs of the two characters.
[Claim 2] The information-entropy-based object name edit distance calculation method according to claim 1, characterized in that the edit cost of a character = the information entropy of the character = log(totalNum/freq).
[Claim 3] The information-entropy-based object name edit distance calculation method according to claim 1, characterized in that a dynamic programming method is used to calculate the edit distance between object names.
[Claim 4] The information-entropy-based object name edit distance calculation method according to claim 1, characterized in that the object name is an institution name, a drug name or a building name.
[Claim 5] The information-entropy-based object name edit distance calculation method according to claim 1, characterized in that the object name contains Chinese characters or English characters.
[Claim 6] An information-entropy-based object name matching method, characterized by comprising:
Step 1: collecting all object names to be identified, and counting the number of occurrences freq of each character and the total number of object names totalNum, where a character that occurs several times in one object name is counted once;
Step 2: for each character, calculating the information entropy of the character from the ratio between the total number of object names totalNum and the number of occurrences freq of the character, and obtaining the edit cost of the character from its information entropy;
Step 3: when calculating the edit distance of object names, the edit cost of inserting or deleting a character equals the edit cost of that character; for a substitution, the edit cost is 0 when the two characters are the same, and otherwise the sum of the edit costs of the two characters;
Step 4: calculating the similarity between object names from the edit distance between them.
[Claim 7] The information-entropy-based object name matching method according to claim 6, characterized in that, with d(n)(m) denoting the edit distance between two object names whose string lengths are n and m, the similarity between the two object names is similarity = 1.0 - d(n)(m) / (d(n)(0) + d(0)(m)).
[Claim 8] The information-entropy-based object name matching method according to claim 6, characterized in that, with d(n)(m) denoting the edit distance between two object names whose string lengths are n and m, the similarity between the two object names is similarity = 1.0 - d(n)(m) / max(d(n)(0), d(0)(m)).
[Claim 9] The information-entropy-based object name matching method according to claim 6, characterized in that the edit cost of a character = the information entropy of the character = log(totalNum/freq).
[Claim 10] The information-entropy-based object name matching method according to claim 6, characterized in that a dynamic programming method is used to calculate the edit distance between object names.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510047831.9A CN104572627B (en) | 2015-01-30 | 2015-01-30 | Object oriented editing distance computational methods and matching process based on comentropy |
CN2015100478319 | 2015-01-30 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016119507A1 true WO2016119507A1 (en) | 2016-08-04 |
Family
ID=53088731
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2015/094370 WO2016119507A1 (en) | 2015-01-30 | 2015-11-12 | Object name edit distance calculating method and matching method based on information entropy |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN104572627B (en) |
WO (1) | WO2016119507A1 (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104572627B (en) * | 2015-01-30 | 2018-01-23 | 深圳市华傲数据技术有限公司 | Object oriented editing distance computational methods and matching process based on comentropy |
CN104899189B (en) * | 2015-05-27 | 2017-11-28 | 深圳市华傲数据技术有限公司 | Object oriented matching process based on comentropy |
CN105184713A (en) * | 2015-07-17 | 2015-12-23 | 四川久远银海软件股份有限公司 | Intelligent matching and sorting system and method capable of benefitting contrast of assigned drugs of medical insurance |
CN105335899A (en) * | 2015-11-11 | 2016-02-17 | 国网山东省电力公司德州供电公司 | Intelligent power line naming system |
CN107220334A (en) * | 2017-05-25 | 2017-09-29 | 北京小度信息科技有限公司 | Similarity calculating method, device and the equipment of name of firm |
CN108874756B (en) * | 2018-06-29 | 2022-05-20 | 广东智媒云图科技股份有限公司 | Verification code optimization method |
CN111261165B (en) * | 2020-01-13 | 2023-05-16 | 佳都科技集团股份有限公司 | Station name recognition method, device, equipment and storage medium |
CN113515933A (en) * | 2021-09-13 | 2021-10-19 | 中国电力科学研究院有限公司 | Power primary and secondary equipment fusion processing method, system, equipment and storage medium |
CN117573943B (en) * | 2024-01-11 | 2024-05-28 | 云筑信息科技(成都)有限公司 | Data comparison method based on serialization similarity calculation |
-
2015
- 2015-01-30 CN CN201510047831.9A patent/CN104572627B/en active Active
- 2015-11-12 WO PCT/CN2015/094370 patent/WO2016119507A1/en active Application Filing
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2008095153A2 (en) * | 2007-02-01 | 2008-08-07 | Tegic Communications, Inc. | Spell-check for a keyboard system with automatic correction |
US20130080164A1 (en) * | 2011-09-28 | 2013-03-28 | Google Inc. | Selective Feedback For Text Recognition Systems |
CN102929930A (en) * | 2012-09-24 | 2013-02-13 | 南京大学 | Automatic Web text data extraction template generating and extracting method for small samples |
CN103020022A (en) * | 2012-11-20 | 2013-04-03 | 北京航空航天大学 | Chinese unregistered word recognition system and method based on improvement information entropy characteristics |
CN103399907A (en) * | 2013-07-31 | 2013-11-20 | 深圳市华傲数据技术有限公司 | Method and device for calculating similarity of Chinese character strings on the basis of edit distance |
CN104572627A (en) * | 2015-01-30 | 2015-04-29 | 深圳市华傲数据技术有限公司 | Object name editing distance calculating method and object name editing distance matching method based on information entropy |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110097040A (en) * | 2018-01-31 | 2019-08-06 | 精工爱普生株式会社 | Image processing apparatus and storage medium |
CN110097040B (en) * | 2018-01-31 | 2023-07-04 | 精工爱普生株式会社 | Image processing apparatus and storage medium |
CN110781876A (en) * | 2019-10-15 | 2020-02-11 | 北京工业大学 | Visual feature-based counterfeit domain name lightweight detection method and system |
CN110781876B (en) * | 2019-10-15 | 2023-11-24 | 北京工业大学 | Method and system for detecting light weight of counterfeit domain name based on visual characteristics |
Also Published As
Publication number | Publication date |
---|---|
CN104572627B (en) | 2018-01-23 |
CN104572627A (en) | 2015-04-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 15879705 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 15879705 Country of ref document: EP Kind code of ref document: A1 |