WO2015014287A1 - Method and apparatus for calculating similarity between Chinese character strings based on editing distance

Info

Publication number
WO2015014287A1
Authority
WO
WIPO (PCT)
Prior art keywords
similarity
chinese
string
corner
edit
Prior art date
Application number
PCT/CN2014/083326
Other languages
French (fr)
Chinese (zh)
Inventor
王平
贾西贝
Original Assignee
深圳市华傲数据技术有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市华傲数据技术有限公司
Publication of WO2015014287A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/53: Processing of non-Latin text

Definitions

  • The present invention relates to the field of Chinese string similarity, and more particularly to a method and apparatus for calculating Chinese string similarity based on edit distance.
  • Comparing the similarity of Chinese strings is a common technique in string matching, text comparison, and information extraction.
  • Different applications use different techniques for Chinese string similarity. Common ones include matching algorithms based on edit distance, matching algorithms based on glyph and pronunciation, and the Smith-Waterman algorithm.
  • In the edit distance algorithm, the similarity of two strings is found in three steps: first, the edit distance matrix is constructed; second, the matrix cell values are calculated from left to right and from top to bottom; third, the value of the bottom-right cell is the edit distance of the two strings.
  • The algorithm is well suited to spelling errors and is easy to implement and use.
  • In the glyph- and pronunciation-based algorithm, the steps are: first, the glyph similarity between the strings is calculated using a glyph encoding, the Wubi (five-stroke) code; second, the pronunciation rules of Chinese initials and finals are used to calculate the initial similarity and final similarity of the strings, and, taking into account the fuzzy sounds common in dialects or Mandarin, the pronunciation similarity between the strings is calculated.
  • To improve accuracy, many applications use the edit distance algorithm together with the algorithm based on Chinese character glyph and pronunciation.
  • The Smith-Waterman algorithm is an improved version of the edit distance algorithm. The improvement is that matrix cell values are calculated by applying a separate compensation to each of the delete, insert, and replace operations.
  • It is currently a widely used sequence similarity comparison algorithm, suited to finding locally similar sequence pairs.
  • An embodiment of the present invention provides a method and apparatus for calculating Chinese string similarity based on edit distance.
  • The invention converts the Chinese characters in a string into four-corner codes using four-corner number coding, calculates the similarity of the Chinese characters from the edit distance between those codes, and then uses the character similarity in place of the edit distance weight to calculate the similarity of the strings.
  • Converting Chinese characters into digit strings for comparison improves the precision of character matching; using the character similarity in place of the edit distance weight makes the edit distance algorithm practical for Chinese string matching in a Chinese-language environment and improves the accuracy of the matching results.
  • An embodiment of the present invention provides a method for calculating the similarity of Chinese strings based on edit distance.
  • The method first calculates the similarity of the Chinese characters in the strings to be compared, and then calculates the similarity of the Chinese strings to be compared.
  • Calculating the similarity of the Chinese characters comprises the following steps: converting the Chinese characters into four-corner codes using four-corner numbers; and calculating the similarity of the Chinese characters using edit distance.
  • Calculating the similarity of the Chinese strings to be compared comprises the following steps: using the similarity of the Chinese characters in place of the edit distance weight; and
  • calculating the similarity of the strings to be compared based on the improved edit distance.
  • Preferably, converting the Chinese characters into four-corner codes comprises: pre-establishing a rule table for the four-corner number lookup method; and converting the Chinese characters in the strings to be compared into the corresponding four-corner codes according to the rule table.
  • Another embodiment of the present invention provides an apparatus for calculating Chinese string similarity based on edit distance, the apparatus comprising: a Chinese character similarity calculation device, configured to calculate Chinese character similarity from edit distance; and a Chinese string similarity calculation device, configured to obtain the similarity of the strings to be compared.
  • The Chinese character similarity calculation device includes a four-corner number rule device and a four-corner code conversion device.
  • The four-corner number rule device is configured to pre-establish a rule table for the four-corner number lookup method; the four-corner code conversion device converts the Chinese characters in the strings to be compared into four-corner codes according to the rule information in the four-corner number rule device.
  • The Chinese character similarity calculation device obtains the digit strings from the four-corner code conversion device and calculates the similarity of the Chinese characters based on edit distance.
  • The Chinese string similarity calculation device includes an edit distance weight device; the edit distance weight device obtains the character similarity information from the Chinese character similarity calculation device and sets it as the edit distance weight.
  • The Chinese character similarity calculation device obtains the weight information from the edit distance weight device and calculates the similarity of the strings based on the improved edit distance.
  • FIG. 1 is a schematic flowchart of a method for calculating Chinese string similarity based on edit distance according to an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of calculating Chinese character similarity using edit distance according to an embodiment of the present invention.
  • Embodiments of the present invention provide a method and apparatus for calculating Chinese string similarity based on edit distance.
  • The invention converts the Chinese characters in a string into four-corner codes using four-corner number coding, calculates the similarity of the Chinese characters from the edit distance between those codes, and then uses the character similarity in place of the edit distance weight to calculate the similarity of the strings.
  • Converting Chinese characters into digit strings for comparison improves the precision of Chinese character matching.
  • Using the character similarity in place of the edit distance weight makes the edit distance algorithm practical for Chinese string matching in a Chinese-language environment and improves the accuracy of the matching results.
  • FIG. 1 is a schematic flowchart of a method for calculating Chinese string similarity based on edit distance, implemented by an embodiment of the present invention.
  • The method first calculates the similarity of the Chinese characters in the strings to be compared, and then calculates the similarity of the Chinese strings to be compared. It includes the following detailed steps:
  • Step S110: Convert the Chinese characters into four-corner codes using four-corner numbers.
  • The embodiments of the present invention use four-corner numbers and the edit distance algorithm.
  • The four-corner number is one of the character lookup methods commonly used in Chinese dictionaries; it classifies Chinese characters using up to five Arabic digits.
  • The four-corner lookup method uses the digits 0 to 9 to represent ten stroke shapes at the four corners of a Chinese character, and sometimes appends one supplementary digit.
  • Edit distance is the minimum number of edit operations required to transform one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.
  • The rule table of the four-corner lookup method is established in advance, following its mnemonic: 横一垂二三点捺，叉四插五方块六，七角八八九是小，点下有横变零头 (roughly: horizontal stroke 1, vertical stroke 2, dot or right-falling stroke 3; cross 4, pierced 5, box 6; corner 7, 八 shape 8, 小 shape 9; a dot above a horizontal stroke is 0).
  • The Chinese characters in the strings to be compared are then obtained.
  • In Example 1, the Chinese string s1 to be compared is "华傲数据" (Huaao Data) and the Chinese string s2 is "中华骄傲" (Chinese Pride).
  • With four-corner coding, Chinese characters can be converted into a form represented by four-corner numbers.
  • For example, 华 has the four-corner number 24401 and 花 has 44214. Table 1 (four-corner numbers of the characters): 华 24401, 傲 28240, 数 98440, 据 57064, 骄 72128.
  • Step S120: Calculate the similarity of the Chinese characters using edit distance.
  • edit(1, 1) is the edit distance between the string "华" and the string "中".
  • edit(i, j) = min{ edit(i-1, j) + 1, edit(i, j-1) + 1, edit(i-1, j-1) + f(i, j) }, where f(i, j) is the edit distance between 华 and 中.
  • The edit distance algorithm is applied to the four-corner codes, which gives the edit distance of the corresponding Chinese characters.
  • Step S130: Use the similarity of the Chinese characters in place of the edit distance weight.
  • Step S140: Calculate the similarity of the strings to be compared based on the improved edit distance.
  • From step S120, the value of f(i, j) is 0.8; that is, the edit distance between 华 and 中 is 0.8.
  • Table 5: Edit distance calculation for the Chinese strings "华傲数据" and "中华骄傲".
  • Another embodiment of the present invention provides an apparatus for calculating Chinese string similarity based on edit distance, the apparatus comprising: a Chinese character similarity calculation device, configured to calculate Chinese character similarity from edit distance; and a Chinese string similarity calculation device, configured to obtain the similarity of the strings to be compared.
  • The Chinese character similarity calculation device includes a four-corner number rule device and a four-corner code conversion device; the four-corner number rule device is configured to pre-establish a rule table for the four-corner lookup method, following its mnemonic: 横一垂二三点捺，叉四插五方块六，七角八八九是小，点下有横变零头.
  • In Example 1, the Chinese string s1 is "华傲数据" (Huaao Data) and the Chinese string s2 is "中华骄傲" (Chinese Pride).
  • The four-corner code conversion device converts the Chinese characters in the strings to be compared into four-corner codes according to the rule information in the four-corner number rule device. For the strings s1 and s2 of Example 1, for instance, 华 has the four-corner number 24401; 花 has the four-corner number 44214.
  • edit(1, 1) is the edit distance between the string "华" and the string "中".
  • edit(i, j) = min{ edit(i-1, j) + 1, edit(i, j-1) + 1, edit(i-1, j-1) + f(i, j) }, where f(i, j) is the edit distance between 华 and 中.
  • The edit distance algorithm is applied to the four-corner codes, which gives the edit distance of the corresponding Chinese characters.
  • The Chinese string similarity calculation device includes an edit distance weight device; the edit distance weight device obtains the character similarity information from the Chinese character similarity calculation device and sets it as the edit distance weight; the Chinese character similarity calculation device obtains the weight information from the edit distance weight device and calculates the similarity of the strings based on the improved edit distance.
  • The value of f(i, j) is 0.8; that is, the edit distance between 华 and 中 is 0.8.
  • Embodiments of the present invention provide a method and apparatus for calculating Chinese string similarity based on edit distance.
  • The invention converts the Chinese characters in a string into four-corner codes using four-corner number coding, calculates the similarity of the Chinese characters from the edit distance between those codes, and then uses the character similarity in place of the edit distance weight to calculate the similarity of the strings.
  • Converting Chinese characters into digit strings for comparison improves the precision of character matching; using the character similarity in place of the edit distance weight makes the edit distance algorithm practical for Chinese string matching in a Chinese-language environment and improves the accuracy of the matching results.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Document Processing Apparatus (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

In a method for calculating a similarity between Chinese character strings based on an editing distance, a similarity between Chinese characters in to-be-compared character strings is first calculated, and then a similarity between the to-be-compared Chinese character strings is calculated. In this method, Chinese characters in character strings are converted into four-corner code by using four-corner coding, so that a similarity between the Chinese characters is calculated based on an editing distance; and on this basis, a weight of the editing distance is replaced with the similarity between the Chinese characters to calculate a similarity between the character strings. In this method, Chinese characters are converted into numeric strings for comparison, to improve the precision of matching of the Chinese characters; and a weight of an editing distance is replaced with a similarity between the Chinese characters to calculate a similarity between character strings, so as to implement the practicability of an editing distance algorithm in matching of the Chinese character strings in the environment of the Chinese language and improve the accuracy of matching results. In addition, further provided is an apparatus for calculating a similarity between Chinese character strings based on an editing distance.

Description

Method and apparatus for calculating Chinese string similarity based on edit distance
Technical Field
The present invention relates to the field of Chinese string similarity, and more particularly to a method and apparatus for calculating Chinese string similarity based on edit distance.
Background
Comparing the similarity of Chinese strings is a common technique in string matching, text comparison, and information extraction. Different applications use different techniques for Chinese string similarity. Common ones include matching algorithms based on edit distance, matching algorithms based on glyph and pronunciation, and the Smith-Waterman algorithm.
In the edit distance algorithm, the similarity of two strings is found in three steps: first, the edit distance matrix is constructed; second, the matrix cell values are calculated from left to right and from top to bottom; third, the value of the bottom-right cell is the edit distance of the two strings. The algorithm is well suited to spelling errors and is easy to implement and use.
In the glyph- and pronunciation-based algorithm, the steps are: first, the glyph similarity between the strings is calculated using a glyph encoding, the Wubi (five-stroke) code; second, the pronunciation rules of Chinese initials and finals are used to calculate the initial similarity and final similarity of the strings, and, taking into account the fuzzy sounds common in dialects or Mandarin, the pronunciation similarity between the strings is calculated. To improve accuracy, many applications use the edit distance algorithm together with the algorithm based on Chinese character glyph and pronunciation. The Smith-Waterman algorithm is an improved version of the edit distance algorithm; the improvement is that matrix cell values are calculated by applying a separate compensation to each of the delete, insert, and replace operations. It is currently a widely used sequence similarity comparison algorithm, suited to finding locally similar sequence pairs.
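As an illustration of the local-alignment idea mentioned above, the following is a minimal Smith-Waterman scoring sketch. The function name and the scoring values (match +2, mismatch -1, gap penalty 1) are assumptions chosen for the example; the text does not specify any scoring scheme.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = 1) -> int:
    """Return the best local alignment score between a and b.

    The scoring values are illustrative assumptions, not taken from the text.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # h[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = h[i - 1][j] - gap      # gap in b (deletion from a)
            left = h[i][j - 1] - gap    # gap in a (insertion into a)
            # Scores never drop below zero, which is what makes the alignment local.
            h[i][j] = max(0, diag, up, left)
            best = max(best, h[i][j])
    return best
```

Unlike the edit distance, which reads its result from the bottom-right cell, this score is the maximum over all cells, so a shared region is found even when the surrounding text differs.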
For the specific problem of Chinese string matching in a Chinese-language environment, the classical edit-distance-based string similarity method becomes less practical. The glyph-based algorithm does take glyphs into account, but only through the Wubi (five-stroke) code of the Chinese characters.
Summary of the Invention
To address one of the above shortcomings, an embodiment of the present invention provides a method and apparatus for calculating Chinese string similarity based on edit distance. The invention converts the Chinese characters in a string into four-corner codes using four-corner number coding, calculates the similarity of the Chinese characters from the edit distance between those codes, and then uses the character similarity in place of the edit distance weight to calculate the similarity of the strings. Converting Chinese characters into digit strings for comparison improves the precision of character matching; using the character similarity in place of the edit distance weight makes the edit distance algorithm practical for Chinese string matching in a Chinese-language environment and improves the accuracy of the matching results.
Accordingly, an embodiment of the present invention provides a method for calculating the similarity of Chinese strings based on edit distance. The method first calculates the similarity of the Chinese characters in the strings to be compared, and then calculates the similarity of the Chinese strings to be compared. Calculating the similarity of the Chinese characters comprises the following steps:
converting the Chinese characters into four-corner codes using four-corner numbers; calculating the similarity of the Chinese characters using edit distance.
Calculating the similarity of the Chinese strings to be compared comprises the following steps:
using the similarity of the Chinese characters in place of the edit distance weight;
calculating the similarity of the strings to be compared based on the improved edit distance.
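The two stages listed above can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: it assumes that "using the character similarity in place of the edit distance weight" means the substitution cost f(i, j) between two characters is the edit distance of their four-corner codes divided by the code length (which gives 0.8 for 华 and 中 in the worked example). The names `FOUR_CORNER`, `levenshtein`, `char_cost`, and `weighted_edit_distance` are hypothetical, and the code table holds only the characters of Example 1.

```python
# Four-corner codes for the characters of Example 1 (values from the description).
FOUR_CORNER = {
    "华": "24401", "傲": "28240", "数": "98440",
    "据": "57064", "中": "50006", "骄": "72128",
}

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete
                           cur[j - 1] + 1,             # insert
                           prev[j - 1] + (ca != cb)))  # replace or keep
        prev = cur
    return prev[-1]

def char_cost(c1: str, c2: str) -> float:
    """Assumed substitution weight: code edit distance over code length."""
    code1, code2 = FOUR_CORNER[c1], FOUR_CORNER[c2]
    return levenshtein(code1, code2) / max(len(code1), len(code2))

def weighted_edit_distance(s1: str, s2: str, cost=char_cost) -> float:
    """Edit distance over characters, with cost(c1, c2) as the replace weight."""
    prev = [float(j) for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        cur = [float(i)]
        for j, c2 in enumerate(s2, start=1):
            cur.append(min(prev[j] + 1.0,
                           cur[j - 1] + 1.0,
                           prev[j - 1] + cost(c1, c2)))
        prev = cur
    return prev[-1]
```

For example, `weighted_edit_distance("华傲数据", "中华骄傲")` compares the two strings of Example 1. Because every substitution weight is at most 1, the result never exceeds the plain edit distance of the two strings.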
Preferably, converting the Chinese characters into four-corner codes comprises: pre-establishing a rule table for the four-corner number lookup method; and converting the Chinese characters in the strings to be compared into the corresponding four-corner codes according to the rule table.
Another embodiment of the present invention provides an apparatus for calculating Chinese string similarity based on edit distance, the apparatus comprising: a Chinese character similarity calculation device, configured to calculate Chinese character similarity from edit distance; and a Chinese string similarity calculation device, configured to obtain the similarity of the strings to be compared.
Preferably, the Chinese character similarity calculation device includes a four-corner number rule device and a four-corner code conversion device;
the four-corner number rule device is configured to pre-establish a rule table for the four-corner number lookup method; the four-corner code conversion device converts the Chinese characters in the strings to be compared into four-corner codes according to the rule information in the four-corner number rule device;
the Chinese character similarity calculation device obtains the digit strings from the four-corner code conversion device and calculates the similarity of the Chinese characters based on edit distance.
Preferably, the Chinese string similarity calculation device includes an edit distance weight device; the edit distance weight device obtains the character similarity information from the Chinese character similarity calculation device and sets it as the edit distance weight;
the Chinese character similarity calculation device obtains the weight information from the edit distance weight device and calculates the similarity of the strings based on the improved edit distance.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for calculating Chinese string similarity based on edit distance according to an embodiment of the present invention.
FIG. 2 is a schematic flowchart of calculating Chinese character similarity using edit distance according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
Embodiments of the present invention provide a method and apparatus for calculating Chinese string similarity based on edit distance. The invention converts the Chinese characters in a string into four-corner codes using four-corner number coding, calculates the similarity of the Chinese characters from the edit distance between those codes, and then uses the character similarity in place of the edit distance weight to calculate the similarity of the strings. Converting Chinese characters into digit strings for comparison improves the precision of character matching; using the character similarity in place of the edit distance weight makes the edit distance algorithm practical for Chinese string matching in a Chinese-language environment and improves the accuracy of the matching results.
FIG. 1 is a schematic flowchart of a method for calculating Chinese string similarity based on edit distance, implemented by an embodiment of the present invention. The method first calculates the similarity of the Chinese characters in the strings to be compared, and then calculates the similarity of the Chinese strings to be compared. It includes the following detailed steps:
Step S110: Convert the Chinese characters into four-corner codes using four-corner numbers.
The embodiments of the present invention use four-corner numbers and the edit distance algorithm. The four-corner number is one of the character lookup methods commonly used in Chinese dictionaries; it classifies Chinese characters using up to five Arabic digits. The four-corner lookup method uses the digits 0 to 9 to represent ten stroke shapes at the four corners of a Chinese character, and sometimes appends one supplementary digit. Edit distance is the minimum number of edit operations required to transform one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.
In this step, a rule table for the four-corner lookup method is established in advance, following its mnemonic: 横一垂二三点捺，叉四插五方块六，七角八八九是小，点下有横变零头 (roughly: horizontal stroke 1, vertical stroke 2, dot or right-falling stroke 3; cross 4, pierced 5, box 6; corner 7, 八 shape 8, 小 shape 9; a dot above a horizontal stroke is 0).
The Chinese characters in the strings to be compared are then obtained according to these rules. In Example 1, the Chinese string s1 to be compared is "华傲数据" (Huaao Data) and the Chinese string s2 is "中华骄傲" (Chinese Pride).
With four-corner coding, Chinese characters can be converted into a form represented by four-corner numbers. For example, 华 has the four-corner number 24401 and 花 has 44214, as shown in Table 1:
s1 character   Four-corner number   s2 character   Four-corner number
华             24401                中             50006
傲             28240                华             24401
数             98440                骄             72128
据             57064                傲             28240

Table 1: Four-corner numbers of the Chinese characters
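The table lookup in step S110 can be sketched as a simple dictionary mapping. This is a minimal sketch: `FOUR_CORNER` and `to_four_corner` are hypothetical names, and the table covers only the six characters of Example 1 (the code 50006 for 中 is the value used in the later distance calculation); a real rule table would cover the full character set.

```python
# Four-corner codes for the characters of Example 1 only.
FOUR_CORNER = {
    "华": "24401", "傲": "28240", "数": "98440",
    "据": "57064", "中": "50006", "骄": "72128",
}

def to_four_corner(text: str) -> list:
    """Convert each character of text to its four-corner code.

    Raises KeyError for a character that is not in the table.
    """
    return [FOUR_CORNER[ch] for ch in text]

s1_codes = to_four_corner("华傲数据")  # ['24401', '28240', '98440', '57064']
s2_codes = to_four_corner("中华骄傲")  # ['50006', '24401', '72128', '28240']
```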
Step S120: Calculate the similarity of the Chinese characters using edit distance.
To calculate the edit distance between string s1 and string s2, first define a function edit(i, j): the edit distance from the substring of s1 consisting of its first i characters to the substring of s2 consisting of its first j characters. The following dynamic programming formulas then hold:
(1) if i = 0 and j = 0, edit(i, j) = 0;
(2) if i = 0 and j > 0, edit(i, j) = j;
(3) if i > 0 and j = 0, edit(i, j) = i;
(4) if i ≥ 1 and j ≥ 1, edit(i, j) = min{ edit(i-1, j) + 1, edit(i, j-1) + 1, edit(i-1, j-1) + f(i, j) }, where f(i, j) = 1 when the i-th character of the first string differs from the j-th character of the second string, and f(i, j) = 0 otherwise.
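The four cases above translate directly into a memoized recursive function. The following is a minimal sketch; the name `edit_distance` and the use of `functools.lru_cache` are choices made for this example. A table-filling loop is the more common implementation, but the recursive form mirrors the formulas one to one.

```python
from functools import lru_cache

def edit_distance(s1: str, s2: str) -> int:
    """Edit distance between s1 and s2, following cases (1) to (4)."""

    @lru_cache(maxsize=None)
    def edit(i: int, j: int) -> int:
        if i == 0:
            return j  # cases (1) and (2): insert the first j characters of s2
        if j == 0:
            return i  # case (3): delete the first i characters of s1
        # Case (4): f is 0 when the i-th and j-th characters match, else 1.
        f = 0 if s1[i - 1] == s2[j - 1] else 1
        return min(edit(i - 1, j) + 1,
                   edit(i, j - 1) + 1,
                   edit(i - 1, j - 1) + f)

    return edit(len(s1), len(s2))
```

Calling `edit_distance("24401", "50006")` evaluates the recurrence on the four-corner codes of 华 and 中.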
The flow of calculating Chinese character similarity using edit distance in this step is shown in FIG. 2.
First obtain the lengths m and n, counted in Chinese characters, of the strings s1 and s2 of Example 1; here m = n = 4. Applying formulas (1), (2), and (3) of the edit distance algorithm gives the initialization result shown in Table 2:

          中   华   骄   傲
     0    1    2    3    4
华   1
傲   2
数   3
据   4

Table 2: Initialization result
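The initialization in Table 2 fills only the first row and first column of the matrix. A minimal sketch, with the hypothetical helper name `init_matrix` and `None` standing for cells not yet computed:

```python
def init_matrix(s1: str, s2: str) -> list:
    """Build an (m+1) x (n+1) matrix with only the borders filled in."""
    m, n = len(s1), len(s2)
    d = [[None] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        d[0][j] = j  # formulas (1) and (2): first row is 0..n
    for i in range(m + 1):
        d[i][0] = i  # formula (3): first column is 0..m
    return d

table2 = init_matrix("华傲数据", "中华骄傲")
```

Here `table2[0]` is the first row and `[row[0] for row in table2]` is the first column; every other cell stays `None` until formula (4) is applied.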
Calculate the value of edit(i, j), that is, the edit distance between the substring of length i of s1 and the substring of length j of s2.
First calculate edit(1, 1), the edit distance between the string "华" and the string "中". By dynamic programming formula (4), edit(i, j) = min{ edit(i-1, j) + 1, edit(i, j-1) + 1, edit(i-1, j-1) + f(i, j) }, where f(i, j) is the edit distance between 华 and 中.
The edit distance algorithm is applied to the four-corner codes, which gives the edit distance of the corresponding Chinese characters.
Calculating edit(1, 1) for the codes, as shown in Table 3: edit(0, 1) + 1 = 2, edit(1, 0) + 1 = 2, edit(0, 0) + f(1, 1) = 0 + 1 = 1, and min(edit(0, 1) + 1, edit(1, 0) + 1, edit(0, 0) + f(1, 1)) = 1, so edit(1, 1) = 1.
[Table 3 is provided as an image in the original publication: imgf000008_0001]
Table 3: Edit distance calculation for the Chinese characters "华" (24401) and "中" (50006) (1).
By analogy: edit(2, 1) + 1 = 3, edit(1, 2) + 1 = 3, edit(1, 1) + f(2, 2) = 1 + 1 = 2, so edit(2, 2) = 2. Here s1[2] == '4' and s2[2] == '5'; the two differ, so the operation of swapping adjacent characters is not included when taking the minimum, as shown in Tables 4 and 5.
|   |   | 5 | 0 | 0 | 0 | 6 |
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| 2 | 1 | 1 |   |   |   |   |
| 4 | 2 | 2 |   |   |   |   |
| 4 | 3 |   |   |   |   |   |
| 0 | 4 |   |   |   |   |   |
| 1 | 5 |   |   |   |   |   |
Table 4: Edit distance calculation for the Chinese characters "华" (24401) and "中" (50006)
[Table 5 is provided as an image in the original publication: imgf000009_0001]
Table 5: Edit distance calculation for the Chinese characters "华" (24401) and "中" (50006)
Therefore, the edit distance between "华" (24401) and "中" (50006) is 4, and 4/5 = 0.8. Converting Chinese characters into digit strings for similarity comparison in this step avoids the ambiguity of matching by glyph and pronunciation, and improves the precision of Chinese character matching.
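The character-level calculation above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and dividing by the code length is an assumption read off the 4/5 = 0.8 step in the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance, keeping only two rows of the table."""
    prev = list(range(len(b) + 1))  # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from the length-i prefix of a to the empty string
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def char_distance(code1: str, code2: str) -> float:
    """Edit distance between two four-corner codes, scaled to [0, 1].

    Assumption: the distance is divided by the code length (5 digits in
    the example), which is how the text arrives at 4 / 5 = 0.8.
    """
    return levenshtein(code1, code2) / max(len(code1), len(code2))


# Codes quoted in the text: 华 = 24401, 中 = 50006.
hua_zhong = char_distance("24401", "50006")  # 4 / 5
```

Two identical codes give a distance of 0, and two codes with no digit in common at any position give 1.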
Step S130: Use the Chinese character similarity in place of the edit distance weight.
Using the Chinese character similarity described above in place of the edit distance weight when calculating the similarity of the strings to be compared makes the edit distance algorithm practical for Chinese string matching in a Chinese language environment, and improves the accuracy of the matching results.
Step S140: Calculate the similarity of the strings to be compared based on the improved edit distance.
From step S120, the value of f(i, j) is 0.8, i.e. the edit distance between "华" and "中" is 0.8. Then edit(1, 1) = min{ edit(0, 1) + 1, edit(1, 0) + 1, edit(0, 0) + f(1, 1) } = min{ 1 + 1, 1 + 1, 0 + 0.8 } = 0.8. By analogy, compute edit(i, j) for 0 < i <= m and 0 < j <= n. The results are shown in Table 6:
[Table 6 is provided as an image in the original publication: imgf000010_0001]
Table 6: Edit distance calculation for the Chinese strings "华傲数据" and "中华骄傲"
Therefore, in Example 1 of this embodiment, the edit distance between the Chinese strings "华傲数据" and "中华骄傲" is 3.2, and the similarity is 1 - 3.2/4 = 0.2.
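Steps S130 and S140 can be sketched as a weighted edit distance whose substitution cost is supplied by a character-level function. The sketch below is an assumption-laden illustration: `weighted_edit_distance`, `string_similarity` and `toy_cost` are hypothetical names, and `toy_cost` hard-codes only the one character pair whose distance (0.8) is given in the text, falling back to the unit cost for every other pair.

```python
from typing import Callable


def weighted_edit_distance(s1: str, s2: str,
                           f: Callable[[str, str], float]) -> float:
    """Edit distance where substituting a by b costs f(a, b) in [0, 1]."""
    m, n = len(s1), len(s2)
    edit = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        edit[i][0] = float(i)
    for j in range(n + 1):
        edit[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            edit[i][j] = min(edit[i - 1][j] + 1,
                             edit[i][j - 1] + 1,
                             edit[i - 1][j - 1] + f(s1[i - 1], s2[j - 1]))
    return edit[m][n]


def string_similarity(s1: str, s2: str,
                      f: Callable[[str, str], float]) -> float:
    """Similarity = 1 - distance / length, as in 1 - 3.2/4 from the text."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - weighted_edit_distance(s1, s2, f) / longest


def toy_cost(a: str, b: str) -> float:
    """Hypothetical cost table: only the 华/中 value comes from the text."""
    if a == b:
        return 0.0
    if {a, b} == {"华", "中"}:
        return 0.8
    return 1.0  # assumption: unit cost for pairs with no known code distance


first_cell = weighted_edit_distance("华", "中", toy_cost)  # the edit(1, 1) step
```

Reproducing the full 3.2 result would require the four-corner codes of the remaining characters, which this section does not list, so the sketch stops at the edit(1, 1) cell.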
Another embodiment of the present invention provides a device for calculating Chinese string similarity based on edit distance. The device comprises: a Chinese character similarity calculation device, configured to calculate Chinese character similarity based on the edit distance; and a Chinese string similarity calculation device, configured to obtain the similarity of the strings to be compared.
The Chinese character similarity calculation device comprises a four-corner number rule device and a four-corner code conversion device. The four-corner number rule device is configured to establish in advance a rule table for the four-corner number lookup method, following its coding mnemonic: 横一垂二三点捺, 叉四插五方块六, 七角八八九是小, 点下有横变零头 (horizontal stroke is 1, vertical stroke is 2, dot or right-falling stroke is 3; cross is 4, pierced stroke is 5, box is 6; corner is 7, the 八 shape is 8, the 小 shape is 9; a dot above a horizontal stroke becomes 0).
The Chinese characters in the strings to be compared are obtained according to the above rules; in Example 1 above, the Chinese string s1 is "华傲数据" and the Chinese string s2 is "中华骄傲". The four-corner code conversion device converts the Chinese characters in the strings to be compared into four-corner codes according to the rule information in the four-corner number rule device. For example, the four-corner code of 华 is 24401 and that of 花 is 44214.
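The conversion step can be pictured as a table lookup. The sketch below is hypothetical: the table `FOUR_CORNER_CODES` holds only the three codes quoted in this document (华, 花, 中), whereas a real rule table would cover the full character set, and raising `KeyError` for unknown characters is a design choice made here, not something the text specifies.

```python
from typing import Dict, List

# Only the codes quoted in the text; a real table would be far larger.
FOUR_CORNER_CODES: Dict[str, str] = {
    "华": "24401",
    "花": "44214",
    "中": "50006",
}


def to_four_corner(text: str) -> List[str]:
    """Map each character of `text` to its four-corner code."""
    missing = [ch for ch in text if ch not in FOUR_CORNER_CODES]
    if missing:
        # Fail loudly rather than guess a code for an unknown character.
        raise KeyError("no four-corner code for: " + "".join(missing))
    return [FOUR_CORNER_CODES[ch] for ch in text]


codes = to_four_corner("中华")  # one code per character, in order
```

Keeping one code per character preserves the string's character boundaries, so the later string-level comparison can still work character by character.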
The Chinese character similarity calculation device obtains the digit strings from the four-corner code conversion device and calculates the similarity of the Chinese characters based on the edit distance. The lengths m and n (in Chinese characters) of the Chinese strings s1 and s2 from Example 1 are obtained, i.e. m = n = 4. Applying formulas (1), (2) and (3) of the edit distance algorithm above gives the initialization result, after which edit(i, j) is computed, i.e. the edit distance between the length-i prefix of s1 and the length-j prefix of s2.
First compute edit(1, 1), i.e. the edit distance between the string "华" and the string "中". By dynamic programming formula (4): edit(i, j) = min{ edit(i-1, j) + 1, edit(i, j-1) + 1, edit(i-1, j-1) + f(i, j) }, where f(i, j) is the edit distance between "华" and "中".
The edit distance between the two four-corner codes is computed with the edit distance algorithm and taken as the edit distance between the corresponding Chinese characters. For the two codes, compute edit(1, 1) as shown in Table 3: edit(0, 1) + 1 = 2, edit(1, 0) + 1 = 2, edit(0, 0) + f(1, 1) = 0 + 1 = 1, and min(edit(0, 1) + 1, edit(1, 0) + 1, edit(0, 0) + f(1, 1)) = 1, so edit(1, 1) = 1.
The Chinese string similarity calculation device comprises an edit distance weight device. The edit distance weight device obtains the similarity information of the Chinese characters from the Chinese character similarity calculation device and sets it as the edit distance weight. The Chinese character similarity calculation device obtains the weight information from the edit distance weight device and calculates the similarity of the strings based on the improved edit distance. The value of f(i, j) is 0.8, i.e. the edit distance between "华" and "中" is 0.8. Then edit(1, 1) = min{ edit(0, 1) + 1, edit(1, 0) + 1, edit(0, 0) + f(1, 1) } = min{ 1 + 1, 1 + 1, 0 + 0.8 } = 0.8. By analogy, compute edit(i, j) for 0 < i <= m and 0 < j <= n. Therefore, in Example 1 of this embodiment, the edit distance between the Chinese strings "华傲数据" and "中华骄傲" is 3.2, and the similarity is 1 - 3.2/4 = 0.2.
The embodiments of the present invention provide a method and a device for calculating Chinese string similarity based on edit distance. The invention uses four-corner number coding to convert the Chinese characters in a string into four-corner codes, calculates the similarity of the Chinese characters from the edit distance between those codes, and then uses that character similarity in place of the edit distance weight to calculate the similarity of the strings. Converting Chinese characters into digit strings for comparison improves the precision of Chinese character matching, and using the character similarity in place of the edit distance weight makes the edit distance algorithm practical for Chinese string matching in a Chinese language environment and improves the accuracy of the matching results.
The above is a further detailed description of the present invention with reference to specific preferred embodiments, and the specific implementation of the invention should not be regarded as limited to these descriptions. A person of ordinary skill in the art to which the invention pertains may make a number of simple deductions or substitutions without departing from the concept of the invention.

Claims

1. A method for calculating Chinese string similarity based on edit distance, comprising: calculating the similarity of the Chinese characters in the strings to be compared; and then calculating the similarity of the Chinese strings to be compared, characterized in that the calculation of the Chinese character similarity comprises the following steps:
converting the Chinese characters into four-corner codes using four-corner numbers;
calculating the similarity of the Chinese characters using the edit distance;
and the calculation of the similarity of the Chinese strings to be compared comprises the following steps:
using the similarity of the Chinese characters in place of the edit distance weight;
calculating the similarity of the strings to be compared based on the improved edit distance.
2. The method according to claim 1, characterized in that converting the Chinese characters into four-corner codes comprises:
establishing in advance a rule table for the four-corner number lookup method;
converting the Chinese characters in the strings to be compared into the corresponding four-corner codes according to the rule table.
3. A device for calculating Chinese string similarity based on edit distance, characterized in that the device comprises:
a Chinese character similarity calculation device, configured to calculate Chinese character similarity based on the edit distance; and
a Chinese string similarity calculation device, configured to obtain the similarity of the strings to be compared.
4. The device according to claim 3, characterized in that the Chinese character similarity calculation device comprises a four-corner number rule device and a four-corner code conversion device;
the four-corner number rule device is configured to establish in advance a rule table for the four-corner number lookup method;
the four-corner code conversion device converts the Chinese characters in the strings to be compared into four-corner codes according to the rule information in the four-corner number rule device;
the Chinese character similarity calculation device obtains the digit strings from the four-corner code conversion device and calculates the similarity of the Chinese characters based on the edit distance.
5. The device according to claim 3, characterized in that the Chinese string similarity calculation device comprises an edit distance weight device;
the edit distance weight device obtains the similarity information of the Chinese characters from the Chinese character similarity calculation device and sets it as the edit distance weight;
the Chinese character similarity calculation device obtains the weight information from the edit distance weight device and calculates the similarity of the strings based on the improved edit distance.
PCT/CN2014/083326 2013-07-31 2014-07-30 Method and apparatus for calculating similarity between chinese character strings based on editing distance WO2015014287A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2013103249789A CN103399907A (en) 2013-07-31 2013-07-31 Method and device for calculating similarity of Chinese character strings on the basis of edit distance
CN201310324978.9 2013-07-31

Publications (1)

Publication Number Publication Date
WO2015014287A1 (en) 2015-02-05

Family

ID=49563536

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/083326 WO2015014287A1 (en) 2013-07-31 2014-07-30 Method and apparatus for calculating similarity between chinese character strings based on editing distance

Country Status (2)

Country Link
CN (1) CN103399907A (en)
WO (1) WO2015014287A1 (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040181527A1 (en) * 2003-03-11 2004-09-16 Lockheed Martin Corporation Robust system for interactively learning a string similarity measurement
CN101976253A (en) * 2010-10-27 2011-02-16 重庆邮电大学 Chinese variation text matching recognition method
CN103399907A (en) * 2013-07-31 2013-11-20 深圳市华傲数据技术有限公司 Method and device for calculating similarity of Chinese character strings on the basis of edit distance

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101126406B1 (en) * 2008-11-27 2012-04-20 엔에이치엔(주) Method and System for Determining Similar Word with Input String
CN101561813B (en) * 2009-05-27 2010-09-29 东北大学 Method for analyzing similarity of character string under Web environment
WO2012104943A1 (en) * 2011-02-02 2012-08-09 日本電気株式会社 Join processing device, data management device, and text string similarity join system
CN102122298B (en) * 2011-03-07 2013-02-20 清华大学 Method for matching Chinese similarity


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG, JINGTING. ET AL.: "Research Towards Chinese String Similarity Based on the Clustering Feature of Chinese Characters.", NEW TECHNOLOGY OF LIBRARY AND INFORMATION SERVICE, no. 2, 2011 *


Also Published As

Publication number Publication date
CN103399907A (en) 2013-11-20


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14832813

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14832813

Country of ref document: EP

Kind code of ref document: A1