CN111209447A

CN111209447A - Chinese character string similarity calculation method and device based on sound-shape codes

Info

Publication number: CN111209447A
Application number: CN201910146570.4A
Authority: CN
Inventors: 刘卫国; 宋红磊; 张�浩; 殷泽坤; 张雯
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2019-02-27
Filing date: 2019-02-27
Publication date: 2020-05-29

Abstract

The invention discloses a Chinese character string similarity calculation method and device based on a sound-shape code. The sound-shape code comprises a sound code and a shape code, wherein the sound code consists of numerical codes of the initial and the final, and the shape code consists of the four-corner code, structure code and stroke number of a Chinese character. The mapping rules of the sound-shape code and the pronunciation similarity of some initials/finals are pre-stored. The method comprises the following steps: receiving two character strings to be compared; reading the mapping rules of the sound-shape code, and converting each Chinese character in the two character strings into its sound-shape code representation according to the mapping rules; calculating, with an edit distance algorithm and based on the pronunciation similarity of the initials and finals, the edit distance between each pair of corresponding substrings of the two character strings; and calculating the similarity of the two character strings according to the edit distance. The invention converts character strings into sound-shape code digit strings for comparison, which improves the matching precision of Chinese characters; in addition, it replaces the substitution weight of the edit distance with the edit distance between Chinese characters, so that the similarity of character strings can be calculated more accurately.

Description

Chinese character string similarity calculation method and device based on sound-shape codes

Technical Field

The invention belongs to the technical field of text similarity calculation, and particularly relates to a Chinese character string similarity calculation method and device based on a sound-shape code.

Background

String similarity, a measure of how close two character strings are, is a basic technique in string matching, text comparison and information extraction. Its input is usually two identical or different character strings, and its output is a definite value: the higher the similarity of the two character strings, the larger the returned value. There are many methods for measuring string similarity, including cosine similarity, Euclidean distance, edit distance, Hamming distance, Dice distance, Jaccard distance, Jaro-Winkler distance, and so on. The edit distance, also called the Levenshtein distance, is the minimum number of edits required to convert one character string into another, where an edit replaces one character with another, inserts a character, or deletes a character; computing this minimum number of edits for a pair of character strings is the core of the edit distance algorithm. The Smith-Waterman algorithm is an algorithm for local sequence alignment (as opposed to global alignment) and is often applied to similarity calculation between nucleotide or protein sequences; its aim is not to align the full sequences but to find fragments of the two sequences with high similarity.
When the specific problem of Chinese character string matching in a Chinese language environment is faced, the practicability of a classic edit distance algorithm is reduced; the common font-based algorithm is based on the five-stroke coding of the Chinese characters, and cannot well mine the structural characteristics of the Chinese characters; the Smith-Waterman algorithm is better suited to find segments with high similarity in two strings.

In order to quantify the similarity between Chinese characters, a coding scheme that describes both the pronunciation and the structure of a Chinese character, namely the sound-shape code, has been proposed. However, the inventors found that when the existing sound-shape code encodes initials and finals, some initials/finals share the same code, such as an and ang, or z and zh. Using the same code weakens the difference between such sounds, but this direct equivalent substitution cannot reflect the difference that remains. It ignores the fact that, in real Chinese pronunciation, some easily confused sounds are similar rather than identical, for example the dialect habits of not distinguishing flat-tongue and retroflex initials or front and back nasal finals, and this inevitably affects the accuracy of the similarity comparison.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a Chinese character string similarity calculation method and device based on a sound-shape code. The representation of the existing sound-shape code is simplified so that it comprises the initial, final, four-corner code, structure and stroke number information of a Chinese character, and the substitution weight in the edit distance between character strings is replaced by the edit distance between Chinese characters, so that the distance between two characters is calculated more comprehensively and a more accurate Chinese character string similarity is obtained.

In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:

A Chinese character string similarity calculation method based on a sound-shape code, wherein the sound-shape code comprises a sound code and a shape code, the sound code consists of numerical codes of the initial and the final, and the shape code consists of the four-corner code, structure code and stroke number of a Chinese character; the mapping rules of the sound-shape code and the pronunciation similarity of some initials/finals are pre-stored, and the method comprises the following steps:

receiving two character strings to be compared;

reading a mapping rule of the sound-shape codes, and converting each Chinese character in the two character strings into sound-shape code representation according to the mapping rule;

calculating, with an edit distance algorithm and based on the pronunciation similarity of the initials and finals, the edit distance between each pair of corresponding substrings of the two character strings;

and calculating the similarity of the two character strings according to the editing distance.

Further, the sound-shape code comprises 12 digits: a 2-digit initial code, a 2-digit final code, a 5-digit four-corner code, a 1-digit structure code and a 2-digit stroke number.

Further, the mapping rule of the phonetic-shape code comprises: the mapping rules of Chinese characters to pinyin, strokes, structures and four-corner codes and the mapping rules of initials, finals and structures to numerical codes.

Further, calculating the edit distance between each pair of corresponding substrings of the two character strings with the edit distance algorithm includes:

initializing an edit distance matrix;

and sequentially calculating, according to a dynamic programming strategy, the edit distance between each pair of Chinese characters of the two character strings, and writing the edit distance into the edit distance matrix.

Furthermore, the edit distance from the substring of length i of character string A to the substring of length j of character string B is represented by edit_char(i, j), and the dynamic programming strategy is as follows:

1) if i == 0 && j == 0, edit_char(i, j) = 0;

2) if i == 0 && j > 0, edit_char(i, j) = j;

3) if i > 0 && j == 0, edit_char(i, j) = i;

4) if 0 < i <= m && 0 < j <= n,

edit_char(i, j) = min{edit_char(i-1, j) + 1, edit_char(i, j-1) + 1, edit_char(i-1, j-1) + f(i, j)}, wherein f(i, j) = 1 when the i-th character of character string A is not equal to the j-th character of character string B; otherwise, f(i, j) = 0.

Further, f(i, j) = α*S_sound + β*S_shape

where S_sound is the sound code distance, i.e., the pronunciation similarity, S_shape is the shape code distance, and α and β are adjustment coefficients.

Further, S_shape = θ1 * four-corner code edit distance + θ2 * structure code comparison value + θ3 * stroke number difference / max(stroke numbers of the two Chinese characters), where θ1, θ2 and θ3 are all weight coefficients.

Further, calculating the similarity of the two character strings according to the edit distance includes:

taking the value in the lower right corner of the edit distance matrix as the shortest edit distance, and calculating the Similarity between the two character strings A and B by the following formula:

Similarity = 1 - distance / max(length(A), length(B))

where length(A) and length(B) represent the lengths of strings A and B, respectively, and max(·) represents the maximum function.

One or more embodiments provide a computing device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the Chinese character string similarity calculation method based on the sound-shape code when executing the program.

One or more embodiments provide a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the Chinese character string similarity calculation method based on the sound-shape code.

The above one or more technical solutions have the following beneficial effects:

The invention simplifies the existing representation of the sound-shape code so that it comprises the initial, final, four-corner code, structure and stroke number information of a Chinese character, and the conversion from Chinese characters to sound-shape codes is more efficient. The four-corner code with a supplementary code is adopted, so the glyph can be represented more accurately. The influence of dialect pronunciation habits, such as confusing flat-tongue and retroflex initials or front and back nasal finals, on the measurement of pronunciation similarity is taken into account: for initials and finals with similar pronunciations, such as an and ang or en and eng, different codes are used and a similarity is also set, so that both the similarity between the pronunciations and their difference are reflected. Existing algorithms do not consider this point and only distinguish "identical" from "different". By comparison, the sound-shape code of the invention reflects the similarity between Chinese characters more accurately and comprehensively.

The invention improves the method for calculating the edit distance of character strings by replacing the substitution weight of the edit distance with the edit distance between Chinese characters. Cases in which the same character appears at different positions, such as the common misalignment of character strings, can thus be handled reasonably, so that the comparison between character strings is more comprehensive, the matching precision of character strings is improved, and the similarity of character strings can be calculated more accurately.

The similarity calculation method introduces adjustable weight in the process of calculating the editing distance between Chinese characters, can adjust the weight of the font part and the pronunciation part according to the actual situation, for example, can improve the proportion of the font part in the process of character recognition in a natural scene, and can improve the proportion of the similarity of the pronunciation in the process of voice recognition, thereby being applied to a database retrieval system supporting Chinese character input and voice input.

Drawings

The accompanying drawings, which constitute a part of this specification, are included to provide a further understanding of the application; the illustrative embodiments of the application and their description serve to explain the application and do not limit it.

FIG. 1 is an overall flowchart of a method for calculating the similarity of Chinese character strings based on sound-shape codes according to an embodiment of the present invention;

FIG. 2 is a block diagram of the idea of calculating the similarity between Chinese characters based on sound-shape codes according to an embodiment of the present invention.

Detailed Description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

The embodiments and features of the embodiments in the present application may be combined with each other without conflict.

Example one

The embodiment discloses a Chinese character string similarity calculation method based on a sound-shape code, wherein the sound-shape code comprises a sound code and a shape code, the sound code consists of numerical codes of an initial consonant and a final sound, and the shape code consists of four-corner codes, structural codes and stroke numbers of Chinese characters; pre-storing the mapping rule of the sound-shape code in a database;

as shown in fig. 1, the method comprises the steps of:

step 1: receiving two character strings A and B to be compared;

step 2: reading a mapping rule of the sound-shape codes from a database, and converting each Chinese character in the two character strings into the sound-shape code expression according to the mapping rule;

step 3: calculating the edit distance between each pair of corresponding substrings of the two character strings with an edit distance algorithm;

step 4: calculating the similarity of the character strings according to the edit distance.

The sound-shape code comprises 12 digits: a 2-digit initial code, a 2-digit final code, a 5-digit four-corner code, a 1-digit structure code and a 2-digit stroke number. Its layout is shown in Table 1.

TABLE 1 Composition of the sound-shape code

Initial (2 digits)	Final (2 digits)	Four-corner code (5 digits)	Structure code (1 digit)	Stroke number (2 digits)
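The field layout of Table 1 can be illustrated with a short Python sketch that splits a 12-digit code into its five parts. The class name `SoundShapeCode` and the function `parse_code` are hypothetical names chosen for this illustration; the sample code string used in the usage note is the one listed for the first character of the example in Table 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoundShapeCode:
    """One 12-digit sound-shape code, split as in Table 1 (names are illustrative)."""
    initial: str      # 2 digits: numerical code of the initial
    final: str        # 2 digits: numerical code of the final
    four_corner: str  # 5 digits: four-corner code
    structure: str    # 1 digit: structure code
    strokes: int      # 2 digits: stroke number


def parse_code(code: str) -> SoundShapeCode:
    """Split a 12-digit string into the five fields of Table 1."""
    if len(code) != 12 or not code.isdigit():
        raise ValueError("expected a 12-digit sound-shape code")
    return SoundShapeCode(
        initial=code[0:2],
        final=code[2:4],
        four_corner=code[4:9],
        structure=code[9],
        strokes=int(code[10:12]),
    )
```

For instance, `parse_code("171622770003")` yields initial `17`, final `16`, four-corner code `22770`, structure code `0` and a stroke number of 3.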

The mapping rule of the phonetic-shape codes comprises the following steps: the mapping rules of Chinese characters to pinyin, strokes, structures and four-corner codes and the mapping rules of initials, finals and structures to numerical codes. The mapping rule from the initial consonant and the final vowel to the numerical code is as follows:

TABLE 2 Mapping rules of initials and finals to numerical codes

Final	Code	Final	Code	Final	Code	Final	Code
a	01	ai	07	ie	13	un	19
o	02	ei	08	ve	14	vn	20
e	03	ui	09	er	15	ang	21
i	04	ao	10	an	16	eng	22
u	05	ou	11	en	17	ing	23
v	06	iu	12	in	18	ong	24
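As a minimal sketch, the 24 entries listed in Table 2 can be held in a dictionary and looked up directly. The names `FINAL_CODES` and `final_to_code` are hypothetical; the letter `v` is kept exactly as printed in the table (it presumably stands for ü, as in common pinyin input conventions, which is an assumption here).

```python
# Final -> 2-digit numerical code, copied from Table 2.
FINAL_CODES = {
    "a": "01", "o": "02", "e": "03", "i": "04", "u": "05", "v": "06",
    "ai": "07", "ei": "08", "ui": "09", "ao": "10", "ou": "11", "iu": "12",
    "ie": "13", "ve": "14", "er": "15", "an": "16", "en": "17", "in": "18",
    "un": "19", "vn": "20", "ang": "21", "eng": "22", "ing": "23", "ong": "24",
}


def final_to_code(final: str) -> str:
    """Return the 2-digit code of a final, or raise if it is not in Table 2."""
    try:
        return FINAL_CODES[final]
    except KeyError:
        raise ValueError(f"final {final!r} is not listed in Table 2") from None
```

The codes for the initials are not listed in this table, so the sketch covers the finals only.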

The step 3 comprises the following steps:

initializing an editing distance matrix;

and sequentially calculating the editing distance between every two corresponding substrings of the two character strings according to a dynamic programming strategy, and writing the editing distance into an editing distance matrix.

Specifically, the editing distance between the Chinese characters is adopted to replace the editing distance between the substrings.

The dynamic programming strategy is as follows:

1) if i == 0 && j == 0, edit_char(i, j) = 0;

2) if i == 0 && j > 0, edit_char(i, j) = j;

3) if i > 0 && j == 0, edit_char(i, j) = i;

4) if 0 < i <= m && 0 < j <= n, edit_char(i, j) = min{edit_char(i-1, j) + 1, edit_char(i, j-1) + 1, edit_char(i-1, j-1) + f(i, j)}, where f(i, j) = 1 when the i-th character of the first character string is not equal to the j-th character of the second character string; otherwise, f(i, j) = 0.

Here the function edit_char(i, j) represents the edit distance from the substring of length i of the first character string A to the substring of length j of the second character string B, and m and n are the lengths of A and B.
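The four cases above can be sketched in Python as a matrix fill with a pluggable substitution cost. The function names `edit_distance_matrix` and `unit_cost` are hypothetical; `unit_cost` is the 0/1 cost of case 4, and a character-level distance can be passed in its place.

```python
from typing import Callable, List, Sequence


def unit_cost(x: object, y: object) -> float:
    """f(i, j) of case 4: 0 for equal characters, 1 otherwise."""
    return 0.0 if x == y else 1.0


def edit_distance_matrix(
    a: Sequence, b: Sequence, f: Callable[[object, object], float] = unit_cost
) -> List[List[float]]:
    """Fill edit_char(i, j) for 0 <= i <= m and 0 <= j <= n."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # case 3: edit_char(i, 0) = i
        d[i][0] = float(i)
    for j in range(1, n + 1):          # case 2: edit_char(0, j) = j
        d[0][j] = float(j)
    for i in range(1, m + 1):          # case 4: general recurrence
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,                          # deletion
                d[i][j - 1] + 1,                          # insertion
                d[i - 1][j - 1] + f(a[i - 1], b[j - 1]),  # substitution or match
            )
    return d
```

The lower right entry, `edit_distance_matrix(a, b)[-1][-1]`, is the edit distance between the full strings; with `unit_cost` it is the classic Levenshtein distance.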

For unequal characters, the fixed value 1 is replaced by the distance between the two Chinese characters:

f(i, j) = α*S_sound + β*S_shape

where S_sound is the sound code distance and S_shape is the shape code distance; α and β are coefficient parameters with α + β = 1 that can be adjusted according to the similarity comparison requirements. When character strings A and B are both typed text, the value of β can be increased to emphasize glyph comparison; if either A or B contains text obtained by speech-to-text conversion, the value of α can be increased to emphasize pronunciation comparison. The distances between similar-sounding initials and finals are listed in Table 3.

TABLE 3 Initial/final distance mapping table

Pair		Distance	Pair		Distance
c	ch	0.9	h	f	0.7
s	sh	0.9	eng	ang	0.7
z	zh	0.9	eng	ong	0.7
an	ang	0.9	eng	ing	0.7
en	eng	0.9	ang	ong	0.7
in	ing	0.9	ang	ing	0.7
n	l	0.7	ing	ong	0.7
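A minimal sketch of how Table 3 might be stored for lookup is given below. The names `PAIR_VALUES` and `table3_value` are hypothetical. The values are copied exactly as printed; the text refers to them both as a distance and as a pronunciation similarity, so the sketch only performs the lookup and does not assume how the value enters S_sound or what applies to pairs that are not listed.

```python
from typing import Optional

# Unordered initial/final pairs -> value printed in Table 3.
PAIR_VALUES = {
    frozenset(("c", "ch")): 0.9,
    frozenset(("s", "sh")): 0.9,
    frozenset(("z", "zh")): 0.9,
    frozenset(("an", "ang")): 0.9,
    frozenset(("en", "eng")): 0.9,
    frozenset(("in", "ing")): 0.9,
    frozenset(("n", "l")): 0.7,
    frozenset(("h", "f")): 0.7,
    frozenset(("eng", "ang")): 0.7,
    frozenset(("eng", "ong")): 0.7,
    frozenset(("eng", "ing")): 0.7,
    frozenset(("ang", "ong")): 0.7,
    frozenset(("ang", "ing")): 0.7,
    frozenset(("ing", "ong")): 0.7,
}


def table3_value(x: str, y: str) -> Optional[float]:
    """Return the Table 3 entry for the pair (x, y) in either order, else None."""
    return PAIR_VALUES.get(frozenset((x, y)))
```

Using `frozenset` keys makes the lookup order-independent, so `table3_value("zh", "z")` and `table3_value("z", "zh")` return the same entry.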

S_shape = θ1 * four-corner code edit distance + θ2 * structure code comparison value + θ3 * stroke number difference / max(stroke numbers of the two Chinese characters);

where θ1, θ2 and θ3 are the coefficients of the respective comparison parts (they can be adjusted according to the comparison requirements); the four-corner code edit distance is obtained by table lookup.
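The two weighted sums can be sketched as follows. The function names `shape_distance` and `char_distance` are hypothetical, and the default coefficients (θ1, θ2, θ3) = (0.8, 0.05, 0.15) and (α, β) = (0.2, 0.8) are simply the values used in the worked example of this embodiment, not recommended settings.

```python
def shape_distance(
    four_corner_dist: float,
    structure_diff: float,
    strokes_a: int,
    strokes_b: int,
    theta: tuple = (0.8, 0.05, 0.15),
) -> float:
    """S_shape = t1 * four-corner distance + t2 * structure value + t3 * stroke term."""
    t1, t2, t3 = theta
    # Stroke number difference divided by the larger of the two stroke numbers.
    stroke_term = abs(strokes_a - strokes_b) / max(strokes_a, strokes_b)
    return t1 * four_corner_dist + t2 * structure_diff + t3 * stroke_term


def char_distance(
    s_sound: float, s_shape: float, alpha: float = 0.2, beta: float = 0.8
) -> float:
    """f(i, j) = alpha * S_sound + beta * S_shape."""
    return alpha * s_sound + beta * s_shape
```

With a four-corner code distance of 1.0, identical structures (0) and stroke numbers 3 and 5, `shape_distance` gives 0.86, and `char_distance(1.0, 0.86)` gives 0.888, up to floating-point rounding.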

The step 4 comprises the following steps:

Similarity＝1-distance/max(length(A)，length(B))
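As a small sketch, the formula of step 4 can be written as a function of the final edit distance and the two string lengths. The function name `similarity` is hypothetical, and returning 1.0 for two empty strings is an assumption, since the text does not cover that case.

```python
def similarity(distance: float, len_a: int, len_b: int) -> float:
    """Similarity = 1 - distance / max(length(A), length(B))."""
    longest = max(len_a, len_b)
    if longest == 0:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    return 1.0 - distance / longest
```

For a distance of 0.962 between two strings of length 4, this gives 1 - 0.962/4 = 0.7595.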

As an example, mapping tables are established: a mapping table Table_1 from Chinese characters to pinyin, a mapping table Table_2 from Chinese characters to strokes, a mapping table Table_3 from Chinese characters to structures and a mapping table Table_4 from Chinese characters to four-corner codes. The sound-shape code of a Chinese character is obtained by querying these mapping tables, and it has the following structure: a 2-digit initial code, a 2-digit final code, a 5-digit four-corner code, a 1-digit structure code and a 2-digit stroke number.
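A hedged sketch of this lookup is shown below, with the four tables reduced to two sample characters, 山 and 东, the first two characters of "Shandong University" (山东大学). All names (`TABLE_1` to `TABLE_4`, `INITIAL_CODES`, `FINAL_CODES`, `split_pinyin`, `encode`) are hypothetical. The initial codes `17` for sh and `05` for d are not given in Table 2; they are read off the sample codes in Table 4, and treating y and w as initials is an assumption of this sketch.

```python
# Tiny stand-ins for the four mapping tables (sample entries only).
TABLE_1 = {"山": "shan", "东": "dong"}    # character -> pinyin
TABLE_2 = {"山": 3, "东": 5}              # character -> stroke number
TABLE_3 = {"山": "0", "东": "0"}          # character -> structure code
TABLE_4 = {"山": "22770", "东": "40904"}  # character -> four-corner code

INITIAL_CODES = {"sh": "17", "d": "05"}   # inferred from Table 4, not from Table 2
FINAL_CODES = {"an": "16", "ong": "24"}   # from Table 2

# Two-letter initials come first so that "sh" is matched before "s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_pinyin(pinyin: str) -> tuple:
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if pinyin.startswith(initial):
            return initial, pinyin[len(initial):]
    return "", pinyin  # syllable without an initial


def encode(char: str) -> str:
    """Build the 12-digit sound-shape code of one character from the tables."""
    initial, final = split_pinyin(TABLE_1[char])
    return (INITIAL_CODES[initial] + FINAL_CODES[final]
            + TABLE_4[char] + TABLE_3[char] + f"{TABLE_2[char]:02d}")
```

With these sample entries, `encode("山")` returns `171622770003` and `encode("东")` returns `052440904005`, matching the codes listed for these two characters in Table 4.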

For example: character string A is "Shan Xi Tai Xue" and character string B is "Shandong University" (Shan Dong Da Xue). The sound-shape codes of all the Chinese characters are obtained from the mapping tables, as follows:

TABLE 4 Sound-shape codes of character string A and character string B

Chinese character (string B)	Sound-shape code	Chinese character (string A)	Sound-shape code
Mountain (shan)	171622770003	Mountain (shan)	171622770003
East (dong)	052440904005	West (xi)	140410604011
Big (da)	050140800013	Tai (tai)	060740030014
Study (xue)	141490407208	Study (xue)	141490407208

To obtain the edit distance of character strings A and B, the edit distance between the Chinese characters is calculated first.

First, character strings A and B both have length m = n = 4.

The edit distance matrix is initialized as shown in Table 5:

TABLE 5 Initialized edit distance matrix

The edit distance from the substring of length i of character string A to the substring of length j of character string B is then calculated. First, the edit distance from "mountain" of character string A to "mountain" of character string B is calculated: the two characters are identical, so f(1, 1) = 0 and edit_char(1, 1) = 0;

at this time, as shown in table 6, the edit distance matrix is updated to:

TABLE 6 Edit distance matrix after writing the "mountain" to "mountain" edit distance

		Mountain (shan)	East (dong)	Big (da)	Study (xue)
	0	1	2	3	4
Mountain (shan)	1	0
West (xi)	2
Tai (tai)	3
Study (xue)	4

Then edit_char(1, 2) is calculated, i.e., the edit distance between "mountain" (the first character of string A) and "mountain east" (the first two characters of string B). According to the dynamic programming formula, f(1, 2) must be calculated first.

f(i, j) = α*S_sound + β*S_shape

where S_sound is obtained by looking up the constructed initial/final distance table;

α and β are coefficient parameters that can be adjusted according to the similarity comparison requirements. In this example the calculation is weighted toward glyph similarity, so α = 0.2 and β = 0.8.

Looking up the table gives S_sound = 1 for "mountain" and "east". For S_shape, the edit distance of the four-corner codes is as follows:

TABLE 7 Four-corner code edit distance of "mountain" and "east"

As can be seen from the table, the edit distance of the four-corner codes of "mountain" and "east" is 5/5 = 1.0;

the structures are the same, so the structure distance is 0; the stroke number difference is 2 and the larger stroke number is 5, so S_shape = 0.8*1 + 0.05*0 + 0.15*2/5 = 0.86, and therefore f(1, 2) = 0.2*1 + 0.8*0.86 = 0.888. By the formula edit_char(i, j) = min{edit_char(i-1, j) + 1, edit_char(i, j-1) + 1, edit_char(i-1, j-1) + f(i, j)}, with edit_char(i-1, j) + 1 = 3, edit_char(i, j-1) + 1 = 1 and edit_char(i-1, j-1) + f(i, j) = 1.888, the result is edit_char(1, 2) = 1. By analogy, the final results are shown in Table 8:

TABLE 8 Edit distance matrix of character strings "Shan Xi Tai Xue" and "Shandong University"

		Mountain (shan)	East (dong)	Big (da)	Study (xue)
	0	1	2	3	4
Mountain (shan)	1	0	1	2	3
West (xi)	2	1	0.476	1.476	2.476
Tai (tai)	3	2	1.476	0.962	1.962
Study (xue)	4	3	2.476	1.962	0.962

As can be seen from the table, the shortest edit distance between the character strings "Shan Xi Tai Xue" and "Shandong University" obtained by the improved, sound-shape-code-based edit distance algorithm is 0.962, and the similarity of the character strings is 1 - 0.962/4 ≈ 0.760.
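The matrix in Table 8 can be re-derived with a short Python check. Only some substitution costs are recoverable from the text: f(1, 1) = f(4, 4) = 0 for identical characters, f(1, 2) = 0.888 from the worked step, and the diagonal values 0.476 and 0.486, which are read off Table 8 as 0.476 - 0 and 0.962 - 0.476. Every other cost is unknown and is set to a placeholder of 1.0 here, which is an assumption of this sketch; the names `F`, `UNKNOWN` and `fill_matrix` are hypothetical.

```python
UNKNOWN = 1.0  # placeholder for substitution costs the text does not state

# F[i][j]: cost of substituting the (i+1)-th character of A by the (j+1)-th of B.
F = [
    [0.0,     0.888,   UNKNOWN, UNKNOWN],
    [UNKNOWN, 0.476,   UNKNOWN, UNKNOWN],
    [UNKNOWN, UNKNOWN, 0.486,   UNKNOWN],
    [UNKNOWN, UNKNOWN, UNKNOWN, 0.0],
]


def fill_matrix(costs):
    """Apply the dynamic programming recurrence to a table of substitution costs."""
    m, n = len(costs), len(costs[0])
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + costs[i - 1][j - 1])
    return d


matrix = [[round(v, 3) for v in row] for row in fill_matrix(F)]
distance = matrix[-1][-1]
similarity = round(1 - distance / 4, 3)
```

Under these assumptions the rounded matrix agrees with Table 8 entry by entry, `distance` is 0.962, and the similarity rounds to 0.76.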

In this embodiment, the substitution weight in the edit distance of the Chinese character strings is replaced by the edit distance between Chinese characters, so that cases in which the same character appears at different positions can be handled reasonably, the comparison between character strings is more comprehensive, the matching precision of character strings is improved, and the similarity of character strings can be calculated more accurately.

In one or more embodiments, the similarity calculation method may be used in a database retrieval system, where one of character string A and character string B is the text to be retrieved and the other is a text in the database. The text to be retrieved can be entered directly as text or by voice; in the latter case the system first converts the speech into text and then performs the retrieval.
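How such a retrieval step might use the similarity is sketched below. The function `rank_candidates` and its parameters are hypothetical, and the similarity function is passed in as an argument, so the sketch does not depend on any particular implementation of the method described above.

```python
from typing import Callable, Iterable, List, Tuple


def rank_candidates(
    query: str,
    candidates: Iterable[str],
    similarity: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Score every database text against the query and return the best matches."""
    scored = [(similarity(query, text), text) for text in candidates]
    # Highest similarity first; ties keep their original order (stable sort).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

A linear scan like this is only a starting point; for a large database, some form of pre-filtering before the pairwise comparison would likely be needed.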

Example two

An object of the present embodiment is to provide a computing device.

A computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the memory having pre-stored therein the mapping rules of the sound-shape code, the processor implementing the following steps when executing the program:

receiving two character strings to be compared;

reading the mapping rules of the sound-shape code, and converting each Chinese character in the two character strings into its sound-shape code representation according to the mapping rules; the sound-shape code comprises a sound code and a shape code, wherein the sound code consists of numerical codes of the initial and the final, and the shape code consists of the four-corner code, structure code and stroke number of a Chinese character;

calculating the edit distance between each pair of corresponding substrings of the two character strings with an edit distance algorithm;

Example three

An object of the present embodiment is to provide a computer-readable storage medium.

A computer-readable storage medium having stored thereon the mapping rules of the sound-shape code and a computer program for calculating text similarity, which program, when executed by a processor, performs the steps of:

receiving two character strings to be compared;

The steps involved in the second and third embodiments correspond to the first embodiment of the method, and the detailed description thereof can be found in the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any of the methods of the present invention.

One or more of the above embodiments have the following technical effects:

The existing representation of the sound-shape code is simplified so that it comprises the initial, final, four-corner code, structure and stroke number information of a Chinese character, and the conversion from Chinese characters to sound-shape codes is more efficient. The four-corner code with a supplementary code is adopted, so the glyph can be represented more accurately. For initials and finals with similar pronunciations, such as an and ang or en and eng, different codes are used and a similarity is also set, so that both the similarity between the pronunciations and their difference are reflected, and the similarity between Chinese characters can be identified more accurately.

The method for calculating the edit distance of character strings is improved by replacing the substitution weight of the edit distance with the edit distance between Chinese characters. Cases in which the same character appears at different positions, such as the common misalignment of character strings, can thus be handled reasonably, so that the comparison between character strings is more comprehensive, the matching precision of character strings is improved, and the similarity of character strings can be calculated more accurately.

Adjustable weight is introduced in the process of calculating the editing distance between Chinese characters, the weight of a character shape part and a character sound part can be adjusted according to actual situations, for example, the proportion of the character shape part can be improved in the process of character recognition in natural scenes, and the proportion of the similarity of the character sound can be improved in the process of voice recognition, so that the method can be applied to a database retrieval system supporting Chinese character input and voice input.

Those skilled in the art will appreciate that the modules or steps of the present application described above can be implemented using general purpose computing devices, or alternatively, they can be implemented using program code executable by computing devices, such that they are stored in a storage device and executed by computing devices, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps thereof are fabricated into a single integrated circuit module. The present application is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Although the embodiments of the present application have been described with reference to the accompanying drawings, this is not intended to limit the scope of the present application, and it should be understood that those skilled in the art can make various modifications and variations without inventive effort.

Claims

1. A Chinese character string similarity calculation method based on a sound-shape code, characterized in that the sound-shape code comprises a sound code and a shape code, wherein the sound code consists of numerical codes of the initial and the final, and the shape code consists of the four-corner code, structure code and stroke number of a Chinese character; the mapping rules of the sound-shape code and the pronunciation similarity of some initials/finals are pre-stored; and the method comprises:

receiving two character strings to be compared;

reading the mapping rules of the sound-shape code, and converting each Chinese character in the two character strings into its sound-shape code representation according to the mapping rules;

calculating, with an edit distance algorithm and based on the pronunciation similarity of the initials/finals, the edit distance between each pair of corresponding substrings of the two character strings;

calculating the similarity of the two character strings according to the edit distance.

2. The Chinese character string similarity calculation method based on a sound-shape code as claimed in claim 1, characterized in that the sound-shape code comprises 12 digits: a 2-digit initial code, a 2-digit final code, a 5-digit four-corner code, a 1-digit structure code and a 2-digit stroke number.

3. The Chinese character string similarity calculation method based on a sound-shape code as claimed in claim 1, characterized in that the mapping rules of the sound-shape code comprise: the mapping rules of Chinese characters to pinyin, strokes, structures and four-corner codes, and the mapping rules of initials, finals and structures to numerical codes.

4. The Chinese character string similarity calculation method based on a sound-shape code as claimed in claim 1, characterized in that calculating, with the edit distance algorithm, the edit distance between each pair of corresponding substrings of the two character strings comprises:

initializing an edit distance matrix;

sequentially calculating, according to a dynamic programming strategy, the edit distance between the Chinese characters of the two character strings, and writing it into the edit distance matrix.

5. The Chinese character string similarity calculation method based on a sound-shape code as claimed in claim 4, characterized in that edit_char(i, j) represents the edit distance from the substring of length i of string A to the substring of length j of string B, and the dynamic programming strategy is as follows:

1) if i == 0 && j == 0, edit_char(i, j) = 0;

2) if i == 0 && j > 0, edit_char(i, j) = j;

3) if i > 0 && j == 0, edit_char(i, j) = i;

4) if 0 < i <= m && 0 < j <= n,

edit_char(i, j) = min{edit_char(i-1, j) + 1, edit_char(i, j-1) + 1, edit_char(i-1, j-1) + f(i, j)}, where f(i, j) = 1 when the i-th character of string A is not equal to the j-th character of string B; otherwise, f(i, j) = 0.

6. The Chinese character string similarity calculation method based on a sound-shape code as claimed in claim 5, characterized in that

f(i, j) = α*S_sound + β*S_shape

where S_sound is the sound code distance, i.e., the pronunciation similarity, S_shape is the shape code distance, and α and β are adjustment coefficients.

7. The Chinese character string similarity calculation method based on a sound-shape code as claimed in claim 6, characterized in that S_shape = θ1 * four-corner code edit distance + θ2 * structure code comparison value + θ3 * stroke number difference / the maximum stroke number of the two Chinese characters, where θ1, θ2 and θ3 are all weight coefficients.

8. The Chinese character string similarity calculation method based on a sound-shape code as claimed in claim 6, characterized in that calculating the similarity of the two character strings according to the edit distance comprises:

taking the value in the lower right corner of the edit distance matrix as the shortest edit distance, and using the following formula to calculate the similarity of the two strings A and B:

Similarity = 1 - distance / max(length(A), length(B))

where length(A) and length(B) represent the lengths of strings A and B, respectively, and max(·) represents the function that takes the maximum value.

9. A computing device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the Chinese character string similarity calculation method based on a sound-shape code as claimed in any one of claims 1-8.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the Chinese character string similarity calculation method based on a sound-shape code as claimed in any one of claims 1-8.