CN111209447A - Chinese character string similarity calculation method and device based on sound-shape codes - Google Patents
- Publication number
- CN111209447A CN201910146570.4A
- Authority
- CN
- China
- Prior art keywords
- codes
- similarity
- sound
- character strings
- shape
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Document Processing Apparatus (AREA)
Abstract
The invention discloses a Chinese character string similarity calculation method and device based on sound-shape codes. A sound-shape code comprises a sound code and a shape code: the sound code consists of numerical codes of the initial and the final, and the shape code consists of the four-corner code, structure code and stroke number of a Chinese character. The mapping rules of the sound-shape code and the pronunciation similarity of some initials/finals are stored in advance. The method comprises the following steps: receiving two character strings to be compared; reading the mapping rules of the sound-shape code and converting each Chinese character in the two character strings into its sound-shape code representation according to the mapping rules; calculating, based on the pronunciation similarity of the initials and finals, the edit distance between each pair of corresponding substrings of the two character strings; and calculating the similarity of the two character strings according to the edit distance. By converting the character strings into sound-shape code digit strings for comparison, the invention improves the matching precision of Chinese characters; in addition, it replaces the substitution weight of the edit distance with the edit distance between Chinese characters, so that the similarity of character strings can be calculated more accurately.
Description
Technical Field
The invention belongs to the technical field of text similarity calculation, and particularly relates to a Chinese character string similarity calculation method and device based on a sound-shape code.
Background
String similarity measures the degree of closeness between two character strings and is a basic technique in string matching, text comparison and information extraction. Its input is usually two identical or different character strings and its output is a single value; the more similar the two strings, the larger the returned value. Many measures of string similarity exist, including cosine similarity, Euclidean distance, edit distance, Hamming distance, Dice distance, Jaccard distance and Jaro-Winkler (J-W) distance. The edit distance, also called the Levenshtein distance, is the minimum number of edit operations required to convert one character string into another, where an edit operation replaces one character with another, inserts a character or deletes a character; calculating this minimum for a pair of strings is the core of the algorithm. The Smith-Waterman algorithm performs local sequence alignment (as opposed to global alignment) and is often applied to similarity calculation between nucleotide or protein sequences; its aim is not to align the full sequences but to find fragments of the two sequences with high similarity.
When applied to the specific problem of matching Chinese character strings, the classic edit distance algorithm becomes less practical; common glyph-based algorithms rely on the five-stroke (Wubi) coding of Chinese characters and cannot adequately capture their structural features; and the Smith-Waterman algorithm is better suited to finding segments with high similarity in two strings.
To quantify the similarity between Chinese characters, a coding scheme that describes both the pronunciation and the structure of a Chinese character, namely the sound-shape code, has been proposed. However, the inventors found that when this sound-shape code encodes initials and finals, some initials/finals share the same code, such as an and ang, or z and zh. Using the same code weakens the difference between them, but such direct equivalent substitution cannot reflect the difference that remains. It also ignores the fact that some sounds are easily confused in real Chinese pronunciation, for example the dialect habits of not distinguishing flat-tongue from retroflex initials, or front from back nasal finals. Both effects inevitably reduce the accuracy of similarity comparison.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a Chinese character string similarity calculation method and device based on a sound-shape code. The existing sound-shape code representation is simplified so that it comprises the initial, final, four-corner code, structure and stroke number information of a Chinese character, and the edit distance between Chinese characters is used as the substitution cost of the string edit distance, so that the edit distance between two character strings is calculated more comprehensively and a more accurate Chinese character string similarity is obtained.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
a Chinese character string similarity calculation method based on a sound-shape code, wherein the sound-shape code comprises a sound code and a shape code, the sound code consists of numerical codes of the initial and the final, and the shape code consists of the four-corner code, structure code and stroke number of a Chinese character; the mapping rules of the sound-shape code and the pronunciation similarity of some initials/finals are stored in advance, and the method comprises the following steps:
receiving two character strings to be compared;
reading a mapping rule of the sound-shape codes, and converting each Chinese character in the two character strings into sound-shape code representation according to the mapping rule;
calculating, based on the pronunciation similarity of the initials and finals, the edit distance between each pair of corresponding substrings of the two character strings;
and calculating the similarity of the two character strings according to the editing distance.
Further, the sound-shape code comprises 12 digits: a 2-digit initial, a 2-digit final, a 5-digit four-corner code, a 1-digit structure code and a 2-digit stroke number.
Further, the mapping rules of the sound-shape code comprise: mapping rules from Chinese characters to pinyin, strokes, structures and four-corner codes, and mapping rules from initials, finals and structures to numerical codes.
Further, calculating the edit distance between each pair of corresponding substrings of the two character strings comprises:
initializing an edit distance matrix;
and sequentially calculating, according to a dynamic programming strategy, the edit distance between each pair of corresponding Chinese characters of the two character strings, and writing the edit distance into the edit distance matrix.
Furthermore, edit_char(i,j) denotes the edit distance from the substring of length i of character string A to the substring of length j of character string B, where m and n are the lengths of A and B, and the dynamic programming strategy is as follows:
1) if i==0 && j==0, edit_char(i,j) = 0;
2) if i==0 && j>0, edit_char(i,j) = j;
3) if i>0 && j==0, edit_char(i,j) = i;
4) if 0<i<=m && 0<j<=n, edit_char(i,j) = min{ edit_char(i-1,j)+1, edit_char(i,j-1)+1, edit_char(i-1,j-1)+f(i,j) }, wherein f(i,j) = 1 when the i-th character of character string A is not equal to the j-th character of character string B; otherwise, f(i,j) = 0.
Further, f(i,j) = α*S_sound + β*S_shape,
wherein S_sound is the sound code distance, which reflects the pronunciation similarity, S_shape is the shape code distance, and α and β are adjustment coefficients.
Further, S_shape = θ1 × four-corner code edit distance + θ2 × structure code comparison value + θ3 × stroke number difference / max(stroke numbers of the two Chinese characters), wherein θ1, θ2 and θ3 are all weight coefficients.
Further, calculating the similarity of the two character strings according to the edit distance includes:
taking the value in the lower right corner of the edit distance matrix as the shortest edit distance, denoted distance, and calculating the Similarity between the two character strings A and B by the following formula:
Similarity = 1 - distance / max(length(A), length(B))
where length(A) and length(B) represent the lengths of strings A and B, respectively, and max(·) represents the maximum function.
One or more embodiments provide a computing device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the sound-shape-code-based Chinese character string similarity calculation method when executing the program.
One or more embodiments provide a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the sound-shape-code-based Chinese character string similarity calculation method.
The above one or more technical solutions have the following beneficial effects:
The invention simplifies the existing sound-shape code representation so that it comprises the initial, final, four-corner code, structure and stroke number information of a Chinese character, and conversion from Chinese characters to sound-shape codes is more efficient. The four-corner code with its supplementary digit is adopted, which represents the glyph more accurately. The influence of dialect pronunciation habits, such as confusing flat-tongue and retroflex initials or front and back nasal finals, on the measurement of pronunciation similarity is taken into account: initials and finals with similar pronunciations, such as an and ang or en and eng, are represented by different codes while a similarity value is set between them, so that both the similarity and the difference between pronunciations are reflected. Existing algorithms do not consider this point and only distinguish identical from different. In comparison, the sound-shape code therefore reflects the similarity between Chinese characters more accurately and comprehensively.
The invention improves the calculation of the string edit distance by replacing the substitution weight of the edit distance with the edit distance between Chinese characters. Cases in which the same character appears at different positions, such as common string misalignment, are handled reasonably, so that the comparison between character strings is more comprehensive, the matching precision is improved, and the similarity of character strings can be calculated more accurately.
The similarity calculation method introduces adjustable weights into the calculation of the edit distance between Chinese characters, so the weights of the glyph part and the pronunciation part can be adjusted to the actual situation. For example, the proportion of the glyph part can be raised for character recognition in natural scenes, and the proportion of pronunciation similarity can be raised for speech recognition, so that the method can be applied to a database retrieval system supporting both Chinese character input and voice input.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the application; they illustrate embodiment(s) of the application and, together with the description, serve to explain the application and not to limit it.
FIG. 1 is a flowchart illustrating the overall method for calculating the similarity of Chinese character strings based on sound-shape codes according to an embodiment of the present invention;
FIG. 2 is a block diagram of the idea of performing similarity calculation between Chinese characters based on sound-shape codes according to an embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Example one
The embodiment discloses a Chinese character string similarity calculation method based on a sound-shape code, wherein the sound-shape code comprises a sound code and a shape code, the sound code consists of numerical codes of the initial and the final, and the shape code consists of the four-corner code, structure code and stroke number of a Chinese character; the mapping rules of the sound-shape code are pre-stored in a database;
as shown in fig. 1, the method comprises the steps of:
step 1: receiving two character strings A and B to be compared;
step 2: reading a mapping rule of the sound-shape codes from a database, and converting each Chinese character in the two character strings into the sound-shape code expression according to the mapping rule;
step 3: calculating the edit distance between each pair of corresponding substrings of the two character strings;
step 4: calculating the similarity of the character strings according to the edit distance.
The sound-shape code comprises 12 digits: a 2-digit initial, a 2-digit final, a 5-digit four-corner code, a 1-digit structure code and a 2-digit stroke number. Its composition is shown in Table 1.
TABLE 1 sound-shape code composition
Initial (2 digits) | Final (2 digits) | Four-corner code (5 digits) | Structure code (1 digit) | Stroke number (2 digits) |
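As a non-limiting illustration of the layout in Table 1, the following Python sketch splits a 12-digit sound-shape code into its five fields. The names `SoundShapeCode` and `parse_code` are hypothetical and do not appear in the original text; the sample value is the code listed for the first character in Table 4.

```python
from typing import NamedTuple


class SoundShapeCode(NamedTuple):
    """Fields of a 12-digit sound-shape code (hypothetical container)."""
    initial: str      # 2 digits: numerical code of the initial
    final: str        # 2 digits: numerical code of the final
    four_corner: str  # 5 digits: four-corner code
    structure: str    # 1 digit: structure code
    strokes: int      # 2 digits: stroke number


def parse_code(code: str) -> SoundShapeCode:
    """Split a 12-digit sound-shape code into the fields of Table 1."""
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        raise ValueError("expected a string of 12 decimal digits")
    return SoundShapeCode(
        initial=code[0:2],
        final=code[2:4],
        four_corner=code[4:9],
        structure=code[9],
        strokes=int(code[10:12]),
    )


print(parse_code("171622770003"))
```

Keeping the four-corner and structure fields as strings preserves leading zeros, while the stroke number is converted to an integer because it is later used in arithmetic.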
The mapping rules of the sound-shape code comprise: mapping rules from Chinese characters to pinyin, strokes, structures and four-corner codes, and mapping rules from initials, finals and structures to numerical codes. The mapping rules from initials and finals to numerical codes are as follows:
TABLE 2 mapping rules of initials and finals to numerical codes
Final | Code | Final | Code | Final | Code | Final | Code |
a | 01 | ai | 07 | ie | 13 | un | 19 |
o | 02 | ei | 08 | ve | 14 | vn | 20 |
e | 03 | ui | 09 | er | 15 | ang | 21 |
i | 04 | ao | 10 | an | 16 | eng | 22 |
u | 05 | ou | 11 | en | 17 | ing | 23 |
v | 06 | iu | 12 | in | 18 | ong | 24 |
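The rows of Table 2 can be held in a simple lookup structure. The sketch below is only an illustration: the names `FINAL_CODES` and `final_code` are hypothetical, and the letter v is assumed to stand for ü, as in common pinyin input conventions.

```python
# Final -> 2-digit numerical code, transcribed from the rows of Table 2.
FINAL_CODES = {
    "a": "01", "o": "02", "e": "03", "i": "04", "u": "05", "v": "06",
    "ai": "07", "ei": "08", "ui": "09", "ao": "10", "ou": "11", "iu": "12",
    "ie": "13", "ve": "14", "er": "15", "an": "16", "en": "17", "in": "18",
    "un": "19", "vn": "20", "ang": "21", "eng": "22", "ing": "23", "ong": "24",
}


def final_code(final: str) -> str:
    """Return the numerical code of a final; raises KeyError if it is not in Table 2."""
    return FINAL_CODES[final.lower()]


print(final_code("an"), final_code("ang"))
```

Under this table an and ang receive different codes (16 and 21), which is what allows their closeness to be expressed separately by a pronunciation similarity value.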
Step 3 comprises the following steps:
initializing an edit distance matrix;
and sequentially calculating, according to a dynamic programming strategy, the edit distance between each pair of corresponding substrings of the two character strings, and writing the edit distance into the edit distance matrix.
Specifically, the edit distance between Chinese characters is used as the substitution cost when calculating the edit distance between substrings.
The function edit_char(i,j) represents the edit distance from the substring of length i of the first character string A to the substring of length j of the second character string B, where m and n are the lengths of A and B. The dynamic programming strategy is as follows:
1) if i==0 && j==0, edit_char(i,j) = 0;
2) if i==0 && j>0, edit_char(i,j) = j;
3) if i>0 && j==0, edit_char(i,j) = i;
4) if 0<i<=m && 0<j<=n, edit_char(i,j) = min{ edit_char(i-1,j)+1, edit_char(i,j-1)+1, edit_char(i-1,j-1)+f(i,j) }, wherein f(i,j) = 1 when the i-th character of the first character string is not equal to the j-th character of the second character string; otherwise, f(i,j) = 0.
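The dynamic programming strategy above can be sketched as follows. This is a minimal illustration rather than the patented implementation: `edit_matrix` and `unit_cost` are hypothetical names, and the substitution cost f is passed in as a function so that the 0/1 cost of rule 4 can later be exchanged for a character-level distance.

```python
from typing import Callable, List, Sequence


def unit_cost(a: str, b: str) -> float:
    """Substitution cost of rule 4: 0 for equal characters, 1 otherwise."""
    return 0.0 if a == b else 1.0


def edit_matrix(a: Sequence[str], b: Sequence[str],
                cost: Callable[[str, str], float] = unit_cost) -> List[List[float]]:
    """Fill the (m+1) x (n+1) matrix of edit_char(i, j) values."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):      # rule 3: edit_char(i, 0) = i
        d[i][0] = float(i)
    for j in range(1, n + 1):      # rule 2: edit_char(0, j) = j
        d[0][j] = float(j)
    for i in range(1, m + 1):      # rule 4: minimum over deletion, insertion, substitution
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost(a[i - 1], b[j - 1]))
    return d


print(edit_matrix("kitten", "sitting")[-1][-1])  # 3.0
```

The value in the lower right corner of the returned matrix is the edit distance between the two full strings.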
Further, f(i,j) is computed as a weighted distance between the two Chinese characters:
f(i,j) = α*S_sound + β*S_shape
wherein S_sound is the sound code distance, obtained from the initial/final distance table (Table 3), and S_shape is the shape code distance; α and β are coefficient parameters with α + β = 1 that can be adjusted according to the similarity comparison requirements. When character strings A and B are both typed text, the value of β can be increased accordingly to emphasize glyph comparison; if either of the character strings was obtained by speech-to-text conversion, the value of α can be increased appropriately to emphasize pronunciation comparison.
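The weighting of the sound and shape parts can be illustrated with a short sketch. The function name `char_cost` is hypothetical; the only constraint taken from the text is α + β = 1, so β is derived from α here.

```python
def char_cost(s_sound: float, s_shape: float, alpha: float) -> float:
    """Weighted character distance f = alpha * S_sound + beta * S_shape, with beta = 1 - alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] so that alpha + beta = 1")
    beta = 1.0 - alpha
    return alpha * s_sound + beta * s_shape


# A small alpha emphasizes the glyph; a large alpha emphasizes pronunciation.
print(char_cost(s_sound=1.0, s_shape=0.5, alpha=0.25))  # 0.625
```

Deriving β from α keeps the two weights consistent, at the price of not allowing them to be set independently.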
TABLE 3 initial/final distance mapping table
Initial/final | Initial/final | Distance | Initial/final | Initial/final | Distance |
c | ch | 0.9 | h | f | 0.7 |
s | sh | 0.9 | eng | ang | 0.7 |
z | zh | 0.9 | eng | ong | 0.7 |
an | ang | 0.9 | eng | ing | 0.7 |
en | eng | 0.9 | ang | ong | 0.7 |
in | ing | 0.9 | ang | ing | 0.7 |
n | l | 0.7 | ing | ong | 0.7 |
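Table 3 is symmetric in its two arguments, which suggests an order-independent lookup. The sketch below only performs that lookup; the names `TABLE_3` and `table3_value` are hypothetical, and because the text does not spell out how the tabulated values are combined into S_sound for a whole syllable, that step is deliberately left to the caller.

```python
from typing import Optional

# Rows of Table 3 as (sound, sound, value) triples.
_PAIRS = [
    ("c", "ch", 0.9), ("s", "sh", 0.9), ("z", "zh", 0.9),
    ("an", "ang", 0.9), ("en", "eng", 0.9), ("in", "ing", 0.9),
    ("n", "l", 0.7), ("h", "f", 0.7),
    ("eng", "ang", 0.7), ("eng", "ong", 0.7), ("eng", "ing", 0.7),
    ("ang", "ong", 0.7), ("ang", "ing", 0.7), ("ing", "ong", 0.7),
]

# A frozenset key makes the lookup independent of argument order.
TABLE_3 = {frozenset((x, y)): value for x, y, value in _PAIRS}


def table3_value(x: str, y: str) -> Optional[float]:
    """Return the Table 3 entry for a pair of initials/finals, or None if the pair is not listed."""
    return TABLE_3.get(frozenset((x, y)))


print(table3_value("zh", "z"), table3_value("b", "p"))
```

Returning None for unlisted pairs keeps the sketch from guessing a default for sounds the table does not cover.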
S_shape = θ1 × four-corner code edit distance + θ2 × structure code comparison value + θ3 × stroke number difference / max(stroke number);
wherein θ1, θ2 and θ3 are the coefficients of the respective comparison parts (adjustable according to comparison requirements), and the four-corner code edit distance is obtained by table lookup.
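A minimal sketch of the shape code distance is given below. The function name `shape_distance` is hypothetical. Two assumptions are made and should be read as such: the structure code comparison value is taken as 0 for equal structure codes and 1 otherwise, and the four-corner code edit distance is supplied by the caller, since the text obtains it by table lookup.

```python
def shape_distance(four_corner_dist: float,
                   structure_a: str, structure_b: str,
                   strokes_a: int, strokes_b: int,
                   theta1: float, theta2: float, theta3: float) -> float:
    """S_shape = theta1 * four-corner distance + theta2 * structure comparison
    + theta3 * stroke difference / max(stroke numbers)."""
    # Assumption: 0 for identical structure codes, 1 for different ones.
    structure_cmp = 0.0 if structure_a == structure_b else 1.0
    max_strokes = max(strokes_a, strokes_b)
    # Guard against division by zero when both stroke numbers are 0.
    stroke_term = abs(strokes_a - strokes_b) / max_strokes if max_strokes else 0.0
    return theta1 * four_corner_dist + theta2 * structure_cmp + theta3 * stroke_term


print(shape_distance(0.5, "0", "2", 4, 8, theta1=0.5, theta2=0.25, theta3=0.25))  # 0.625
```

The stroke term is normalized by the larger of the two stroke numbers so that it stays between 0 and 1 like the other two components.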
Step 4 comprises the following steps:
taking the value in the lower right corner of the edit distance matrix as the shortest edit distance, denoted distance, and calculating the Similarity between the two character strings A and B by the following formula:
Similarity = 1 - distance / max(length(A), length(B))
where length(A) and length(B) represent the lengths of strings A and B, respectively, and max(·) represents the maximum function.
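The similarity formula translates directly into code. In the sketch below the function name `similarity` is hypothetical, and the treatment of two empty strings as fully similar is an assumption, since the formula is undefined when both lengths are 0.

```python
def similarity(distance: float, len_a: int, len_b: int) -> float:
    """Similarity = 1 - distance / max(length(A), length(B))."""
    longest = max(len_a, len_b)
    if longest == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    return 1.0 - distance / longest


print(similarity(1.0, 4, 4))  # 0.75
```

Because insertions and deletions cost 1 and the character distance f does not exceed 1 when its components are normalized, the edit distance cannot exceed the longer length and the result stays within [0, 1].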
As an example, mapping tables are established: Table_1 from Chinese characters to pinyin, Table_2 from Chinese characters to strokes, Table_3 from Chinese characters to structures and Table_4 from Chinese characters to four-corner codes. The sound-shape code of a Chinese character is obtained by querying these mapping tables and has the following structure: a 2-digit initial, a 2-digit final, a 5-digit four-corner code, a 1-digit structure code and a 2-digit stroke number.
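How the four mapping tables might be combined into a sound-shape code is sketched below. All table contents and names are hypothetical single-entry stand-ins: the entry assumes that the "mountain" character of the worked example is 山 (shan), and the numerical code 17 for the initial sh is read off that character's code in Table 4 rather than from a published initial table.

```python
# Hypothetical one-entry stand-ins for Table_1 .. Table_4.
PINYIN = {"山": ("sh", "an")}     # Table_1: character -> (initial, final)
STROKES = {"山": 3}               # Table_2: character -> stroke number
STRUCTURE = {"山": "0"}           # Table_3: character -> structure code
FOUR_CORNER = {"山": "22770"}     # Table_4: character -> four-corner code

INITIAL_CODES = {"sh": "17"}      # assumption: read off the example code in Table 4
FINAL_CODES = {"an": "16"}        # from Table 2


def sound_shape_code(ch: str) -> str:
    """Concatenate initial, final, four-corner, structure and stroke fields (12 digits)."""
    initial, final = PINYIN[ch]
    return (INITIAL_CODES[initial]
            + FINAL_CODES[final]
            + FOUR_CORNER[ch]
            + STRUCTURE[ch]
            + f"{STROKES[ch]:02d}")


print(sound_shape_code("山"))  # 171622770003
```

The stroke number is zero-padded to two digits so that every code has the same length of 12.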
For example, character string A is "shan xi tai xue" and character string B is "shan dong da xue" (Shandong University). The sound-shape codes of all the Chinese characters, obtained from the mapping tables, are as follows:
TABLE 4 sound-shape codes for character string A and character string B
Chinese character | Sound-shape code | Chinese character | Sound-shape code |
shan (mountain) | 171622770003 | shan (mountain) | 171622770003 |
dong (east) | 052440904005 | xi (west) | 140410604011 |
da (big) | 050140800013 | tai | 060740030014 |
xue (study) | 141490407208 | xue (study) | 141490407208 |
To obtain the edit distance between character strings A and B, the edit distance between the Chinese characters is calculated first.
First, character strings A and B both have length m = n = 4.
The initialized edit distance matrix is shown in Table 5:
TABLE 5 initialized edit distance matrix
The edit distance from the substring of length i of character string A to the substring of length j of character string B is then calculated. First, the edit distance from "shan" in character string A to "shan" in character string B is calculated: the two characters are identical, so f(1,1) = 0 and edit_char(1,1) = 0;
at this time, the edit distance matrix is updated as shown in Table 6:
TABLE 6 edit distance matrix after writing the "shan" to "shan" edit distance
A \ B |  | shan | dong | da | xue |
 | 0 | 1 | 2 | 3 | 4 |
shan | 1 | 0 |  |  |  |
xi | 2 |  |  |  |  |
tai | 3 |  |  |  |  |
xue | 4 |  |  |  |  |
Then edit_char(1,2), i.e. the edit distance between "shan" and "shan dong", is calculated. According to the dynamic programming formula, f(1,2) must be calculated first.
f(i,j) = α*S_sound + β*S_shape
wherein S_sound is the distance between the initials and finals, obtained from the constructed initial/final distance table;
S_shape = θ1 × four-corner code edit distance + θ2 × structure code comparison value + θ3 × stroke number difference / max(stroke number);
α and β are coefficient parameters that can be adjusted according to the similarity comparison requirements. In this example the calculation is weighted toward glyph similarity, so α and β are set to 0.2 and 0.8 respectively.
Table lookup gives S_sound = 1 for "shan" and "dong". Within S_shape, the four-corner code edit distance is as follows:
TABLE 7 four-corner code edit distance between "shan" and "dong"
As can be seen from the table, the edit distance between the four-corner codes of "shan" and "dong" is 5/5 = 1.0;
the structures are the same, so the structure distance is 0, and the stroke number difference is 2, so S_shape = 0.8 × 1 + 0.05 × 0 + 0.15 × 2/5 = 0.86 and f(1,2) = 0.2 × 1 + 0.8 × 0.86 = 0.888. By the formula edit_char(i,j) = min{ edit_char(i-1,j)+1, edit_char(i,j-1)+1, edit_char(i-1,j-1)+f(i,j) }, with edit_char(i-1,j)+1 = 3, edit_char(i,j-1)+1 = 1 and edit_char(i-1,j-1)+f(i,j) = 1.888, it follows that edit_char(1,2) = 1. By analogy, the final results are shown in Table 8:
TABLE 8 edit distance matrix of character strings "shan xi tai xue" and "shan dong da xue"
A \ B |  | shan | dong | da | xue |
 | 0 | 1 | 2 | 3 | 4 |
shan | 1 | 0 | 1 | 2 | 3 |
xi | 2 | 1 | 0.476 | 1.476 | 2.476 |
tai | 3 | 2 | 1.476 | 0.962 | 1.962 |
xue | 4 | 3 | 2.476 | 1.962 | 0.962 |
As can be seen from the table, the shortest edit distance between the character strings "shan xi tai xue" and "shan dong da xue" obtained by the improved sound-shape-code-based algorithm is 0.962, and the similarity of the character strings is 1 - 0.962/4 = 0.7595, i.e. approximately 0.760.
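The worked example can be retraced numerically with a short sketch. The character costs below are those quoted in the text or read off Table 8 (0.476 for the second pair of characters, and 0.962 - 0.476 = 0.486 for the third); the text does not quantify the remaining pairs, so a cost of 1.0 is assumed for them. The romanized labels and all identifiers are hypothetical.

```python
ALPHA, BETA = 0.2, 0.8                      # sound / shape weights of the example
THETA1, THETA2, THETA3 = 0.8, 0.05, 0.15    # shape component weights of the example

# f for the first character of A against the second character of B:
# S_sound = 1, four-corner distance 5/5, same structure, stroke numbers 3 and 5.
s_shape = THETA1 * (5 / 5) + THETA2 * 0 + THETA3 * (5 - 3) / max(3, 5)
f_first = ALPHA * 1 + BETA * s_shape

KNOWN_COSTS = {
    ("shan", "shan"): 0.0,
    ("shan", "dong"): f_first,
    ("xi", "dong"): 0.476,    # Table 8, cell (2, 2)
    ("tai", "da"): 0.486,     # 0.962 - 0.476, from Table 8
    ("xue", "xue"): 0.0,
}
DEFAULT_COST = 1.0            # assumption for pairs the text does not quantify

A = ["shan", "xi", "tai", "xue"]
B = ["shan", "dong", "da", "xue"]


def cost(a: str, b: str) -> float:
    return KNOWN_COSTS.get((a, b), DEFAULT_COST)


def edit_matrix(a, b):
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost(a[i - 1], b[j - 1]))
    return d


d = edit_matrix(A, B)
distance = d[-1][-1]
sim = 1 - distance / max(len(A), len(B))
print(f"f = {f_first:.3f}, distance = {distance:.3f}, similarity = {sim:.4f}")
```

Under these assumptions the sketch yields f = 0.888 for the first step, a final distance of 0.962 and a similarity of 0.7595, in agreement with Table 8 and the figure quoted above.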
In this embodiment, the substitution cost of the string edit distance is replaced by the edit distance between Chinese characters, so that cases in which the same character appears at different positions are handled reasonably, the comparison between character strings is more comprehensive, the matching precision of the character strings is improved, and the similarity of the character strings can be calculated more accurately.
In one or more embodiments, the similarity calculation method may be used in a database retrieval system, where one of character string A and character string B is the text to be retrieved and the other is a text in the database. The text to be retrieved can be entered directly as characters or by voice; in the latter case the system first converts the speech into text and then executes the retrieval.
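One possible way to use such a similarity in retrieval is to score every database text against the query and keep the best matches. The sketch below is an assumption about how a retrieval system could be organized, not a description from the text: `rank_candidates` is a hypothetical helper that accepts any similarity function, and `toy_similarity` is a simple stand-in used only to make the example runnable.

```python
from typing import Callable, Iterable, List, Tuple


def rank_candidates(query: str, candidates: Iterable[str],
                    similarity: Callable[[str, str], float],
                    top_k: int = 3) -> List[Tuple[float, str]]:
    """Score each database text against the query and return the top_k best matches."""
    scored = [(similarity(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return scored[:top_k]


def toy_similarity(a: str, b: str) -> float:
    """Stand-in similarity: share of positions holding identical characters."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(1 for x, y in zip(a, b) if x == y) / longest


print(rank_candidates("abcd", ["abcf", "xbcd", "zzzz"], toy_similarity, top_k=2))
```

In an actual system the stand-in would be replaced by the sound-shape-code-based similarity, with the weights shifted toward pronunciation when the query comes from speech recognition.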
Example two
An object of the present embodiment is to provide a computing device.
A computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the memory having pre-stored therein the mapping rules of the sound-shape codes, the processor implementing the following steps when executing the program:
receiving two character strings to be compared;
reading the mapping rules of the sound-shape codes, and converting each Chinese character in the two character strings into its sound-shape code representation according to the mapping rules; the sound-shape codes comprise sound codes and shape codes, wherein the sound codes comprise numerical codes of initials and finals, and the shape codes comprise four-corner codes, structure codes and stroke numbers of Chinese characters;
calculating the edit distance between each pair of corresponding substrings of the two character strings;
and calculating the similarity of the two character strings according to the editing distance.
Example three
An object of the present embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium having stored thereon the mapping rules of the sound-shape codes and a computer program for calculating text similarity, which program, when executed by a processor, performs the steps of:
receiving two character strings to be compared;
reading the mapping rules of the sound-shape codes, and converting each Chinese character in the two character strings into its sound-shape code representation according to the mapping rules; the sound-shape codes comprise sound codes and shape codes, wherein the sound codes comprise numerical codes of initials and finals, and the shape codes comprise four-corner codes, structure codes and stroke numbers of Chinese characters;
calculating the edit distance between each pair of corresponding substrings of the two character strings;
and calculating the similarity of the two character strings according to the editing distance.
The steps involved in the second and third embodiments correspond to the first embodiment of the method, and the detailed description thereof can be found in the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any of the methods of the present invention.
One or more of the above embodiments have the following technical effects:
The existing sound-shape code representation is simplified so that it comprises the initial, final, four-corner code, structure and stroke number information of a Chinese character, and conversion from Chinese characters to sound-shape codes is more efficient. The four-corner code with its supplementary digit is adopted, which represents the glyph more accurately. Initials and finals with similar pronunciations, such as an and ang or en and eng, are represented by different codes while a similarity value is set between them, so that both the similarity and the difference between pronunciations are reflected and the similarity between Chinese characters can be identified more accurately.
The calculation of the string edit distance is improved by replacing the substitution weight of the edit distance with the edit distance between Chinese characters. Cases in which the same character appears at different positions, such as common string misalignment, are handled reasonably, so that the comparison between character strings is more comprehensive, the matching precision of the character strings is improved, and the similarity of the character strings can be calculated more accurately.
Adjustable weight is introduced in the process of calculating the editing distance between Chinese characters, the weight of a character shape part and a character sound part can be adjusted according to actual situations, for example, the proportion of the character shape part can be improved in the process of character recognition in natural scenes, and the proportion of the similarity of the character sound can be improved in the process of voice recognition, so that the method can be applied to a database retrieval system supporting Chinese character input and voice input.
Those skilled in the art will appreciate that the modules or steps of the present application described above can be implemented using general purpose computing devices, or alternatively, they can be implemented using program code executable by computing devices, such that they are stored in a storage device and executed by computing devices, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps thereof are fabricated into a single integrated circuit module. The present application is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Although the embodiments of the present application have been described with reference to the accompanying drawings, this is not intended to limit the scope of the present application, and it should be understood that those skilled in the art can make various modifications and variations without inventive effort.
Claims (10)
1. A Chinese character string similarity calculation method based on sound-shape codes, characterized in that the sound-shape codes comprise sound codes and shape codes, wherein the sound codes comprise numerical codes of initials and finals, and the shape codes comprise four-corner codes, structure codes and stroke numbers of Chinese characters; the mapping rules of the sound-shape codes and the pronunciation similarity of some initials/finals are pre-stored; the method comprises the following steps:
receiving two character strings to be compared;
reading a mapping rule of the sound-shape codes, and converting each Chinese character in the two character strings into sound-shape code representation according to the mapping rule;
calculating, by means of an edit distance algorithm and based on the pronunciation similarity of the initials and finals, the edit distance between each pair of corresponding substrings of the two character strings;
and calculating the similarity of the two character strings according to the editing distance.
2. The method for calculating the similarity of Chinese character strings based on sound-shape codes as claimed in claim 1, wherein the sound-shape code comprises 12 digits: a 2-digit initial code, a 2-digit final code, a 5-digit four-corner code, a 1-digit structure code and a 2-digit stroke number.
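The 12-digit layout can be pictured with a small sketch that splits a code string into its five fields. The field order is assumed to follow the order in which the fields are listed above, and the class name, function name and sample code value are hypothetical; the actual mapping tables from characters to digits are not reproduced here.

```python
from dataclasses import dataclass

# Field widths: initial, final, four-corner code, structure code, stroke number.
WIDTHS = (2, 2, 5, 1, 2)


@dataclass(frozen=True)
class SoundShapeCode:
    initial: str      # 2 digits
    final: str        # 2 digits
    four_corner: str  # 5 digits
    structure: str    # 1 digit
    strokes: str      # 2 digits

    def pack(self) -> str:
        """Concatenate the fields back into the 12-digit string."""
        return (self.initial + self.final + self.four_corner
                + self.structure + self.strokes)


def unpack(code: str) -> SoundShapeCode:
    """Split a 12-digit string into its fields (assumed field order)."""
    if len(code) != sum(WIDTHS) or not code.isdigit():
        raise ValueError("expected a 12-digit sound-shape code")
    parts, pos = [], 0
    for width in WIDTHS:
        parts.append(code[pos:pos + width])
        pos += width
    return SoundShapeCode(*parts)


# "010340027107" is a made-up value, not the code of a real character.
sample = unpack("010340027107")
print(sample.four_corner, sample.strokes)
```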
3. The method for calculating the similarity of Chinese character strings based on sound-shape codes as claimed in claim 1, wherein the mapping rules of the sound-shape codes comprise: the mapping rules from Chinese characters to pinyin, strokes, structures and four-corner codes, and the mapping rules from initials, finals and structures to numerical codes.
4. The method for calculating the similarity of Chinese character strings based on sound-shape codes as claimed in claim 1, wherein said calculating the edit distance between each pair of corresponding substrings of the two character strings comprises:
initializing an editing distance matrix;
and sequentially calculating the editing distance between every two Chinese characters in the two character strings according to a dynamic programming strategy, and writing the editing distance into an editing distance matrix.
5. The method for calculating the similarity of Chinese character strings based on sound-shape codes as claimed in claim 4, wherein the edit distance from the substring of length i in character string A to the substring of length j in character string B is represented by edit_char(i, j), and the dynamic programming strategy is as follows:
1) if i==0 && j==0, edit_char(i,j)=0;
2) if i==0 && j>0, edit_char(i,j)=j;
3) if i>0 && j==0, edit_char(i,j)=i;
4) if 0<i<=m && 0<j<=n,
edit_char(i,j) = min{ edit_char(i-1,j)+1, edit_char(i,j-1)+1, edit_char(i-1,j-1)+f(i,j) }, wherein f(i,j)=1 when the ith character of character string A is not equal to the jth character of character string B; otherwise, f(i,j)=0.
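The recurrence above can be sketched as a short dynamic-programming routine. This is a minimal illustration rather than the patented implementation: m and n are taken to be the lengths of strings A and B, and the `cost` parameter is a hypothetical hook standing in for f(i, j), defaulting to the 0/1 comparison of this claim.

```python
from typing import Callable, Optional


def edit_distance(a: str, b: str,
                  cost: Optional[Callable[[str, str], float]] = None) -> float:
    """Fill the (m+1) x (n+1) edit distance matrix and return its
    lower-right entry. `cost` plays the role of f(i, j)."""
    if cost is None:
        cost = lambda x, y: 0.0 if x == y else 1.0  # plain 0/1 substitution cost
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)   # rule 3: j == 0
    for j in range(1, n + 1):
        d[0][j] = float(j)   # rule 2: i == 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,                            # deletion
                d[i][j - 1] + 1,                            # insertion
                d[i - 1][j - 1] + cost(a[i - 1], b[j - 1]),  # substitution
            )
    return d[m][n]


print(edit_distance("北京大学", "南京大学"))
```

Passing a graded `cost` function instead of the default is where a per-character sound-shape distance would plug in.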
6. The method for calculating the similarity of Chinese character strings based on sound-shape codes according to claim 5, wherein
f(i,j) = α*S_sound + β*S_shape
where S_sound is the sound code distance, i.e. the pronunciation similarity, S_shape is the shape code distance, and α and β are adjustment coefficients.
7. The method for calculating the similarity of Chinese character strings based on sound-shape codes as claimed in claim 6, wherein S_shape = (four-corner code edit distance) * θ1 + (structure code comparison value) * θ2 + (stroke number difference / maximum stroke number of the two Chinese characters) * θ3, wherein θ1, θ2 and θ3 are all weight coefficients.
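One reading of this claim is that the shape distance is a weighted sum of three terms: the edit distance between the four-corner codes, a structure code comparison value, and the stroke number difference divided by the larger stroke number. The sketch below follows that reading under several assumptions that are not stated in the source: the four-corner edit distance is normalised by the code length, the structure comparison is 0 for equal codes and 1 otherwise, and the default θ values are arbitrary. All function names and sample inputs are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance between two short digit strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def shape_distance(fc_a: str, fc_b: str,
                   struct_a: str, struct_b: str,
                   strokes_a: int, strokes_b: int,
                   thetas=(0.6, 0.2, 0.2)) -> float:
    """S_shape as a weighted sum of three components.
    The theta defaults are placeholders, not values from the source."""
    t1, t2, t3 = thetas
    # Assumption: normalise the four-corner edit distance to [0, 1].
    fc_term = levenshtein(fc_a, fc_b) / max(len(fc_a), len(fc_b), 1)
    # Assumption: structure comparison is 0 if equal, 1 otherwise.
    struct_term = 0.0 if struct_a == struct_b else 1.0
    # Stroke difference relative to the larger stroke number.
    stroke_term = abs(strokes_a - strokes_b) / max(strokes_a, strokes_b, 1)
    return t1 * fc_term + t2 * struct_term + t3 * stroke_term


# Made-up field values, not taken from real characters.
print(shape_distance("40027", "40027", "1", "1", 7, 7))
```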
8. The method of claim 6, wherein calculating the similarity of two strings according to the edit distance comprises:
taking the value in the lower right corner of the edit distance matrix as the minimum edit distance, and calculating the Similarity between the two character strings A and B by adopting the following formula:
Similarity=1-distance/max(length(A),length(B))
where length(A) and length(B) represent the lengths of strings A and B, respectively, and max(·) represents the maximum function.
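As a worked example of the formula, the sketch below converts an edit distance into a similarity score. The function name is hypothetical, and returning 1.0 for two empty strings is an assumption added to avoid division by zero; the source does not cover that case.

```python
def similarity(distance: float, a: str, b: str) -> float:
    """Similarity = 1 - distance / max(length(A), length(B))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - distance / longest


# One substituted character out of four, with a 0/1 substitution cost.
print(similarity(1, "北京大学", "南京大学"))  # 0.75
```

With a graded per-character cost, the distance for a near-identical substitution is below 1, so the resulting similarity is correspondingly higher than 0.75.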
9. A computing device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method for calculating the similarity of Chinese character strings based on sound-shape codes according to any one of claims 1 to 8.
10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method for calculating the similarity of Chinese character strings based on sound-shape codes according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910146570.4A CN111209447A (en) | 2019-02-27 | 2019-02-27 | Chinese character string similarity calculation method and device based on sound-shape codes |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111209447A (en) | 2020-05-29 |
Family
ID=70789533
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910146570.4A Pending CN111209447A (en) | 2019-02-27 | 2019-02-27 | Chinese character string similarity calculation method and device based on sound-shape codes |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111209447A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103399907A (en) * | 2013-07-31 | 2013-11-20 | 深圳市华傲数据技术有限公司 | Method and device for calculating similarity of Chinese character strings on the basis of edit distance |
CN108009253A (en) * | 2017-12-05 | 2018-05-08 | 昆明理工大学 | A kind of improved character string Similar contrasts method |
Non-Patent Citations (3)
Title |
---|
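数据中国: "Chinese similarity algorithm", 《HTTPS://BLOG.CSDN.NET/CHNDATA/ARTICLE/DETAILS/41114771》 * |
Chen Zhengming et al.: "Optimization and implementation of the edit distance algorithm in Chinese text similarity calculation", 《Journal of Shaoguan University》 * |
Chen Ming et al.: "Chinese character similarity comparison algorithm based on sound-shape codes", 《Information Technology》 * |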
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111753147A (en) * | 2020-06-27 | 2020-10-09 | 百度在线网络技术(北京)有限公司 | Similarity processing method, device, server and storage medium |
US12079258B2 (en) | 2020-06-27 | 2024-09-03 | Baidu Online Network Technology (Beijing) Co., Ltd. | Similarity processing method, apparatus, server and storage medium |
CN111767422A (en) * | 2020-06-30 | 2020-10-13 | 平安国际智慧城市科技股份有限公司 | Data auditing method, device, terminal and storage medium |
CN112214535A (en) * | 2020-10-22 | 2021-01-12 | 上海明略人工智能(集团)有限公司 | Similarity calculation method and system, electronic device and storage medium |
CN112613522A (en) * | 2021-01-04 | 2021-04-06 | 重庆邮电大学 | Method for correcting recognition result of medicine taking order based on fusion font information |
CN112765422B (en) * | 2021-01-18 | 2024-04-05 | 深轻(上海)科技有限公司 | Table lookup method for two-dimensional data table |
CN112765422A (en) * | 2021-01-18 | 2021-05-07 | 深轻(上海)科技有限公司 | Table look-up method for two-dimensional data table |
CN112966475A (en) * | 2021-03-02 | 2021-06-15 | 挂号网(杭州)科技有限公司 | Character similarity determining method and device, electronic equipment and storage medium |
CN112861844A (en) * | 2021-03-30 | 2021-05-28 | 中国工商银行股份有限公司 | Service data processing method and device and server |
CN113536786A (en) * | 2021-06-22 | 2021-10-22 | 深圳价值在线信息科技股份有限公司 | Method for generating confusing Chinese characters, terminal device and computer readable storage medium |
CN113642563A (en) * | 2021-08-31 | 2021-11-12 | 平安医疗健康管理股份有限公司 | Drug use rechecking method, device, equipment and storage medium |
CN114386385A (en) * | 2022-03-22 | 2022-04-22 | 北京创新乐知网络技术有限公司 | Method, device, system and storage medium for discovering sensitive word derived vocabulary |
CN116303731B (en) * | 2023-05-22 | 2023-07-21 | 四川互慧软件有限公司 | Code matching method and device for hospital standard main data and electronic equipment |
CN116303731A (en) * | 2023-05-22 | 2023-06-23 | 四川互慧软件有限公司 | Code matching method and device for hospital standard main data and electronic equipment |
CN116757189A (en) * | 2023-08-11 | 2023-09-15 | 四川互慧软件有限公司 | Patient name disambiguation method based on Chinese character features |
CN116757189B (en) * | 2023-08-11 | 2023-10-31 | 四川互慧软件有限公司 | Patient name disambiguation method based on Chinese character features |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111209447A (en) | Chinese character string similarity calculation method and device based on sound-shape codes | |
JP7280382B2 (en) | End-to-end automatic speech recognition of digit strings | |
CN111310443B (en) | Text error correction method and system | |
CN107220235B (en) | Speech recognition error correction method and device based on artificial intelligence and storage medium | |
JP6204959B2 (en) | Speech recognition result optimization apparatus, speech recognition result optimization method, and program | |
US6738741B2 (en) | Segmentation technique increasing the active vocabulary of speech recognizers | |
US9384730B2 (en) | Pronunciation accuracy in speech recognition | |
CN111199726B (en) | Speech processing based on fine granularity mapping of speech components | |
CN112507734B (en) | Neural machine translation system based on romanized Uygur language | |
JP6778655B2 (en) | Word concatenation discriminative model learning device, word concatenation detection device, method, and program | |
CN110870004A (en) | Syllable-based automatic speech recognition | |
KR20230009564A (en) | Learning data correction method and apparatus thereof using ensemble score | |
CN113626563A (en) | Method and electronic equipment for training natural language processing model and natural language processing | |
CN114036957B (en) | Rapid semantic similarity calculation method | |
KR101483947B1 (en) | Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof | |
JP6577900B2 (en) | Phoneme error acquisition device, phoneme error acquisition method, and program | |
JP2010164918A (en) | Speech translation device and method | |
JP6301794B2 (en) | Automaton deformation device, automaton deformation method and program | |
US20190155902A1 (en) | Information generation method, information processing device, and word extraction method | |
CN113536776B (en) | Method for generating confusion statement, terminal device and computer readable storage medium | |
JP3950957B2 (en) | Language processing apparatus and method | |
CN110399608B (en) | Text error correction system and method for dialogue system based on pinyin | |
KR102299269B1 (en) | Method and apparatus for building voice database by aligning voice and script | |
Lei et al. | Data-driven lexicon expansion for Mandarin broadcast news and conversation speech recognition | |
US20070055515A1 (en) | Method for automatically matching graphic elements and phonetic elements |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20200529 |