CN112966475A

CN112966475A - Character similarity determining method and device, electronic equipment and storage medium

Info

Publication number: CN112966475A
Application number: CN202110231017.8A
Authority: CN
Inventors: 饶官军; 方成; 孟海忠; 柴鹏飞; 陈雪魁; 吴边
Original assignee: Guahao Net Hangzhou Technology Co Ltd
Current assignee: Guahao Net Hangzhou Technology Co Ltd
Priority date: 2021-03-02
Filing date: 2021-03-02
Publication date: 2021-06-15

Abstract

The invention discloses a method and a device for determining character similarity, an electronic device and a storage medium. The method comprises the following steps: determining, according to the pronunciations of a target character and a character to be determined, the coded character strings corresponding to the target character and the character to be determined respectively; determining the basic attribute information of the target character and of the character to be determined respectively, and determining the font coding tables of the target character and the character to be determined according to the basic attribute information, wherein the basic attribute information comprises the character structure, at least one basic unit forming the character and the stroke number of the character; and determining the target similarity between the target character and the character to be determined based on the coded character strings and the font coding tables. The technical scheme of the embodiment of the invention solves the prior-art problem that the similarity result is inaccurate when the similarity between two characters is determined by considering only the pronunciation or only the font of the characters, and achieves the technical effect of determining the similarity between two characters accurately, efficiently and conveniently.

Description

Character similarity determining method and device, electronic equipment and storage medium

Technical Field

The embodiment of the invention relates to the technical field of computers, in particular to a method and a device for determining character similarity, electronic equipment and a storage medium.

Background

In many fields, the similarity between two characters needs to be determined so that an erroneous character, or a character to be replaced, in a document can be identified based on the similarity, thereby improving the accuracy of the text content.

At present, the similarity between characters is mainly determined by a simple comparison of pronunciation or a simple comparison of glyph shape. The similarity determined in this way is inaccurate, which in turn makes the replacement of character content inaccurate.

Disclosure of Invention

The invention provides a method and a device for determining character similarity, electronic equipment and a storage medium, which are used for optimizing the method for determining the character similarity so as to improve the accuracy of determining the character similarity.

In a first aspect, an embodiment of the present invention provides a method for determining character similarity, where the method includes:

respectively determining coded character strings corresponding to the target character and the character to be determined according to the pronunciations of the target character and the character to be determined;

respectively determining character basic attribute information of the target characters and the characters to be determined, and determining font coding tables of the target characters and the characters to be determined according to the character basic attributes; the basic attribute information of the characters comprises a character structure, at least one basic unit forming the characters and the stroke number of the characters;

and determining the target similarity between the target characters and the characters to be determined based on the coding character string and the font coding table.

In a second aspect, an embodiment of the present invention further provides a device for determining character similarity, where the device includes:

the coded character string determining module is used for respectively determining coded character strings corresponding to the target character and the character to be determined according to the pronunciations of the target character and the character to be determined;

the font coding table determining module is used for respectively determining the character basic attribute information of the target characters and the characters to be determined and determining the font coding tables of the target characters and the characters to be determined according to the character basic attribute; the basic attribute information of the characters comprises a character structure, at least one basic unit forming the characters and the stroke number of the characters;

and the target similarity value determining module is used for determining the target similarity between the target characters and the characters to be determined based on the coding character string and the font coding table.

In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:

one or more processors;

a storage device for storing one or more programs,

when the one or more programs are executed by the one or more processors, the one or more processors implement the text similarity determination method according to any one of the embodiments of the present invention.

In a fourth aspect, the present invention further provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform the text similarity determination method according to any one of the embodiments of the present invention.

According to the technical scheme of the embodiment of the invention, the coded character strings of the target character and the character to be determined are determined according to their pronunciations, the font coding tables of the target character and the character to be determined are obtained according to their basic attribute information, and the coded character strings and the font coding tables of the target character and the character to be determined are considered together to obtain the similarity between the target character and the character to be determined.

Drawings

In order to more clearly illustrate the technical solutions of the exemplary embodiments of the present invention, a brief description is given below of the drawings used in describing the embodiments. It should be clear that the described figures are only views of some of the embodiments of the invention to be described, not all, and that for a person skilled in the art, other figures can be derived from these figures without inventive effort.

Fig. 1 is a schematic flow chart of a text similarity determination method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a phonetic code string according to an embodiment of the present invention;

Fig. 3 is a diagram of a textual structure and a structural identifier provided in accordance with an embodiment of the present invention;

Fig. 4 is a diagram of a text font code table according to an embodiment of the present invention;

Fig. 5 is a schematic flow chart illustrating a method for determining text similarity according to a second embodiment of the present invention;

Fig. 6 is a schematic diagram illustrating a similarity fusion comparison between pronunciation and font according to a second embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a text similarity determination apparatus according to a third embodiment of the present invention;

Fig. 8 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Example one

Fig. 1 is a flowchart of a character similarity determining method according to an embodiment of the present invention. This embodiment is applicable to determining the similarity between two characters. The method may be executed by a character similarity determining apparatus, which may be implemented in software and/or hardware. The hardware may be an electronic device such as a terminal or a server, and the technical solution may be carried out by a terminal, by a server, or by a terminal and a server in cooperation.

Before the technical solution of the present embodiment is introduced, an application scenario is described by way of example. While a user edits characters or sentences, the method provided by this embodiment can be used to determine whether a sentence contains wrong characters and then to display the correct characters; it can also be used to determine whether two characters are similar. The method is therefore suitable for any scenario that requires text error detection or the determination of the similarity between two characters.

As shown in fig. 1, the method of this embodiment includes:

S110, respectively determining the coded character strings corresponding to the target character and the character to be determined according to the pronunciations of the target character and the character to be determined.

The current character can be used as the target character, and the character to be determined may be a candidate character that matches the current character. Alternatively, both the target character and the character to be determined are characters input by the user. The roles differ with the application scenario. For example, if the method is integrated in a client, the user can enter any two characters on the display interface; one of them is taken as the target character and the other as the character to be determined, and their similarity is determined by the integrated method. If the application scenario is document editing, the target character may be the character currently being edited, and the characters to be determined are a series of characters associated with the current character. Each character has a pronunciation, which is usually composed of an initial, a final and a tone. A character string comparison table for initials, finals and tones can be established in advance, and the coded character strings of the target character and the character to be determined can be determined respectively according to the pronunciation of each character and the comparison table.

Specifically, after the target character and the character to be determined are obtained, the code character string of the target character and the code character string of the character to be determined can be determined according to the pronunciation of the target character and the character to be determined.

In this embodiment, determining the coded character strings corresponding to the target character and the character to be determined according to their pronunciations includes: determining a target coded character string of the target character according to a pre-established pronunciation coding table and the pronunciation of the target character; and determining a to-be-determined coded character string of the character to be determined according to the pre-established pronunciation coding table and the pronunciation of the character to be determined. The pronunciation coding table comprises pronunciation features and the pronunciation codes corresponding to the pronunciation features, wherein the pronunciation features comprise letters corresponding to initials, letters corresponding to finals, and tones. The length of the target coded character string is consistent with the length of the to-be-determined coded character string.

The pronunciation coding table thus contains a code character for each initial, a code character for each final and a code character for each tone. A character usually includes an initial, at least one final and a tone. In order to determine the similarity between two characters accurately, the coded character strings of the two characters are determined respectively, and in order to compare them, the two coded character strings should have the same length. If the lengths are not consistent, zero padding is performed at the corresponding positions so that the coded character strings of the two characters have the same length.

Specifically, a target code character string of the target character and a code character string to be determined of the character to be determined are respectively determined according to a pre-created character sound code table, the character sound of the target character and the character sound of the character to be determined.

For example, the pronunciation features (initials, finals and tones) and their code characters are shown in Fig. 2. The coded character string can be determined from the pronunciation as follows: for the four characters pronounced "xiang gang ru bin", the coded character string of each character can be determined from the code comparison table of Fig. 2 as "ndu1 lw02 re02 ar01". It should be noted that, since "xiang" includes an initial, two finals and a tone, its coded character string has length 4; each of the other three characters includes only an initial, one final and a tone, so a 0 is padded in the vacant final position to keep the length of every coded character string the same. The coded character string of any character can be determined in this manner.
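The encoding and zero padding described above can be sketched as follows. This is only an illustration: the table entries are inferred from the example strings "ndu1", "re02" and "ar01" and do not reproduce the full table of Fig. 2, and the names `INITIAL_CODES`, `FINAL_CODES` and `encode_syllable` are hypothetical.

```python
# Hypothetical excerpt of a pronunciation coding table. Only the entries
# implied by the example above are listed; the full table is that of Fig. 2.
INITIAL_CODES = {"x": "n", "r": "r", "b": "a"}
FINAL_CODES = {"i": "d", "ang": "u", "u": "e", "in": "r"}
MAX_FINALS = 2  # assumption: at most two finals per character
PAD = "0"       # padding symbol for a vacant final position


def encode_syllable(initial: str, finals: list[str], tone: int) -> str:
    """Return initial code + final codes (zero padded) + tone, length 4."""
    if not 1 <= len(finals) <= MAX_FINALS:
        raise ValueError("expected one or two finals")
    final_part = "".join(FINAL_CODES[f] for f in finals)
    final_part = final_part.ljust(MAX_FINALS, PAD)  # pad the vacant position
    return INITIAL_CODES[initial] + final_part + str(tone)


codes = [
    encode_syllable("x", ["i", "ang"], 1),
    encode_syllable("r", ["u"], 2),
    encode_syllable("b", ["in"], 1),
]
print(codes)  # ['ndu1', 're02', 'ar01']
```

Padding between the finals and the tone keeps the tone at the same position in every string, which is what allows the later position-by-position comparison.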

S120, respectively determining the basic attribute information of the target character and the character to be determined, and determining the font coding tables of the target character and the character to be determined according to the basic attribute information.

The basic attribute information of a character comprises the character structure, at least one basic unit forming the character and the stroke number of the character. The character structure can be a left-right structure, an up-down structure, a left-middle-right structure, an up-middle-down structure, a full enclosure structure, and so on; the structures and their identifiers are shown in Fig. 3. Each character may be composed of a plurality of basic units, a basic unit being the basic component into which a character is split. For example, the character "yebin" has three basic units, namely "wood", "". The stroke number is the number of strokes needed to write the character; for example, the stroke number of "yebin" is 11. Based on the structure of the character, the at least one basic unit constituting the character and the stroke number of the character, the font code corresponding to the character can be determined.

Specifically, the character structures of the target character and the character to be determined, at least one basic unit constituting the character, and the number of strokes may be determined, respectively, so as to obtain the font codes of the target character and the character to be determined.

In this embodiment, determining basic text attribute information of a target text and a text to be determined, and determining a font code table of the target text and the text to be determined according to the basic text attribute includes: determining a target character structure of a target character, a target character unit forming the target character and the target stroke number of the target character according to a preset character basic attribute table; determining a character structure to be determined of the character to be determined, a character unit to be determined forming the character to be determined and the number of strokes to be determined of the character to be determined; obtaining a target font coding table corresponding to the target character according to the character basic attribute table, the target character structure, the target character unit and the target stroke number; obtaining a font coding table to be determined corresponding to the characters to be determined according to the character attribute table, the character structure to be determined, the character unit to be determined and the number of strokes to be determined; the character attribute table comprises font code symbols corresponding to character structures, basic units forming each character and stroke number.

The character attribute table may be preset, and it includes the font code symbols corresponding to the character structure, the basic units constituting each character, and the stroke number of the character. The attributes retrieved from the character attribute table for the target character form the target font coding table; correspondingly, the font coding table determined according to the basic attribute information of the character to be determined is used as the to-be-determined font coding table. A font coding table includes the character, the stroke number of the character, the structure of the character and the basic units forming the character. Based on the font coding tables, the specific content of the target character and the character to be determined can be determined.

To describe the technical solution of this embodiment clearly, the specific content of the font coding table is described by taking the characters "yebin" and "fir" as examples. Based on the predetermined character attribute table and the character "yebin", the stroke number is determined to be 11, the structure of the character is a left-right structure, and the basic units of the character are "wood", ""; the resulting content of the font coding table is shown in Fig. 4. Correspondingly, the content of the font coding table for the character "fir" is also shown in Fig. 4.

On the basis of the above technical solutions, it should be noted that a certain number of basic units, for example 485 basic units, may be selected in advance from all Chinese characters (88937 characters) or symbols. That is, every character can be formed from one or more basic units, and the basic units forming the Chinese characters, the structures of the Chinese characters and the stroke numbers of the Chinese characters can be compiled into a character basic attribute table.

It should be further noted that the character basic attribute table may include, for each of the 88937 characters, its stroke number, character structure and basic constituent units. Alternatively, the basic attribute table may include only the character structures and the basic constituent units from which the 88937 characters are formed, in which case the composition of the target character and of the character to be determined is determined according to the basic constituent units in the table.

Specifically, if the character basic attribute table comprises all the Chinese characters together with the basic constituent units, the structure and the stroke number of each Chinese character, the font coding tables of the target character and the character to be determined can be retrieved directly from the character basic attribute table. If the character basic attribute table only comprises the basic constituent units from which all characters are formed, the basic constituent units and the structures of the target character and the character to be determined can be determined respectively, the stroke numbers of the characters can be determined at the same time, and this information can be used as the font coding table of the corresponding character. That is, the font coding table includes the basic constituent units of the character, the stroke number of the character and the code symbol of the character structure.
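A font coding table entry of the kind described above can be pictured as a small record holding the stroke number, a structure identifier and the basic units. The sketch below is a minimal illustration under stated assumptions: the structure identifiers `"LR"` and `"TB"` are hypothetical stand-ins for the identifiers of Fig. 3, the two sample characters and their decomposition are chosen only for illustration, and `BASIC_ATTRIBUTES` and `font_coding_table` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlyphEntry:
    char: str               # the character itself
    strokes: int            # stroke number of the character
    structure: str          # structure identifier (hypothetical symbols)
    units: tuple[str, ...]  # basic units forming the character


# Hypothetical character basic attribute table with two sample entries.
BASIC_ATTRIBUTES = {
    "明": GlyphEntry("明", 8, "LR", ("日", "月")),  # left-right structure
    "昌": GlyphEntry("昌", 8, "TB", ("日", "日")),  # up-down structure
}


def font_coding_table(char: str) -> GlyphEntry:
    """Retrieve the font coding table entry of a character from the table."""
    try:
        return BASIC_ATTRIBUTES[char]
    except KeyError:
        raise KeyError(f"no basic attribute entry for {char!r}") from None


target = font_coding_table("明")
candidate = font_coding_table("昌")
print(target.structure, target.units, target.strokes)
print(candidate.structure, candidate.units, candidate.strokes)
```

This corresponds to the first case in the text, where the table already holds every character; in the second case the entry would instead be assembled from a table of basic units.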

S130, determining the target similarity between the target character and the character to be determined based on the coding character string and the font coding table.

The target similarity is determined by the similarity of the coding character string and the similarity of the font coding table, namely the similarity of the character pronunciation and the similarity of the font. The similarity of the coded character string is determined according to the character pronunciation characteristics of the characters, and the similarity of the font code table is determined according to the font code table of the target characters and the font code table of the characters to be determined.

Specifically, the font similarity can be determined according to the font coding tables of the target character and the character to be determined, and the pronunciation similarity can be determined according to the pronunciation features of the target character and the character to be determined and the coded character strings derived from them. The target similarity between the target character and the character to be determined is then obtained from the similarity of the coded character strings and the similarity of the font coding tables.

According to the technical scheme of the embodiment of the invention, the coded character strings of the target character and the character to be determined are determined from their pronunciations, the font coding tables of the target character and the character to be determined are obtained according to their basic attribute information, and the coded character strings and the font coding tables are considered together to obtain the similarity between the target character and the character to be determined. This solves the prior-art problem that the similarity result is inaccurate when the similarity between two characters is determined by considering only the pronunciation or only the font of the characters, and achieves the technical effect of determining the similarity between two characters accurately, efficiently and conveniently.

Example two

Fig. 5 is a flowchart illustrating a character similarity determining method according to a second embodiment of the present invention. On the basis of the foregoing embodiment, this embodiment refines the step of "determining the target similarity between the target character and the character to be determined based on the coded character string and the font coding table"; the specific implementation is described in the technical solution of this embodiment. Technical terms that are the same as or correspond to those of the above embodiment are not explained again here.

As shown in fig. 5, the method includes:

S210, according to the character pronunciation of the target character and the character to be determined, determining the code character strings corresponding to the target character and the character to be determined respectively.

S220, determining basic character attribute information of the target characters and the characters to be determined respectively, and determining font coding tables of the target characters and the characters to be determined according to the basic character attributes.

S230, determining the pronunciation similarity of the target character and the character to be determined according to the target coded character string of the target character and the to-be-determined coded character string of the character to be determined.

The character string determined according to the pronunciation features of the target character, namely its initial and/or final, is used as the target coded character string. Likewise, the pronunciation features of the character to be determined can be determined, and the to-be-determined coded character string is obtained according to those pronunciation features.

Specifically, the pronunciation similarity between the target character and the character to be determined can be obtained by comparing the to-be-determined coded character string with the target coded character string.

In this embodiment, determining the pronunciation similarity between the target character and the character to be determined from their coded character strings may be: for the pronunciation codes at the same position in the target coded character string and the to-be-determined coded character string, determining the to-be-processed edit distance of the current position according to the two pronunciation codes at that position; and determining the pronunciation similarity of the target character and the character to be determined according to the to-be-processed edit distances of all positions.

In order to determine the similarity of the coded character strings of two characters, the two strings must have the same length; if the lengths differ, the shorter coded character string is zero padded to the same number of positions. For example, in the pronunciation "xiang gang ru bin", "xiang" includes an initial, two finals and a tone, whereas "ru" includes an initial, one final and a tone. The coded character strings of the two characters are therefore brought to the length of the longer one. For example, the coded character string of "xiang" is "ndu1" and the coded character string of "ru" is "re02", where a zero is padded in place of the missing final.

Here, the edit distance, also called the Levenshtein distance, is a quantitative measure of the degree of difference between two character strings. It is the minimum number of edit operations required to change one string into the other. In this embodiment it is evaluated position by position: for the same position, if the two characters are the same, the edit distance is 0, and if the two characters are different, the edit distance is 1.

Specifically, the coded character strings of the target character and the character to be determined have the same length, so when the similarity of the coded character strings of two characters is determined, the pronunciation codes at the same position can be compared one by one. Continuing the above example, at the first position "n" and "r" are different, so the edit distance of that position is 1. The edit distances of the second, third and fourth positions are determined in the same way, and they are 1, 1 and 1 respectively. Based on the edit distance of each position, the pronunciation similarity of the target character and the character to be determined can be determined.
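The position-by-position comparison above can be sketched in a few lines. The function name `positionwise_distances` is hypothetical, and the two code strings are the ones given in the example.

```python
def positionwise_distances(target_code: str, candidate_code: str) -> list[int]:
    """Return 0 for each position where the codes match and 1 otherwise."""
    if len(target_code) != len(candidate_code):
        # The text requires zero padding to equal length before comparison.
        raise ValueError("coded character strings must have the same length")
    return [0 if a == b else 1 for a, b in zip(target_code, candidate_code)]


print(positionwise_distances("ndu1", "re02"))  # [1, 1, 1, 1]
print(positionwise_distances("ndu1", "ndu2"))  # [0, 0, 0, 1]
```

Because the strings are aligned and of equal length, no insertions or deletions are considered here; each position contributes only a substitution cost.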

On the basis of the above technical solution, it should be noted that in practical applications users may fail to distinguish certain similar sounds, for example front and back nasal finals, or the initials z and zh, c and ch, and s and sh, whose difference is small. The edit distance between the code characters of such similar pronunciations can be set to a relatively small value, so as to improve the accuracy of the determination.

Optionally, determining the to-be-processed edit distance of the current position according to the two pronunciation codes at the current position includes at least one of the following: if the two pronunciation codes at the current position are the same, determining that the to-be-processed edit distance of the current position is a first preset value; if the two pronunciation codes at the current position belong to a preset group of similar pronunciation codes (the pronunciation codes within the first preset edit distance), determining that the to-be-processed edit distance of the current position is a second preset value; and if the two pronunciation codes at the current position are different and do not belong to such a group, determining that the to-be-processed edit distance of the current position is a third preset value.

Specifically, if the two pronunciation codes at the current position are the same, that is, the code characters at the current position are identical, the to-be-processed edit distance of the current position is the first preset value, which is optionally 0. Similar pronunciation codes can be set in advance and treated as the pronunciation codes within the first preset edit distance; if the two pronunciation codes at the current position are such similar codes, the to-be-processed edit distance of the current position is the second preset value, which is optionally 0.3. If the two pronunciation codes at the current position are different and are not pronunciation codes within the first preset edit distance, the to-be-processed edit distance of the current position is the third preset value, which is optionally 1.
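The three-valued distance rule can be sketched as below, using the optional preset values 0, 0.3 and 1 from the text. The set `SIMILAR_CODE_PAIRS` and the code symbols in it are hypothetical placeholders, since the actual code characters for z/zh, c/ch and s/sh come from the table of Fig. 2.

```python
FIRST_PRESET = 0.0   # identical pronunciation codes
SECOND_PRESET = 0.3  # codes preset as similar
THIRD_PRESET = 1.0   # any other pair of different codes

# Hypothetical similar pairs: lower-case and upper-case letters stand in for
# the real code characters of z/zh, c/ch and s/sh.
SIMILAR_CODE_PAIRS = {
    frozenset(("z", "Z")),
    frozenset(("c", "C")),
    frozenset(("s", "S")),
}


def position_distance(code_a: str, code_b: str) -> float:
    """Return the to-be-processed edit distance for one position."""
    if code_a == code_b:
        return FIRST_PRESET
    if frozenset((code_a, code_b)) in SIMILAR_CODE_PAIRS:
        return SECOND_PRESET
    return THIRD_PRESET


print(position_distance("z", "z"))  # 0.0
print(position_distance("z", "Z"))  # 0.3
print(position_distance("z", "c"))  # 1.0
```

Storing each pair as a `frozenset` makes the lookup symmetric, so the order of the two codes does not matter.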

For example, in this embodiment, the word-sound similarity between the target word and the word to be determined is determined by calculating the edit distance. After the editing distance corresponding to each position is determined, the similarity is determined according to the editing distance corresponding to each position.

Optionally, the determining the degree of similarity between the target text and the text to be determined according to the editing distance to be processed corresponding to each position includes: and determining the character-sound similarity of the target character and the character to be determined according to the distance to be edited and the position length of each position.

The editing distance corresponding to each position can be used as the distance to be edited. The position length is the length of the character-pronunciation code string.

Illustratively, the character-sound similarity is denoted $S_0(w_1, w_2)$, where $w_1$ represents the target character and $w_2$ represents the character to be determined:

$$S_0(w_1, w_2) = 1 - \frac{\text{sum of the minimum edit distances at each position}}{4}$$

Based on this formula, the character-sound similarity between the target character and the character to be determined can be calculated.
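The calculation can be sketched as follows, under the assumption that both coded character strings have the same length (4 in the formula above, so dividing by the string length gives the same result). The similar pair in `SIMILAR_PAIRS` and the letter codes are hypothetical.

```python
# Hypothetical pairs of codes treated as similar; illustration only.
SIMILAR_PAIRS = {frozenset(("C", "X"))}


def position_distance(code_a, code_b):
    """Editing distance to be processed for one position: 0, 0.3 or 1."""
    if code_a == code_b:
        return 0.0
    if frozenset((code_a, code_b)) in SIMILAR_PAIRS:
        return 0.3
    return 1.0


def pronunciation_similarity(target_codes, candidate_codes):
    """1 minus the summed per-position distances over the position length."""
    if not target_codes or len(target_codes) != len(candidate_codes):
        raise ValueError("coded character strings must be non-empty and equal in length")
    total = sum(position_distance(a, b)
                for a, b in zip(target_codes, candidate_codes))
    return 1.0 - total / len(target_codes)
```

For two length-4 strings that differ at one position with non-similar codes, this sketch gives 1 - 1/4 = 0.75.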

S240, determining the font similarity of the target character and the character to be determined according to the target font coding table of the target character and the to-be-determined font coding table of the character to be determined.

On the basis of each technical scheme, the font coding table comprises the stroke number of the character, the code symbol corresponding to the character structure, and the basic constituent units of the character. Therefore, the font similarity of the target character and the character to be determined can be determined from these three aspects.

In this embodiment, determining the font similarity between the target character and the character to be determined according to the target font coding table of the target character and the to-be-determined font coding table of the character to be determined includes: determining a structural similarity value according to the font structures of the target character and the character to be determined; determining a unit similarity value according to the target basic units of the target character and the to-be-determined basic units of the character to be determined; determining a stroke similarity value according to the target stroke number of the target character and the to-be-determined stroke number of the character to be determined; and determining the font similarity of the target character and the character to be determined according to the structure similarity value, the unit similarity value and the stroke similarity value.

The structure similarity values for identical and for different font structures can be preset; optionally, when the font structures are the same, the structure similarity value is 1, and when they are different, it is 0. The unit similarity value is determined according to the basic constituent units of each character; optionally, the basic constituent units of the two characters, the number of basic constituent units they share, and the larger of the two characters' numbers of basic constituent units are obtained, and the unit similarity value is determined from the shared number and that maximum. The stroke similarity value, as the name suggests, is determined according to the stroke numbers.

Specifically, if the font similarity between two characters needs to be determined, the font structures of the target character and the character to be determined can be obtained; if the font structures are the same, the structure similarity value is 1, and otherwise it is 0. At the same time, the basic constituent units of the target character and of the character to be determined, and their numbers, are respectively determined, together with the number of basic constituent units the two characters share. From the shared number and the maximum number of basic constituent units, the unit similarity value between the two characters can be calculated. Likewise, from the stroke numbers of the character to be determined and the target character in the font coding tables, the stroke similarity value can be determined. By processing the structure similarity value, the unit similarity value and the stroke similarity value, the font similarity of the target character and the character to be determined can be obtained.

In this embodiment, the determining the font similarity between the target text and the text to be determined according to the structure similarity, the unit similarity, and the stroke similarity may be:

and determining the font similarity of the target character and the character to be determined according to the weight corresponding to the structure similarity value and the structure similarity, the weight corresponding to the unit similarity value and the unit similarity value, and the weight corresponding to the stroke similarity value and the stroke similarity value.

That is, a weight value may be set for each similarity value, and the font similarity between the target character and the character to be determined may be determined according to these weight values.

Illustratively, the font similarity may be denoted $S_1(w_1, w_2)$, where $w_1$ represents the target character and $w_2$ represents the character to be determined:

$$S_1(w_1, w_2) = S_{11}(w_1, w_2) + S_{12}(w_1, w_2) + S_{13}(w_1, w_2)$$

where $S_{11}(w_1, w_2)$ represents the unit similarity value, $S_{12}(w_1, w_2)$ represents the structural similarity value, and $S_{13}(w_1, w_2)$ represents the stroke similarity value:

$$S_{11}(w_1, w_2) = \frac{\text{number of basic constituent units shared by } w_1 \text{ and } w_2}{\max(\text{number of basic constituent units of } w_1,\ \text{number of basic constituent units of } w_2)}$$

$$S_{12}(w_1, w_2) = \begin{cases} 1, & \text{if the font structures of } w_1 \text{ and } w_2 \text{ are identical} \\ 0, & \text{otherwise} \end{cases}$$

$$S_{13}(w_1, w_2) = 1 - \frac{\left|\text{stroke number of } w_1 - \text{stroke number of } w_2\right|}{\text{stroke number of } w_1}$$

Based on the above manner, the font similarity of the two characters can be determined.
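The three component values and their unweighted sum can be sketched as below. The `GlyphInfo` container and the structure label "left-right" are hypothetical names introduced for illustration, and counting shared units as a multiset intersection is an assumption, since the text only refers to the number of identical basic units.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GlyphInfo:
    structure: str  # font structure label (hypothetical naming)
    units: tuple    # basic constituent units of the character
    strokes: int    # stroke number of the character


def unit_similarity(a, b):
    # Shared units counted with multiplicity (assumption of this sketch).
    shared = sum((Counter(a.units) & Counter(b.units)).values())
    return shared / max(len(a.units), len(b.units))


def structure_similarity(a, b):
    return 1.0 if a.structure == b.structure else 0.0


def stroke_similarity(a, b):
    # Normalised by the stroke number of the first (target) character.
    return 1.0 - abs(a.strokes - b.strokes) / a.strokes


def glyph_similarity(a, b):
    return (unit_similarity(a, b)
            + structure_similarity(a, b)
            + stroke_similarity(a, b))


ming = GlyphInfo("left-right", ("日", "月"), 8)  # 明
peng = GlyphInfo("left-right", ("月", "月"), 8)  # 朋
print(glyph_similarity(ming, peng))  # 0.5 + 1.0 + 1.0 = 2.5
```

Note that the unweighted sum can exceed 1, and the stroke term as stated can become negative when the second character has more than twice the strokes of the first.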

Of course, in order to improve the accuracy of the font similarity, a weight may be set for each similarity value, so that $S_1(w_1, w_2) = S_{11}(w_1, w_2) + S_{12}(w_1, w_2) + S_{13}(w_1, w_2)$ can be expressed as:

$$S_1(w_1, w_2) = \alpha \, S_{11}(w_1, w_2) + \beta \, S_{12}(w_1, w_2) + \gamma \, S_{13}(w_1, w_2)$$

where $\alpha + \beta + \gamma = 1$, and the user may set the weight value corresponding to each similarity value according to actual needs.
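A short sketch of the weighted form follows. The function name and the example weights are hypothetical; the text leaves the actual weights to the user and only requires that they sum to 1.

```python
import math


def weighted_glyph_similarity(unit_sim, structure_sim, stroke_sim,
                              alpha, beta, gamma):
    """Weighted font similarity: alpha*S11 + beta*S12 + gamma*S13."""
    if not math.isclose(alpha + beta + gamma, 1.0):
        raise ValueError("alpha + beta + gamma must equal 1")
    return alpha * unit_sim + beta * structure_sim + gamma * stroke_sim


# Example weights chosen only for illustration.
print(weighted_glyph_similarity(0.5, 1.0, 1.0, alpha=0.5, beta=0.25, gamma=0.25))
```

Because the weights sum to 1, the weighted result stays within [0, 1] whenever each component does.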

And S250, determining the target similarity of the target character and the character to be determined according to the character pronunciation similarity and the character pattern similarity.

Specifically, after the character sound similarity and the character pattern similarity are obtained, the target similarity between the target character and the character to be determined is obtained by comprehensively processing the character sound similarity and the character pattern similarity.

Optionally, the determining the target similarity between the target text and the text to be determined according to the word-sound similarity and the font similarity includes: and determining the target similarity of the target character and the character to be determined according to the character sound weight, the character sound similarity, the character pattern weight and the character pattern similarity.

Specifically, the weight of the character-sound similarity and the weight of the font similarity may be set separately. One component of the target similarity is obtained from the character-sound weight and the character-sound similarity, and the other component is obtained from the font weight and the font similarity.

Optionally, the determining the target similarity between the target text and the text to be determined according to the pronunciation weight, the pronunciation similarity, the font weight, and the font similarity includes: obtaining the target word sound similarity according to the square of the word sound similarity and the word sound weight; obtaining the similarity of the target font according to the square of the similarity of the font and the weight of the font; and determining the target similarity according to the target character pronunciation similarity and the target character form similarity.

The character-sound weight and the font weight can be the same or different, and a user can set them according to actual requirements; optionally, both are 0.5. After the target character-sound similarity and the target font similarity are obtained, their sum can be calculated, and the square root of the sum is taken to obtain the target similarity. The method improves the accuracy of determining the character similarity.

In this embodiment, the fusion function of the target similarity may be:

$$f(w_1, w_2) = \sqrt{S_0^2(w_1, w_2) + S_1^2(w_1, w_2)}$$

where $w_1$ and $w_2$ are the two characters. In Fig. 6, the left side shows the target similarity corresponding to one of the forms that a fusion function of the pronunciation and font similarity scores can take, and the right side shows the target similarity corresponding to the pronunciation and font similarity fusion function constructed in this embodiment. As can be seen from Fig. 6, the function constructed in this embodiment conforms better to the logic of human judgment on the similarity of Chinese characters, so that the method can improve the accuracy of judging the similarity of two characters.
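The fusion step can be sketched as follows. Applying each weight as a multiplier on the squared similarity is an assumption of this sketch; the text only states that each target component is obtained from the square of the similarity and its weight. With both weights equal to 1 the function reduces to the formula as written above, and with both equal to 0.5 it matches the optional weights mentioned in the text.

```python
import math


def target_similarity(pron_sim, glyph_sim, pron_weight=1.0, glyph_weight=1.0):
    """Fuse character-sound and font similarity by a root of weighted squares."""
    target_pron = pron_weight * pron_sim ** 2     # target character-sound similarity
    target_glyph = glyph_weight * glyph_sim ** 2  # target font similarity
    return math.sqrt(target_pron + target_glyph)


# With weights of 0.5 each, two similarities in [0, 1] fuse to a value in [0, 1].
print(target_similarity(1.0, 1.0, pron_weight=0.5, glyph_weight=0.5))
```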

According to the technical scheme of the embodiment of the invention, the character-sound similarity value and the character-shape similarity value of the target character and the character to be determined are respectively determined, and the character-sound similarity value and the character-shape similarity value are subjected to fusion processing based on the preset fusion function, so that the target similarity value between the target character and the character to be determined is obtained, and the technical effects of accuracy and convenience in determining the character similarity value are improved.

Example three

Fig. 7 is a schematic structural diagram of a text similarity determining apparatus according to a third embodiment of the present invention, where the apparatus includes: an encoding string determination module 310, a glyph encoding table determination module 320, and a target similarity value determination module 330.

The encoding character string determining module 310 is configured to determine, according to the word sounds of the target word and the word to be determined, an encoding character string corresponding to the target word and the word to be determined, respectively; a font code table determining module 320, configured to determine basic character attribute information of the target character and the character to be determined, and determine a font code table of the target character and the character to be determined according to the basic character attribute; the basic attribute information of the characters comprises a character structure, at least one basic unit forming the characters and the stroke number of the characters; a target similarity value determining module 330, configured to determine a target similarity between the target word and the word to be determined based on the encoding character string and the font encoding table.
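As a rough sketch of how the three modules could be wired together in software, the class below composes three callables that stand in for modules 310, 320 and 330. The class name, field names and signatures are hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class CharacterSimilarityDevice:
    # Stand-in for the encoding character string determining module 310.
    encode_pronunciation: Callable[[str], str]
    # Stand-in for the font code table determining module 320.
    encode_glyph: Callable[[str], Any]
    # Stand-in for the target similarity value determining module 330.
    score: Callable[[str, str, Any, Any], float]

    def similarity(self, target: str, candidate: str) -> float:
        """Run the three modules in order and return the target similarity."""
        return self.score(
            self.encode_pronunciation(target),
            self.encode_pronunciation(candidate),
            self.encode_glyph(target),
            self.encode_glyph(candidate),
        )
```

Passing the modules in as callables keeps the division by functional logic while allowing each part to be replaced independently.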

On the basis of the above technical solution, the encoding character string determining module includes:

the target coding character string determining unit is used for determining a target coding character string of a target character according to a pre-established character sound coding table and the character sound of the target character; the to-be-determined coding character string determining unit is used for determining the to-be-determined coding character string of the to-be-determined character according to a pre-established character sound coding table and the character sound of the to-be-determined character; the character pronunciation coding table comprises character pronunciation characteristics and character pronunciation codes corresponding to the character pronunciation characteristics, wherein the character pronunciation characteristics comprise letters corresponding to initials, letters corresponding to finals and tones; the length of the target code character string is consistent with the length of the code character string to be determined.

On the basis of the above technical solutions, the glyph encoding table determining module is further configured to: determining a target character structure of the target character, a target character unit forming the target character and a target stroke number of the target character according to a preset character basic attribute table; determining a character structure to be determined of the character to be determined, a character unit to be determined forming the character to be determined and the number of strokes to be determined of the character to be determined; obtaining a target font coding table corresponding to the target character according to the character basic attribute table, the target character structure, the target character unit and the target stroke number; obtaining a font coding table to be determined corresponding to the characters to be determined according to the character attribute table, the character structure to be determined, the character unit to be determined and the number of strokes to be determined; the character attribute table comprises font code symbols corresponding to character structures, basic units forming each character and stroke number.
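A lookup of this kind can be sketched as below. The structure code symbols in `STRUCTURE_CODES`, the table layout and the dictionary keys are hypothetical; only the structure, constituent units and stroke numbers of the three sample characters are ordinary facts about those characters.

```python
# Hypothetical font code symbols for character structures.
STRUCTURE_CODES = {"left-right": "1", "top-bottom": "2"}

# Hypothetical excerpt of a character basic attribute table:
# character -> (structure, basic constituent units, stroke number).
BASIC_ATTRIBUTES = {
    "明": ("left-right", ("日", "月"), 8),
    "好": ("left-right", ("女", "子"), 6),
    "思": ("top-bottom", ("田", "心"), 9),
}


def glyph_encoding(char):
    """Build a font coding table entry for one character."""
    structure, units, strokes = BASIC_ATTRIBUTES[char]
    return {
        "structure_code": STRUCTURE_CODES[structure],
        "units": units,
        "strokes": strokes,
    }


print(glyph_encoding("明"))
```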

On the basis of the above technical solutions, the target similarity determining module includes:

the character-sound similarity determining unit is used for determining the character-sound similarity of the target character and the character to be determined according to the target coded character string of the target character and the to-be-determined coded character string of the character to be determined; the font similarity determining unit is used for determining the font similarity between the target character and the character to be determined according to the target font coding table of the target character and the to-be-determined font coding table of the character to be determined; and the target similarity determining unit is used for determining the target similarity between the target character and the character to be determined according to the character-sound similarity and the font similarity.

On the basis of the above technical solutions, the word-pronunciation similarity determining unit includes:

the editing distance determining subunit is used for determining the editing distance to be processed corresponding to the current position according to the two character-sound codes corresponding to the current position aiming at the character-sound codes corresponding to the same position in the target coded character string and the coded character string to be determined; and the character-sound similarity determining subunit is used for determining the character-sound similarity of the target character and the character to be determined according to the editing distance to be processed corresponding to each position.

On the basis of the above technical solutions, the edit distance determining subunit is further configured to:

if the two phonetic codes corresponding to the current position are the same, determining that the editing distance to be processed of the current position is a first preset value; if the two character-pronunciation codes corresponding to the current position are the character-pronunciation codes in the first preset editing distance, determining that the editing distance to be processed of the current position is a second preset value; and if the two phonetic codes corresponding to the current position are different and are not the phonetic codes in the first preset editing distance, determining that the editing distance to be processed at the current position is a third preset value.

On the basis of the above technical solutions, the phonetic similarity determining subunit is configured to: and determining the character-sound similarity of the target character and the character to be determined according to the distance to be edited and the position length of each position.

On the basis of the above technical solutions, the glyph encoding table includes a glyph structure, basic units constituting a character, and a character stroke number, and the glyph similarity determining unit includes:

a structure similarity value determining subunit, configured to determine a structure similarity value according to the target character and the font structure of the character to be determined; a unit similarity value determining subunit, configured to determine a unit similarity value according to the target basic unit of the target text and the basic unit of the text to be determined; the stroke similarity value determining subunit is used for determining a stroke similarity value according to the target stroke number of the target character and the number of strokes to be determined of the character to be determined; and the font similarity value determining subunit is used for determining the font similarity between the target character and the character to be determined according to the structure similarity value, the unit similarity value and the stroke similarity value.

On the basis of the above technical solutions, the target similarity value determining unit is configured to: and determining the target similarity of the target character and the character to be determined according to the character sound weight, the character sound similarity, the character pattern weight and the character pattern similarity.

On the basis of the above technical solutions, the target similarity determination unit includes:

the target character-sound similarity determining subunit is used for obtaining the target character-sound similarity according to the square of the character-sound similarity and the character-sound weight; the target font similarity determining subunit is used for obtaining the target font similarity according to the square of the font similarity and the font weight; and the target similarity determining subunit is used for determining the target similarity according to the target character pronunciation similarity and the target character style similarity.

The character similarity determining device provided by the embodiment of the invention can execute the character similarity determining method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the executing method.

It should be noted that, the units and modules included in the apparatus are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the embodiment of the invention.

Example four

Fig. 8 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention. FIG. 8 illustrates a block diagram of an exemplary electronic device 40 suitable for use in implementing embodiments of the present invention. The electronic device 40 shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.

As shown in fig. 8, electronic device 40 is embodied in the form of a general purpose computing device. The components of electronic device 40 may include, but are not limited to: one or more processors or processing units 401, a system memory 402, and a bus 403 that couples the various system components (including the system memory 402 and the processing unit 401).

Bus 403 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

Electronic device 40 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by electronic device 40 and includes both volatile and nonvolatile media, removable and non-removable media.

The system memory 402 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)404 and/or cache memory 405. The electronic device 40 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 406 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 8, and commonly referred to as a "hard drive"). Although not shown in FIG. 8, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 403 by one or more data media interfaces. Memory 402 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.

A program/utility 408 having a set (at least one) of program modules 407 may be stored, for example, in memory 402, such program modules 407 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 407 generally perform the functions and/or methods of the described embodiments of the invention.

The electronic device 40 may also communicate with one or more external devices 409 (e.g., keyboard, pointing device, display 410, etc.), with one or more devices that enable a user to interact with the electronic device 40, and/or with any devices (e.g., network card, modem, etc.) that enable the electronic device 40 to communicate with one or more other computing devices. Such communication may be through input/output (I/O) interface 411. Also, the electronic device 40 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 412. As shown, the network adapter 412 communicates with the other modules of the electronic device 40 over the bus 403. It should be appreciated that although not shown in FIG. 8, other hardware and/or software modules may be used in conjunction with electronic device 40, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

The processing unit 401 executes various functional applications and data processing by running a program stored in the system memory 402, for example, to implement the word similarity determination method provided in the embodiment of the present invention.

Example five

The fifth embodiment of the present invention further provides a storage medium containing computer-executable instructions, which are used to execute the word similarity determining method when executed by a computer processor.

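The method comprises the following steps: respectively determining coded character strings corresponding to the target character and the character to be determined according to the character sounds of the target character and the character to be determined; respectively determining character basic attribute information of the target character and the character to be determined, and determining font coding tables of the target character and the character to be determined according to the character basic attribute information, wherein the character basic attribute information comprises a character structure, at least one basic unit forming the character and the stroke number of the character; and determining the target similarity between the target character and the character to be determined based on the coded character strings and the font coding tables.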

Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for embodiments of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

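1. A method for determining character similarity is characterized by comprising the following steps:

respectively determining coded character strings corresponding to the target character and the character to be determined according to the character sounds of the target character and the character to be determined;

respectively determining character basic attribute information of the target character and the character to be determined, and determining font coding tables of the target character and the character to be determined according to the character basic attribute information; the character basic attribute information comprises a character structure, at least one basic unit forming the character and the stroke number of the character;

and determining the target similarity between the target character and the character to be determined based on the coded character strings and the font coding tables.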

2. The method according to claim 1, wherein the determining, according to the pronunciation of the target word and the word to be determined, the encoded character strings corresponding to the target word and the word to be determined respectively comprises:

determining a target coding character string of the target character according to a pre-established character sound coding table and the character sound of the target character; and,

determining a to-be-determined coded character string of the to-be-determined character according to a pre-created character sound coding table and the character sound of the to-be-determined character;

the character pronunciation coding table comprises character pronunciation characteristics and character pronunciation codes corresponding to the character pronunciation characteristics, wherein the character pronunciation characteristics comprise letters corresponding to initials, letters corresponding to finals and tones; the length of the target code character string is consistent with the length of the code character string to be determined.

3. The method according to claim 1, wherein the determining the basic character attribute information of the target character and the character to be determined respectively, and determining the font code table of the target character and the character to be determined according to the basic character attribute information comprises:

determining a target character structure of the target character, a target character unit forming the target character and a target stroke number of the target character according to a preset character basic attribute table; determining a character structure to be determined of the character to be determined, a character unit to be determined forming the character to be determined and the number of strokes to be determined of the character to be determined;

obtaining a target font coding table corresponding to the target character according to the target character structure, the target character unit and the target stroke number; obtaining a font coding table to be determined corresponding to the characters to be determined according to the character structure to be determined, the character units to be determined and the number of strokes to be determined;

the character attribute table comprises font code symbols corresponding to character structures, basic units forming the characters and stroke numbers of the characters.

4. The method of claim 1, wherein the determining the target similarity between the target word and the word to be determined based on the encoding string and the glyph encoding table comprises:

determining the character-sound similarity of the target characters and the characters to be determined according to the target coding character strings of the target characters and the coding character strings to be determined of the characters to be determined;

determining the font similarity of the target character and the character to be determined according to the target font coding table of the target character and the to-be-determined font coding table of the character to be determined;

and determining the target similarity of the target character and the character to be determined according to the character sound similarity and the character form similarity.

5. The method according to claim 4, wherein the determining the character-sound similarity of the target character and the character to be determined according to the target coded character string of the target character and the to-be-determined coded character string of the character to be determined comprises:

aiming at the target coding character string and the pronunciation code corresponding to the same position in the coding character string to be determined, determining the editing distance to be processed corresponding to the current position according to the two pronunciation codes corresponding to the current position;

and determining the character-sound similarity of the target character and the character to be determined according to the editing distance to be processed corresponding to each position.

6. The method of claim 5, wherein determining the editing distance to be processed corresponding to the current position according to the two pronunciation codes corresponding to the current position comprises at least one of the following modes:

if the two pronunciation codes corresponding to the current position are the same, determining that the editing distance to be processed of the current position is a first preset value;

if the two pronunciation codes corresponding to the current position differ but are within a first preset editing distance of each other, determining that the editing distance to be processed of the current position is a second preset value;

and if the two pronunciation codes corresponding to the current position are different and are not within the first preset editing distance of each other, determining that the editing distance to be processed of the current position is a third preset value.
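
For illustration only, the three cases above can be sketched in Python as follows. The preset values 0, 0.5 and 1, the code symbols, and the NEAR_PAIRS table standing in for "within the first preset editing distance" are all assumptions made for this sketch; the claim does not fix any of them.

```python
# Hypothetical preset values; the claim leaves them open.
FIRST_PRESET = 0.0   # the two codes are identical
SECOND_PRESET = 0.5  # the two codes differ but are treated as close
THIRD_PRESET = 1.0   # the two codes are unrelated

# Hypothetical table of code pairs assumed to lie within the
# first preset editing distance of each other.
NEAR_PAIRS = {frozenset(("A1", "A2")), frozenset(("B1", "B2"))}


def position_distance(code_a: str, code_b: str) -> float:
    """Return the editing distance to be processed for one position."""
    if code_a == code_b:
        return FIRST_PRESET
    if frozenset((code_a, code_b)) in NEAR_PAIRS:
        return SECOND_PRESET
    return THIRD_PRESET
```

Storing each near pair as a frozenset makes the lookup symmetric, so the order of the two codes does not matter.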

7. The method according to claim 5, wherein determining the character-sound similarity of the target character and the character to be determined according to the editing distance to be processed corresponding to each position comprises:

determining the character-sound similarity of the target character and the character to be determined according to the editing distance to be processed corresponding to each position and the position length.
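
One possible reading of this step, shown only as a sketch, normalizes the summed per-position distances by the position length. It assumes that the position length equals the number of compared positions and that every per-position distance lies between 0 and 1; the function name sound_similarity is hypothetical, and the claim does not state this formula.

```python
from typing import Sequence


def sound_similarity(distances: Sequence[float]) -> float:
    """Assumed formula: 1 minus the mean per-position editing distance."""
    if not distances:
        raise ValueError("at least one position is required")
    # len(distances) plays the role of the position length (assumption).
    return 1.0 - sum(distances) / len(distances)
```

Under these assumptions, identical coding character strings give a similarity of 1 and strings that differ maximally at every position give 0.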

8. The method of claim 4, wherein the font coding table includes a font structure, the basic units constituting a character, and the stroke number of the character, and determining the font similarity between the target character and the character to be determined according to the target font coding table of the target character and the font coding table to be determined of the character to be determined comprises:

determining a structural similarity value according to the font structures of the target character and the character to be determined;

determining a unit similarity value according to the target basic unit of the target character and the basic unit to be determined of the character to be determined;

determining a stroke similarity value according to the target stroke number of the target character and the number of strokes to be determined of the character to be determined;

and determining the font similarity of the target character and the character to be determined according to the structure similarity value, the unit similarity value and the stroke similarity value.
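
The combination of the three values can be sketched as below. Every formula here is an assumption chosen for illustration, since the claim names the three values but not how they are computed: the structure value is 1 for equal structures and 0 otherwise, the unit value is the overlap of the two unit sets, the stroke value is based on the relative stroke difference, and the result is their plain average. The GlyphCode type and the "left-right" structure label are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class GlyphCode:
    """Hypothetical container for one font coding table entry."""
    structure: str          # e.g. "left-right" (assumed label)
    units: FrozenSet[str]   # basic units forming the character
    strokes: int            # stroke number of the character


def glyph_similarity(a: GlyphCode, b: GlyphCode) -> float:
    # Structure similarity value: exact match or nothing (assumption).
    structure_sim = 1.0 if a.structure == b.structure else 0.0
    # Unit similarity value: shared units over all units (assumption).
    union = a.units | b.units
    unit_sim = len(a.units & b.units) / len(union) if union else 1.0
    # Stroke similarity value: 1 minus relative difference (assumption).
    longest = max(a.strokes, b.strokes)
    stroke_sim = 1.0 - abs(a.strokes - b.strokes) / longest if longest else 1.0
    # Unweighted average of the three values (assumption).
    return (structure_sim + unit_sim + stroke_sim) / 3.0


qing = GlyphCode("left-right", frozenset({"氵", "青"}), 11)   # 清
qing2 = GlyphCode("left-right", frozenset({"忄", "青"}), 11)  # 情
score = glyph_similarity(qing, qing2)
```

With these assumed formulas, 清 and 情 share the structure and stroke number and one of three distinct units, so the score is (1 + 1/3 + 1) / 3.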

9. The method of claim 4, wherein determining the target similarity between the target character and the character to be determined according to the character-sound similarity and the font similarity comprises:

determining the target similarity of the target character and the character to be determined according to a character-sound weight, the character-sound similarity, a font weight and the font similarity.

10. The method of claim 9, wherein determining the target similarity between the target character and the character to be determined according to the character-sound weight, the character-sound similarity, the font weight and the font similarity comprises:

obtaining a target character-sound similarity according to the square of the character-sound similarity and the character-sound weight;

obtaining a target font similarity according to the square of the font similarity and the font weight;

and determining the target similarity according to the target character-sound similarity and the target font similarity.
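
A minimal sketch of this weighting follows. The claim states only that each similarity is squared and combined with its weight; multiplying the square by the weight, summing the two terms, taking the square root, and using default weights of 0.5 are assumptions made here so that the result stays between 0 and 1 when the weights sum to 1.

```python
import math


def target_similarity(sound_sim: float, font_sim: float,
                      sound_weight: float = 0.5,
                      font_weight: float = 0.5) -> float:
    # Target character-sound similarity: weight times square (assumption).
    target_sound = sound_weight * sound_sim ** 2
    # Target font similarity: weight times square (assumption).
    target_font = font_weight * font_sim ** 2
    # Combination by square root of the sum (assumption).
    return math.sqrt(target_sound + target_font)
```

For example, with equal weights, a character-sound similarity of 0.6 and a font similarity of 0.8 give sqrt(0.5 × 0.36 + 0.5 × 0.64) = sqrt(0.5), roughly 0.707.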

11. A character similarity determination device, comprising:

12. An electronic device, characterized in that the electronic device comprises:

one or more processors;

a storage device for storing one or more programs;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the character similarity determining method of any of claims 1-10.

13. A storage medium containing computer-executable instructions which, when executed by a computer processor, perform the character similarity determining method of any of claims 1-10.