CN111222590A - Font-near word determining method, electronic device and computer-readable storage medium - Google Patents

Font-near word determining method, electronic device and computer-readable storage medium

Info

Publication number
CN111222590A
Authority
CN
China
Prior art keywords
character
stroke
sequence
common
included angle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911412330.0A
Other languages
Chinese (zh)
Other versions
CN111222590B (en
Inventor
高岩峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, MIGU Culture Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN201911412330.0A priority Critical patent/CN111222590B/en
Publication of CN111222590A publication Critical patent/CN111222590A/en
Application granted granted Critical
Publication of CN111222590B publication Critical patent/CN111222590B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/24 Character recognition characterised by the processing or recognition method
    • G06V30/242 Division of the character sequences into groups prior to recognition; Selection of dictionaries
    • G06V30/244 Division of the character sequences into groups prior to recognition; Selection of dictionaries using graphical properties, e.g. alphabet type or font
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Character Discrimination (AREA)

Abstract

The embodiment of the invention relates to the technical field of natural language processing, and discloses a method for determining shape-similar characters, an electronic device and a computer-readable storage medium. In the present invention, the method for determining shape-similar characters includes: acquiring the stroke similarity of a first character and a second character; if the stroke similarity is greater than a preset similarity, extracting a common stroke sequence of the first character and the second character, wherein the common stroke sequence includes a plurality of identical strokes of the first character and the second character, the plurality of strokes being consecutive in both the first character and the second character; respectively acquiring a first relative position of the common stroke sequence in the first character and a second relative position of the common stroke sequence in the second character; and determining whether the first character and the second character are shape-similar characters according to the first relative position and the second relative position, so that the accuracy of the shape-similar character determination result can be improved.

Description

Font-near word determining method, electronic device and computer-readable storage medium
Technical Field
The embodiment of the invention relates to the technical field of natural language processing, in particular to a method for determining shape-similar characters, an electronic device and a computer-readable storage medium.
Background
With the development of network technology, shape-similar character recognition is required in many scenarios, for example, identifying variant characters in web reviews, handling text handwritten by a user, and recognizing text in an image. In the related art, a method for identifying shape-similar characters comprises the following steps: obtaining the glyph-based input method code of each Chinese character in a Chinese character set; acquiring the coding distance between each Chinese character and the other Chinese characters in the set according to these codes; and judging, according to the coding distance, whether each Chinese character is a shape-similar character of the other Chinese characters in the set, thereby obtaining a shape-similar character judgment result.
However, the inventors found that at least the following problem exists in the related art: characters that are not shape-similar are identified as shape-similar. For example, the stroke code for "" is 25112112 and the stroke code for "nation" is 25112141; the two codes differ only in the last two digits and therefore satisfy the threshold, so the two characters are recognized as shape-similar characters although in fact they are not. Many similar cases exist, which shows that the shape-similar character judgment result in the related art is not accurate.
Disclosure of Invention
An object of embodiments of the present invention is to provide a shape-near word determination method, an electronic device, and a computer-readable storage medium, which enable the accuracy of a shape-near word determination result to be improved.
In order to solve the above technical problem, an embodiment of the present invention provides a method for determining shape-similar characters, including the following steps: acquiring the stroke similarity of a first character and a second character; if the stroke similarity is greater than a preset similarity, extracting a common stroke sequence of the first character and the second character, wherein the common stroke sequence includes a plurality of identical strokes of the first character and the second character, the plurality of strokes being consecutive in both the first character and the second character; respectively acquiring a first relative position of the common stroke sequence in the first character and a second relative position of the common stroke sequence in the second character; and determining whether the first character and the second character are shape-similar characters according to the first relative position and the second relative position.
The embodiment of the invention also provides a device for determining shape-similar characters, which comprises: a first acquisition module, used for acquiring the stroke similarity of the first character and the second character; an extraction module, used for extracting a common stroke sequence of the first character and the second character if the stroke similarity is greater than a preset similarity, wherein the common stroke sequence includes a plurality of identical strokes of the first character and the second character, the plurality of strokes being consecutive in both the first character and the second character; a second acquisition module, used for respectively acquiring a first relative position of the common stroke sequence in the first character and a second relative position of the common stroke sequence in the second character; and a determining module, used for determining whether the first character and the second character are shape-similar characters according to the first relative position and the second relative position.
An embodiment of the present invention also provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described method for determining shape-similar characters.
Embodiments of the present invention also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described method for determining shape-similar characters.
Compared with the prior art, embodiments of the present invention acquire the stroke similarity of the first character and the second character; if the stroke similarity is greater than the preset similarity, extract a common stroke sequence of the first character and the second character, wherein the common stroke sequence includes a plurality of identical strokes of the first character and the second character, the plurality of strokes being consecutive in both the first character and the second character; respectively acquire a first relative position of the common stroke sequence in the first character and a second relative position of the common stroke sequence in the second character; and determine whether the first character and the second character are shape-similar characters according to the first relative position and the second relative position. A stroke similarity greater than the preset similarity indicates that the first character and the second character are likely to be shape-similar characters. Extracting the common stroke sequence only after the stroke similarity is determined to be greater than the preset similarity helps ensure that the extracted common stroke sequence contains a sufficient number of strokes, so that the first relative position of the common stroke sequence in the first character and the second relative position of the common stroke sequence in the second character better reflect the structural difference between the first character and the second character.
According to the embodiment of the invention, the stroke similarity is combined with the structural difference of the common stroke sequence in the first character and the second character to determine whether the two are shape-similar characters. This avoids the misjudgment caused by relying on stroke similarity alone and improves the accuracy of shape-similar character judgment.
Additionally, the acquiring of a first relative position of the common stroke sequence in the first character and a second relative position in the second character, respectively, comprises: respectively calculating a first position characteristic value corresponding to each stroke of the common stroke sequence in the first character; and respectively calculating a second position characteristic value corresponding to each stroke of the common stroke sequence in the second character. The determining of whether the first character and the second character are shape-similar characters according to the first relative position and the second relative position comprises: determining whether the first character and the second character are shape-similar characters according to the first position characteristic values and the second position characteristic values. That is, each stroke in the common stroke sequence corresponds to a first position characteristic value in the first character and a second position characteristic value in the second character. Combining the first position characteristic value and the second position characteristic value of each stroke in the common stroke sequence helps further improve the accuracy of shape-similar character judgment.
In addition, the respectively calculating of a first position characteristic value corresponding to each stroke of the common stroke sequence in the first character includes: acquiring the endpoint coordinates of each stroke of the common stroke sequence in the first character; and calculating the first position characteristic value according to the endpoint coordinates in the first character. The respectively calculating of a second position characteristic value corresponding to each stroke of the common stroke sequence in the second character includes: acquiring the endpoint coordinates of each stroke of the common stroke sequence in the second character; and calculating the second position characteristic value according to the endpoint coordinates in the second character. This provides a specific way of calculating the first and second position characteristic values: acquiring the endpoint coordinates of each stroke of the common stroke sequence in the first character and in the second character makes it convenient to calculate both characteristic values accurately, which further improves the accuracy of the shape-similar character judgment result.
In addition, the determining of whether the first character and the second character are shape-similar characters according to the n-1 first difference values and the n-1 second difference values includes: if the n-1 first difference values and the n-1 second difference values are all smaller than a preset difference value, determining that the first character and the second character are shape-similar characters. When all of these difference values are smaller than the preset difference value, the structural difference of each stroke between the first character and the second character is small from the second stroke of the common stroke sequence onward, and the two characters are likely to be shape-similar. Determining that the first character and the second character are shape-similar characters only after this condition is met therefore further improves the accuracy of shape-similar character judgment.
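The threshold test on the difference values can be sketched as follows. This is only an illustration under stated assumptions: the text at this point does not define the difference values, so the sketch assumes that each first difference is the absolute difference between one kind of position characteristic value of a stroke in the two characters, and each second difference is the same for a second kind of characteristic value. The function name `is_shape_similar` and the pair layout are hypothetical.

```python
from typing import Sequence, Tuple

# One (value_1, value_2) pair per stroke, from the second common stroke onward.
Feature = Tuple[float, float]


def is_shape_similar(first: Sequence[Feature],
                     second: Sequence[Feature],
                     preset_difference: float) -> bool:
    """Return True only if every first and second difference is below the preset value.

    Assumption: a "first difference" is |a1 - b1| and a "second difference" is
    |a2 - b2| for corresponding strokes of the two characters.
    """
    if len(first) != len(second):
        raise ValueError("both characters must supply n-1 feature pairs")
    for (a1, a2), (b1, b2) in zip(first, second):
        if abs(a1 - b1) >= preset_difference or abs(a2 - b2) >= preset_difference:
            return False
    return True
```

If the characteristic values are signed angles, a real implementation would probably also need to handle wrap-around (for example 179 and -179 degrees), which this sketch deliberately leaves out.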
Drawings
One or more embodiments are illustrated by the corresponding figures in the drawings, which are not meant to be limiting.
FIG. 1 is a flow chart of a method for determining a near word according to a first embodiment of the present invention;
FIG. 2 is a flow chart of a method for determining a near word according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of a font near word determining apparatus according to a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments in order to give the reader a better understanding of the present application; however, the technical solution claimed in the present application can be implemented even without these technical details and with various changes and modifications based on the following embodiments. The following embodiments are divided for convenience of description and should not constitute any limitation on the specific implementation of the present invention; the embodiments may be combined with and refer to each other where no contradiction arises.
The first embodiment of the present invention relates to a method for determining shape-similar characters, which is applied to an electronic device. The electronic device may be a terminal or a server, and this embodiment is not particularly limited in this respect. The implementation details of the method of this embodiment are described below; they are provided only for ease of understanding and are not necessary for implementing this embodiment.
A flowchart of the method for determining a shape-similar word in this embodiment is shown in fig. 1, and specifically includes:
Step 101: The stroke similarity of the first character and the second character is obtained.
The first character and the second character may be Chinese characters. In one example, the first character may be the character for which shape-similar characters are currently being searched, and the second character may be any character in a pre-stored character information base. Specifically, the stroke similarity of the first character and the second character may be obtained as follows:
First, the stroke codes of the first character and the second character are queried. The stroke codes can be queried from a preset Chinese character stroke code table, which records the strokes of each Chinese character in stroke order. For example, the basic strokes may be the 5 basic strokes specified in the modern Chinese universal character table: horizontal, vertical, left-falling, dot and turning. Stroke order refers to the sequence of basic strokes that make up the character, in the order in which the character is written. Each basic stroke may correspond to a stroke identifier, and the stroke order of a character may be represented by a combination of stroke identifiers. The stroke identifier of each basic stroke may be a stroke code; for example, the stroke codes corresponding to the 5 basic strokes horizontal, vertical, left-falling, dot and turning may be the numbers 1, 2, 3, 4 and 5. For example, if the first character is "Kun", the queried stroke code is "25111535"; if the second character is "", the queried stroke code is "25111525".
Next, an edit distance d is calculated between the stroke code of the first character and the stroke code of the second character. The edit distance may be calculated by counting the operations, such as deletion, insertion and replacement, required to convert the stroke code of the first character into the stroke code of the second character; the minimum number of such operations is used as the edit distance.
Then, the similarity between the first character and the second character can be calculated from the edit distance d, the length La of the first character, and the length Lb of the second character. La and Lb may be the numbers of strokes of the first character and the second character, respectively, or equivalently the numbers of digits in their stroke codes. For example, the length of the first character "Kun" is 8, and the length of the second character "" is also 8. In one example, the similarity s may be calculated by the following formula:
L = min(La, Lb)
s = (L - d) / L
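As a worked illustration of the formula, the following sketch computes the edit distance with the standard dynamic-programming recurrence and then applies s = (L - d) / L. The function names are hypothetical and the sketch is not taken from the patent; it simply follows the definitions given above.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of deletions, insertions and replacements turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca by cb, or keep it
        prev = curr
    return prev[-1]


def stroke_similarity(code_a: str, code_b: str) -> float:
    """Similarity s = (L - d) / L with L = min(La, Lb), as in the formula above."""
    shortest = min(len(code_a), len(code_b))
    if shortest == 0:
        raise ValueError("stroke codes must not be empty")
    return (shortest - edit_distance(code_a, code_b)) / shortest


# Stroke codes quoted in the text: they differ in one digit, so d = 1 and L = 8.
print(stroke_similarity("25111535", "25111525"))  # 0.875
```

For the pair quoted in the Background (25112112 and 25112141) the same function gives d = 2 and s = 0.75, which illustrates how two codes can score highly on stroke similarity alone.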
Step 102: Determine whether the stroke similarity is greater than the preset similarity; if yes, go to step 103; otherwise, go to step 106.
The preset similarity may be set according to actual needs, and this embodiment is not particularly limited thereto.
Step 103: a common stroke sequence of the first character and the second character is extracted.
The common stroke sequence comprises a plurality of identical strokes of the first character and the second character, and the plurality of strokes are consecutive in both the first character and the second character. Specifically, the strokes of the first character and of the second character may each be traversed, and the common stroke sequence of the two characters may be selected.
In one example, a common stroke sequence, which includes several consecutive strokes, may be extracted by a Longest Common Subsequence (LCS) algorithm. It will be appreciated that the common stroke sequence extracted by the LCS algorithm is the longest common stroke sequence between the first character and the second character. For example: the stroke sequence of the first character is "ABCDE", the stroke sequence of the second character is "ABCDE", and the extracted common stroke sequence is "ABC". For another example: the stroke sequence of the first character is "ABCDE", the stroke sequence of the second character is "FEABC", and the extracted common stroke sequence is also "ABC". That is, the common stroke sequence is a continuous whole that can appear anywhere in the first character and the second character. It should be noted that extracting the common stroke sequence by the LCS algorithm is only an example in this embodiment, and the manner of extracting the common stroke sequence in a specific implementation is not limited thereto.
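A minimal sketch of the extraction follows. Because the text requires the common strokes to be consecutive in both characters, the sketch uses the longest common contiguous run (the classic longest-common-substring dynamic program) rather than a gap-tolerant subsequence; that reading, and the function name `common_stroke_sequence`, are assumptions of this illustration.

```python
def common_stroke_sequence(a: str, b: str) -> str:
    """Longest run of strokes that is consecutive in both stroke sequences.

    When several runs share the maximum length, the one found first in `a` wins.
    """
    best_len = 0
    best_end = 0  # exclusive end index of the best run inside a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


print(common_stroke_sequence("ABCDE", "FEABC"))        # ABC
print(common_stroke_sequence("25111535", "25111525"))  # 251115
```

The first call reproduces the "ABCDE" / "FEABC" example from the text; the second applies the same function to the two stroke codes quoted in step 101.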
Step 104: a first relative position of the common stroke sequence in the first character and a second relative position in the second character are obtained, respectively.
Specifically, each stroke of a character has a relative position. For example, the character can be regarded as being written in a field-shaped (田) writing grid, in which each stroke of the character has a corresponding position. The first relative position of the common stroke sequence in the first character may be understood as the position of the common stroke sequence of the first character within the grid, such as the top left, bottom left, top right, bottom right or middle of the grid. The second relative position is understood in the same way and is not described again here.
In one example, the first relative position of the common stroke sequence in the first character may include: the position of each stroke in the sequence of common strokes in the first character relative to the previous stroke. The second relative position of the common stroke sequence in the second character may include: the position of each stroke in the sequence of common strokes in the second character relative to the previous stroke. Wherein, the position of each stroke relative to the previous stroke may be: up, down, left, right, left up, left down, right up, right down, cross, etc.
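The per-stroke relative positions could be derived, for instance, from one representative point per stroke. The sketch below is only an illustration: the choice of a representative point, the screen-style axes (y grows downward), the tolerance and the function names are all assumptions, and the "cross" relation mentioned above is not modelled.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def relative_direction(prev: Point, curr: Point, tol: float = 1e-9) -> str:
    """Direction of `curr` relative to `prev`, assuming y grows downward."""
    dx = curr[0] - prev[0]
    dy = curr[1] - prev[1]
    horiz = "left" if dx < -tol else "right" if dx > tol else ""
    vert = "up" if dy < -tol else "down" if dy > tol else ""
    if horiz and vert:
        return f"{horiz} {vert}"  # e.g. "left up", "right down"
    return horiz or vert or "same"


def relative_positions(points: Sequence[Point]) -> List[str]:
    """One label per stroke, starting from the second stroke of the common sequence."""
    return [relative_direction(p, q) for p, q in zip(points, points[1:])]


print(relative_positions([(0, 0), (1, 0), (1, 1), (0, 0)]))
# ['right', 'down', 'left up']
```

Two characters whose common stroke sequences yield the same list of labels would then have matching relative positions in the sense of this example.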
Step 105: Whether the first character and the second character are shape-similar characters is determined according to the first relative position and the second relative position.
For example, the closer the first relative position is to the second relative position, the closer the structure of the common stroke sequence in the first character is to its structure in the second character, and the more similar the two characters are; whether the first character and the second character are shape-similar characters may be determined on this basis. For another example, the first character and the second character may be placed in the same grid with the center points of both characters coinciding with the center point of the grid, and the overlap rate of the first relative position and the second relative position may be obtained; if the overlap rate is high, the first character and the second character are determined to be shape-similar characters.
Step 106: It is determined that the first character and the second character are not shape-similar characters.
The above examples in the present embodiment are only for convenience of understanding, and do not limit the technical aspects of the present invention.
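The control flow of steps 101 to 106 can be summarised in a short sketch. The individual operations are passed in as callables because this embodiment leaves their concrete form open; the function name, the parameter names and the threshold value used in the usage lines are hypothetical.

```python
from typing import Callable, TypeVar

P = TypeVar("P")  # whatever type is used to represent a relative position


def determine_shape_similar(first: str,
                            second: str,
                            *,
                            preset_similarity: float,
                            stroke_similarity: Callable[[str, str], float],
                            extract_common: Callable[[str, str], str],
                            relative_position: Callable[[str, str], P],
                            positions_match: Callable[[P, P], bool]) -> bool:
    # Steps 101 and 102: gate on stroke similarity.
    if stroke_similarity(first, second) <= preset_similarity:
        return False  # Step 106: not shape-similar
    # Step 103: common stroke sequence of the two characters.
    common = extract_common(first, second)
    # Step 104: relative position of the common sequence in each character.
    first_position = relative_position(common, first)
    second_position = relative_position(common, second)
    # Step 105: decide from the two relative positions.
    return positions_match(first_position, second_position)


# Usage with stand-in callables (placeholders, not real measurements).
print(determine_shape_similar(
    "a", "b",
    preset_similarity=0.7,
    stroke_similarity=lambda x, y: 0.9,
    extract_common=lambda x, y: "c",
    relative_position=lambda common, char: "top left",
    positions_match=lambda p, q: p == q,
))  # True
```

Keeping the similarity gate first means the more expensive position comparison runs only for candidate pairs that already pass the stroke similarity test.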
Compared with the prior art, in this embodiment a stroke similarity of the first character and the second character that is greater than the preset similarity indicates that the two characters are likely to be shape-similar characters. Extracting the common stroke sequence only after the stroke similarity is determined to be greater than the preset similarity helps ensure that the extracted common stroke sequence contains a sufficient number of strokes, so that the first relative position of the common stroke sequence in the first character and the second relative position of the common stroke sequence in the second character better reflect the structural difference between the first character and the second character. The embodiment combines the stroke similarity with the structural difference of the common stroke sequence in the two characters to determine whether they are shape-similar characters, which avoids the misjudgment caused by relying on stroke similarity alone and improves the accuracy of shape-similar character judgment.
A second embodiment of the present invention relates to a method for determining shape-similar characters. The implementation details of the method of this embodiment are described below; they are provided only for ease of understanding and are not necessary for implementing this embodiment.
A flowchart of the method for determining shape-similar characters in this embodiment is shown in FIG. 2, and specifically includes:
Step 201: The stroke similarity of the first character and the second character is obtained.
Step 202: determining whether the stroke similarity is greater than a preset similarity; if yes, go to step 203, otherwise go to step 207.
Step 203: a common stroke sequence of the first character and the second character is extracted.
Steps 201 to 203 are substantially the same as steps 101 to 103 in the first embodiment and are not described again here.
Step 204: A first position characteristic value corresponding to each stroke of the common stroke sequence in the first character is calculated.
Specifically, the end point coordinates of each stroke in the common stroke sequence in the first character may be obtained first. For example, the endpoint coordinates of each stroke in the common stroke sequence in the first character may be obtained by querying a preset encoding table of the endpoint coordinates of the strokes of the Chinese character. The Chinese character stroke end point coordinate coding table can be made in advance according to actual needs and is used for inquiring the end point coordinate of each stroke of a Chinese character. The endpoint coordinates may include: the coordinates of the starting point, the coordinates of the ending point, and the coordinates of the turning point when the turning point exists.
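One possible in-memory shape for such a table is sketched below. The coordinates are invented for illustration on a hypothetical 100 x 100 grid with y growing downward; the table name, the function name and the slicing convention (offset and length of the common stroke sequence within the character's stroke order) are likewise assumptions.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Stroke = List[Point]  # start point, optional turning points, end point

# Hypothetical table: character -> strokes in writing order.
STROKE_ENDPOINTS: Dict[str, List[Stroke]] = {
    "十": [
        [(0.0, 50.0), (100.0, 50.0)],                # horizontal
        [(50.0, 0.0), (50.0, 100.0)],                # vertical
    ],
    "口": [
        [(10.0, 10.0), (10.0, 90.0)],                # vertical
        [(10.0, 10.0), (90.0, 10.0), (90.0, 90.0)],  # turning stroke, one turning point
        [(10.0, 90.0), (90.0, 90.0)],                # horizontal
    ],
}


def common_stroke_endpoints(char: str, offset: int, length: int,
                            table: Dict[str, List[Stroke]] = STROKE_ENDPOINTS) -> List[Stroke]:
    """Endpoint coordinates of the common strokes, given where they start in `char`."""
    strokes = table[char]
    if offset < 0 or offset + length > len(strokes):
        raise ValueError("common stroke sequence does not fit inside the character")
    return strokes[offset:offset + length]


second_stroke = common_stroke_endpoints("口", 1, 2)[0]
print(second_stroke[0], second_stroke[1:-1], second_stroke[-1])
# (10.0, 10.0) [(90.0, 10.0)] (90.0, 90.0)
```

Storing each stroke as an ordered list of points keeps the start point, the turning points and the end point in one place, which is what the angle calculations in this step need.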
Further, a first position feature value may be calculated based on the endpoint coordinates in the first character.
In one example, the first position characteristic value may include a first included angle, which may be obtained as follows: according to the endpoint coordinates in the first character, starting from the second stroke of the common stroke sequence in the first character, sequentially calculate the first included angle between the horizontal line and the line connecting the first end point of each stroke to the last end point of the previous stroke. That is, starting with the second stroke in the common stroke sequence, each stroke corresponds to one first included angle. In a specific implementation, the first included angles may form a first angle sequence, denoted (ad2, ad3, ad4, ..., adn), where ad2 denotes the first included angle corresponding to the second stroke in the common stroke sequence, ad3 and ad4 are defined similarly, n denotes the number of strokes in the common stroke sequence, and adn denotes the first included angle corresponding to the last stroke in the common stroke sequence.
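The first angle sequence can be sketched as below. The angle convention is an assumption of this illustration: the text does not say whether the angle is signed or in which range it lies, so the sketch returns the signed angle in degrees from `math.atan2`, measured from the positive x axis. The function name and the sample coordinates are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Stroke = Sequence[Point]  # start point, optional turning points, end point


def connection_angles(strokes: Sequence[Stroke]) -> List[float]:
    """(ad2, ..., adn): for each stroke from the second one, the angle between the
    horizontal and the line from the previous stroke's last end point to this
    stroke's first end point, in degrees within (-180, 180]."""
    angles = []
    for prev, curr in zip(strokes, strokes[1:]):
        x0, y0 = prev[-1]  # last end point of the previous stroke
        x1, y1 = curr[0]   # first end point of the current stroke
        # If the two points coincide, atan2(0, 0) returns 0.0.
        angles.append(math.degrees(math.atan2(y1 - y0, x1 - x0)))
    return angles


sample = [
    [(0.0, 0.0), (10.0, 0.0)],
    [(10.0, 10.0), (10.0, 20.0)],
    [(20.0, 20.0), (30.0, 20.0)],
]
print(connection_angles(sample))  # approximately [90.0, 0.0]
```

A sequence of n strokes yields n-1 angles, matching the indexing ad2 to adn used above.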
In one example, the first position characteristic value may include a third included angle, which may be obtained as follows: according to the endpoint coordinates in the first character, starting from the second stroke of the common stroke sequence in the first character, sequentially calculate the third included angle between the first line segment of each stroke and the last line segment of the previous stroke. That is, starting with the second stroke in the common stroke sequence, each stroke corresponds to one third included angle. In a specific implementation, the third included angles may form a third angle sequence, denoted (ax2, ax3, ax4, ..., axn), where ax2 denotes the third included angle corresponding to the second stroke in the common stroke sequence, ax3 and ax4 are defined similarly, n denotes the number of strokes in the common stroke sequence, and axn denotes the third included angle corresponding to the last stroke in the common stroke sequence.
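A corresponding sketch for the third angle sequence is given below. It assumes that a stroke with turning points is a polyline, so its "first line" is the segment from the start point to the next point and its "last line" is the segment ending at the end point; it also assumes an unsigned angle in [0, 180] degrees. Both are interpretations for illustration, and the names and coordinates are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Stroke = Sequence[Point]  # start point, optional turning points, end point


def _direction(p: Point, q: Point) -> float:
    """Direction of the segment p -> q in radians."""
    return math.atan2(q[1] - p[1], q[0] - p[0])


def segment_angles(strokes: Sequence[Stroke]) -> List[float]:
    """(ax2, ..., axn): unsigned angle in degrees between the first segment of each
    stroke and the last segment of the previous stroke."""
    angles = []
    for prev, curr in zip(strokes, strokes[1:]):
        last_of_prev = _direction(prev[-2], prev[-1])
        first_of_curr = _direction(curr[0], curr[1])
        diff = abs(math.degrees(first_of_curr - last_of_prev)) % 360.0
        angles.append(min(diff, 360.0 - diff))  # fold into [0, 180]
    return angles


# Invented coordinates for a box-like character: vertical, turning stroke, horizontal.
box = [
    [(10.0, 10.0), (10.0, 90.0)],
    [(10.0, 10.0), (90.0, 10.0), (90.0, 90.0)],
    [(10.0, 90.0), (90.0, 90.0)],
]
print(segment_angles(box))  # approximately [90.0, 90.0]
```

Each stroke needs at least two points for its segments to be defined, which holds for any stroke that has distinct start and end points.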
In one example, the first position characteristic value may include both the first included angle and the third included angle. Starting with the second stroke in the common stroke sequence, each stroke corresponds to a 2-tuple comprising the first included angle and the third included angle. That is, the first angle sequence and the third angle sequence may form a sequence of 2-tuples, denoted (ad2, ax2), (ad3, ax3), (ad4, ax4), ..., (adn, axn).
It should be noted that, in a specific implementation, the first position characteristic value may also include other angles, and this embodiment is only an example of the two types of included angles, and is not limited to the two types of included angles in the specific implementation.
Step 205: A second position characteristic value corresponding to each stroke of the common stroke sequence in the second character is calculated.
Specifically, the end point coordinates of each stroke in the common stroke sequence in the second character may be obtained first. For example, the endpoint coordinates of each stroke in the common stroke sequence in the second character may be obtained by querying a preset encoding table of the endpoint coordinates of the strokes of the Chinese character.
Further, a second position feature value may be calculated based on the endpoint coordinates in the second character.
In one example, the second position characteristic value may include a second included angle, which may be obtained as follows: according to the endpoint coordinates in the second character, starting from the second stroke of the common stroke sequence in the second character, sequentially calculate the second included angle between the horizontal line and the line connecting the first endpoint of each stroke to the last endpoint of the previous stroke. That is, starting from the second stroke in the common stroke sequence, each stroke corresponds to one second included angle. In a specific implementation, the second included angles may form a second included angle sequence, represented as (bd2, bd3, bd4, …, bdn), where bd2 is the second included angle corresponding to the second stroke in the common stroke sequence, bd3 and bd4 are defined similarly, n is the number of strokes in the common stroke sequence, and bdn is the second included angle corresponding to the last stroke in the common stroke sequence.
In one example, the second position characteristic value may include a fourth included angle, which may be obtained as follows: according to the endpoint coordinates in the second character, starting from the second stroke of the common stroke sequence in the second character, sequentially calculate the fourth included angle between the first line of each stroke and the last line of the previous stroke. That is, starting from the second stroke in the common stroke sequence, each stroke corresponds to one fourth included angle. In a specific implementation, the fourth included angles may form a fourth included angle sequence, represented as (bx2, bx3, bx4, …, bxn), where bx2 is the fourth included angle corresponding to the second stroke in the common stroke sequence, bx3 and bx4 are defined similarly, n is the number of strokes in the common stroke sequence, and bxn is the fourth included angle corresponding to the last stroke in the common stroke sequence.
In one example, the second position characteristic value may include both the second included angle and the fourth included angle. That is, starting from the second stroke in the common stroke sequence, each stroke corresponds to a two-tuple comprising the second included angle and the fourth included angle. In other words, the second and fourth included angle sequences described above may constitute a sequence of two-tuples, represented as (bd2, bx2), (bd3, bx3), (bd4, bx4), …, (bdn, bxn).
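The angle between the first line of a stroke and the last line of the previous stroke could be computed along the following lines. This is only a sketch: the name `segment_angles`, the representation of a stroke as a list of (x, y) endpoints with at least two points, the sample coordinates, and the choice of an unsigned angle in [0, 180] degrees are assumptions not stated in the source.

```python
import math


def segment_angles(strokes):
    """For each stroke from the second one onward, return the unsigned angle
    in degrees between the last segment of the previous stroke and the first
    segment of the current stroke."""
    angles = []
    for prev, cur in zip(strokes, strokes[1:]):
        # Direction of the previous stroke's last segment.
        ux, uy = prev[-1][0] - prev[-2][0], prev[-1][1] - prev[-2][1]
        # Direction of the current stroke's first segment.
        vx, vy = cur[1][0] - cur[0][0], cur[1][1] - cur[0][1]
        cross = ux * vy - uy * vx
        dot = ux * vx + uy * vy
        angles.append(math.degrees(math.atan2(abs(cross), dot)))
    return angles


# Hypothetical endpoints: horizontal stroke, vertical stroke, diagonal stroke.
strokes = [
    [(0, 0), (10, 0)],
    [(5, -5), (5, 5)],
    [(0, 0), (10, 10)],
]
bx = segment_angles(strokes)  # plays the role of (bx2, bx3)
```

Using `atan2` on the cross and dot products avoids the numerical problems of `acos` near 0 and 180 degrees.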
It should be noted that, in a specific implementation, the second position characteristic value may also include other angles, and this embodiment is only an example of the two types of included angles, and is not limited to the two types of included angles in the specific implementation.
In addition, the execution order of step 204 and step 205 is not limited to executing step 204 first and then step 205. In a specific implementation, step 204 and step 205 may be executed at the same time, or step 205 may be executed first and then step 204.
Step 206: Determine whether the first character and the second character are similar characters according to the first position characteristic value and the second position characteristic value.
Specifically, a comparison relationship between the first position characteristic value and the second position characteristic value can be obtained; and determining whether the first character and the second character are similar characters or not according to the comparison relationship.
In one example, the first position characteristic value includes the first included angle and the second position characteristic value includes the second included angle. The comparison relationship between the first position characteristic value and the second position characteristic value may be a first comparison relationship, which may be a difference relationship or a ratio relationship. Taking the difference relationship as an example, n-1 first differences may be obtained according to the first comparison relationship, where n is the number of strokes in the common stroke sequence: starting from the second stroke in the common stroke sequence, the difference between the first included angle and the second included angle corresponding to each stroke is calculated in sequence. That is, starting from the second stroke, each stroke in the common stroke sequence corresponds to one first difference, such as |ad2-bd2| for the second stroke, |ad3-bd3| for the third stroke, and |adn-bdn| for the nth stroke. The n-1 first differences are expressed as (|ad2-bd2|, |ad3-bd3|, |ad4-bd4|, …, |adn-bdn|); that is, each first difference may be the absolute value of the difference between the two included angles. If the first comparison relationship is a ratio relationship, n-1 first ratios can be obtained according to the first comparison relationship, expressed as (ad2/bd2, ad3/bd3, ad4/bd4, …, adn/bdn).
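The n-1 first differences and first ratios can be illustrated with a short sketch. The function names and the sample angle values are hypothetical, and the handling of a zero denominator in the ratio is an assumption, since the source does not say what happens when an included angle is zero.

```python
def angle_differences(first_angles, second_angles):
    """Return (|ad2-bd2|, ..., |adn-bdn|) for two equally long angle sequences."""
    if len(first_angles) != len(second_angles):
        raise ValueError("angle sequences must have the same length")
    return [abs(a - b) for a, b in zip(first_angles, second_angles)]


def angle_ratios(first_angles, second_angles):
    """Return (ad2/bd2, ..., adn/bdn). Assumes no second angle is zero;
    a zero denominator raises ZeroDivisionError in this sketch."""
    if len(first_angles) != len(second_angles):
        raise ValueError("angle sequences must have the same length")
    return [a / b for a, b in zip(first_angles, second_angles)]


# Hypothetical angle sequences for a four-stroke common sequence (n - 1 = 3).
ad = [90.0, 180.0, 45.0]
bd = [88.0, 175.0, 50.0]
first_differences = angle_differences(ad, bd)
first_ratios = angle_ratios(ad, bd)
```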
In a specific implementation, whether the first character and the second character are similar characters or not may be determined according to the first comparison relationship, and a specific manner may be as follows:
In one example, the first comparison relationship is a difference relationship, and if the n-1 first differences are all smaller than a preset difference, it can be determined that the first character and the second character are similar characters. The preset difference may be set according to actual needs and is not particularly limited in this embodiment. In a specific implementation, if most of the n-1 first differences are smaller than the preset difference, that is, only a few individual first differences are larger than the preset difference, it may also be determined that the first character and the second character are similar characters. However, this embodiment is not particularly limited thereto.
In another example, the first comparison relationship is a ratio relationship, and if the n-1 first ratios are all larger than a preset ratio, the first character and the second character can be determined to be similar characters. The preset ratio may be set according to actual needs and is not particularly limited in this embodiment. In a specific implementation, if most of the n-1 first ratios are greater than the preset ratio, the first character and the second character can also be determined to be similar characters. However, this embodiment is not particularly limited thereto.
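A threshold check of this kind might look like the sketch below. The parameter `max_outliers`, which models the "most of the differences" variant, is a hypothetical knob: the source does not quantify how many individual differences may exceed the preset difference.

```python
def similar_by_differences(differences, preset_difference, max_outliers=0):
    """Return True if at most `max_outliers` differences are not smaller
    than `preset_difference` (max_outliers=0 means all must be smaller)."""
    outliers = sum(1 for d in differences if d >= preset_difference)
    return outliers <= max_outliers


# Hypothetical first differences and preset difference, in degrees.
strict = similar_by_differences([2.0, 5.0, 5.0], preset_difference=10.0)
one_bad = similar_by_differences([2.0, 5.0, 30.0], preset_difference=10.0)
tolerant = similar_by_differences([2.0, 5.0, 30.0], preset_difference=10.0,
                                  max_outliers=1)
```

With `max_outliers=0` the function implements the strict "all differences smaller" rule; a small positive value implements the relaxed rule.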
Optionally, if the first comparison relationship includes a difference relationship and a ratio relationship, the determination manners in the two examples may be combined, for example, if n-1 first ratios are all greater than a preset ratio and n-1 first differences are all less than the preset difference, it may be determined that the first character and the second character are similar characters. However, in specific implementations, this is not a limitation.
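The combined judgment could be sketched as follows. The function name, the sample angles, and both preset values are hypothetical, and the sketch takes the source's "ratio greater than a preset ratio" condition literally on ad/bd without normalizing which angle is the numerator.

```python
def similar_combined(first_angles, second_angles, preset_difference, preset_ratio):
    """Return True only if every difference is below `preset_difference`
    and every ratio first/second is above `preset_ratio`.
    Assumes equally long sequences and no zero in `second_angles`."""
    pairs = list(zip(first_angles, second_angles))
    differences_ok = all(abs(a - b) < preset_difference for a, b in pairs)
    ratios_ok = all(a / b > preset_ratio for a, b in pairs)
    return differences_ok and ratios_ok


# Hypothetical angle sequences, in degrees.
ad = [90.0, 180.0, 45.0]
bd = [88.0, 175.0, 50.0]
loose = similar_combined(ad, bd, preset_difference=10.0, preset_ratio=0.85)
tight = similar_combined(ad, bd, preset_difference=10.0, preset_ratio=0.95)
```

In the second call the third ratio (45/50 = 0.9) falls below the preset ratio, so the characters are not judged similar even though all differences pass.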
In another example, the first position characteristic value includes the third included angle and the second position characteristic value includes the fourth included angle. The comparison relationship between the first position characteristic value and the second position characteristic value may be a second comparison relationship, which may also be a difference relationship or a ratio relationship. Taking the difference relationship as an example, n-1 second differences can be obtained according to the second comparison relationship: starting from the second stroke in the common stroke sequence, the difference between the third included angle and the fourth included angle corresponding to each stroke is calculated in sequence. That is, starting from the second stroke in the common stroke sequence, each stroke corresponds to one second difference, and the n-1 second differences are expressed as (|ax2-bx2|, |ax3-bx3|, |ax4-bx4|, …, |axn-bxn|). If the second comparison relationship is a ratio relationship, n-1 second ratios can be obtained according to the second comparison relationship, expressed as (ax2/bx2, ax3/bx3, ax4/bx4, …, axn/bxn).
In a specific implementation, whether the first character and the second character are similar characters or not may be determined according to the second comparison relationship, and a specific manner may be as follows:
In one example, the second comparison relationship is a difference relationship, and if the n-1 second differences are all smaller than a preset difference, it can be determined that the first character and the second character are similar characters. In a specific implementation, if most of the n-1 second differences are smaller than the preset difference, that is, only a few individual second differences are larger than the preset difference, it may also be determined that the first character and the second character are similar characters. However, this embodiment is not particularly limited thereto.
In another example, the second comparison relationship is a ratio relationship, and if the n-1 second ratios are all larger than a preset ratio, the first character and the second character can be determined to be similar characters. In a specific implementation, if most of the n-1 second ratios are greater than the preset ratio, the first character and the second character can also be determined to be similar characters. However, this embodiment is not particularly limited thereto.
Optionally, if the second comparison relationship includes both a difference relationship and a ratio relationship, the determination manners in the two examples above may be combined. For example, if the n-1 second ratios are all greater than the preset ratio and the n-1 second differences are all less than the preset difference, it may be determined that the first character and the second character are similar characters. However, in specific implementations, this is not a limitation.
In one example, the first position feature value may include both the first angle and the third angle, which may be represented as a doublet (ad, ax), and the second position feature value may include both the second angle and the fourth angle, which may be represented as a doublet (bd, bx). The comparison relationship between the first position characteristic value and the second position characteristic value may include both the first comparison relationship and the second comparison relationship. It will be appreciated that, starting with the second stroke in the common sequence of strokes, each stroke may correspond to a bigram difference, i.e. there are n-1 bigram differences, expressed as: (| ad2-bd2|, | ax2-bx2|) (| ad3-bd3|, | ax3-bx3|) (| ad4-bd4|, | ax4-bx4|) … … (| adn-bdn |, | axn-bxn |). Taking the ratio relationship as an example, the corresponding n-1 binary ratios are expressed as: (ad2/bd2, ax2/bx2) (ad3/bd3, ax3/bx3) (ad4/bd4, ax4/bx4) … … (adn/bdn, axn/bxn).
In a specific implementation, whether the first character and the second character are similar characters or not may be determined according to the first comparison relationship and the second comparison relationship, and a specific manner may be as follows:
In one example, if the first comparison relationship and the second comparison relationship are both difference relationships, n-1 first differences may be obtained according to the first comparison relationship, n-1 second differences may be obtained according to the second comparison relationship, and whether the first character and the second character are similar characters may be determined according to the n-1 first differences and the n-1 second differences, namely (|ad2-bd2|, |ax2-bx2|), (|ad3-bd3|, |ax3-bx3|), (|ad4-bd4|, |ax4-bx4|), …, (|adn-bdn|, |axn-bxn|). If the n-1 first differences and the n-1 second differences are all smaller than the preset difference, it is determined that the first character and the second character are similar characters.
In another example, if the first comparison relationship and the second comparison relationship are both ratio relationships, n-1 first ratios can be obtained according to the first comparison relationship, n-1 second ratios can be obtained according to the second comparison relationship, and whether the first character and the second character are similar characters can be determined according to the n-1 first ratios and the n-1 second ratios, namely (ad2/bd2, ax2/bx2), (ad3/bd3, ax3/bx3), (ad4/bd4, ax4/bx4), …, (adn/bdn, axn/bxn). If the n-1 first ratios and the n-1 second ratios are all larger than the preset ratio, it is determined that the first character and the second character are similar characters.
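When both angle types are used, the judgment on two-tuple differences might be sketched like this. The function name, the tuple layout, and the sample values are hypothetical; a single preset difference is applied to both components here, although an implementation could equally use separate thresholds.

```python
def similar_by_tuple_differences(first_tuples, second_tuples, preset_difference):
    """first_tuples holds (ad_i, ax_i) and second_tuples holds (bd_i, bx_i)
    for i = 2..n. Return True if every component difference is smaller
    than `preset_difference`."""
    if len(first_tuples) != len(second_tuples):
        raise ValueError("tuple sequences must have the same length")
    differences = [
        (abs(ad - bd), abs(ax - bx))
        for (ad, ax), (bd, bx) in zip(first_tuples, second_tuples)
    ]
    return all(d1 < preset_difference and d2 < preset_difference
               for d1, d2 in differences)


# Hypothetical two-tuples for a three-stroke common sequence, in degrees.
first = [(90.0, 90.0), (180.0, 45.0)]
second = [(88.0, 92.0), (175.0, 60.0)]
narrow = similar_by_tuple_differences(first, second, preset_difference=10.0)
wide = similar_by_tuple_differences(first, second, preset_difference=20.0)
```

Here the second tuple differs by 15 degrees in its second component, so the result depends on whether the preset difference is 10 or 20.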
Optionally, the first comparison relationship and the second comparison relationship both include a difference relationship and a ratio relationship, that is, the determination manners in the two examples may be combined. And if the n-1 first ratios and the n-1 second ratios are both larger than the preset ratio, and the n-1 first differences and the n-1 second differences are both smaller than the preset difference, determining that the first character and the second character are similar characters. However, in specific implementations, this is not a limitation.
Step 207: Determine that the first character and the second character are not similar characters.
The above examples in the present embodiment are only for convenience of understanding, and do not limit the technical aspects of the present invention.
Compared with the prior art, this embodiment combines the first position characteristic value and the second position characteristic value corresponding to each stroke in the common stroke sequence to determine whether the first character and the second character are similar characters. A specific way of calculating the first position characteristic value and the second position characteristic value is provided: the endpoint coordinates of each stroke of the common stroke sequence are queried in the first character and in the second character respectively, so that the first position characteristic value and the second position characteristic value can be calculated conveniently and accurately. This improves the accuracy and convenience of obtaining the degree of structural difference, and thereby further improves the accuracy of the similar-character determination result.
The steps of the above methods are divided only for clarity of description. In implementation, steps may be combined into one step, or a step may be split into multiple steps; as long as the same logical relationship is included, such variations are within the protection scope of this patent. Adding insignificant modifications to, or introducing insignificant designs into, an algorithm or process without changing its core design is also within the protection scope of this patent.
A third embodiment of the present invention relates to a shape-near word determination device which, as shown in fig. 3, includes: a first obtaining module 301, configured to obtain the stroke similarity between a first character and a second character; an extracting module 302, configured to extract a common stroke sequence of the first character and the second character if the stroke similarity is greater than a preset similarity, wherein the common stroke sequence includes a number of identical strokes of the first character and the second character, and these strokes are consecutive in both the first character and the second character; a second obtaining module 303, configured to obtain a first relative position of the common stroke sequence in the first character and a second relative position of the common stroke sequence in the second character, respectively; and a determining module 304, configured to determine whether the first character and the second character are similar characters according to the first relative position and the second relative position.
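One way to picture how the four modules cooperate is the sketch below, in which each module is an injected callable. The class name, field names, default threshold, and the stub functions in the usage example are all hypothetical; the source describes the modules only functionally and does not prescribe this structure.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class SimilarCharacterDeterminer:
    get_stroke_similarity: Callable[[str, str], float]        # module 301
    extract_common_sequence: Callable[[str, str], Sequence]   # module 302
    get_relative_position: Callable[[Sequence, str], Any]     # module 303
    decide: Callable[[Any, Any], bool]                        # module 304
    preset_similarity: float = 0.8  # hypothetical threshold

    def determine(self, first: str, second: str) -> bool:
        # The common sequence is extracted only when the stroke similarity
        # exceeds the preset similarity.
        if self.get_stroke_similarity(first, second) <= self.preset_similarity:
            return False
        common = self.extract_common_sequence(first, second)
        first_position = self.get_relative_position(common, first)
        second_position = self.get_relative_position(common, second)
        return self.decide(first_position, second_position)


# Usage with stub modules that return made-up values.
determiner = SimilarCharacterDeterminer(
    get_stroke_similarity=lambda a, b: 0.9,
    extract_common_sequence=lambda a, b: ["h", "v"],
    get_relative_position=lambda seq, ch: [90.0] if ch == "x" else [88.0],
    decide=lambda p1, p2: all(abs(u - v) < 10 for u, v in zip(p1, p2)),
)
result = determiner.determine("x", "y")
```

Injecting the modules as callables mirrors the statement that each module is a logical module whose physical realization may vary.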
It should be understood that this embodiment is an example of the apparatus corresponding to the first or second embodiment, and may be implemented in cooperation with the first or second embodiment. The related technical details mentioned in the first or second embodiment are still valid in this embodiment, and the technical effects that can be achieved in the first or second embodiment can also be achieved in this embodiment, and are not described here again in order to reduce repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the first or second embodiment.
It should be noted that each module referred to in this embodiment is a logical module, and in practical applications, one logical unit may be one physical unit, may be a part of one physical unit, and may be implemented by a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, elements that are not so closely related to solving the technical problems proposed by the present invention are not introduced in the present embodiment, but this does not indicate that other elements are not present in the present embodiment.
A fourth embodiment of the invention relates to an electronic device which, as shown in fig. 4, comprises at least one processor 401 and a memory 402 communicatively coupled to the at least one processor 401. The memory 402 stores instructions executable by the at least one processor 401, and the instructions are executed by the at least one processor 401 so that the at least one processor 401 can execute the shape-near word determining method in the first or second embodiment.
The memory 402 and the processor 401 are coupled by a bus, which may include any number of interconnected buses and bridges that couple one or more of the various circuits of the processor 401 and the memory 402 together. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore are not described further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor 401 may be transmitted over a wireless medium via an antenna, which may also receive data and transmit the data to the processor 401.
The processor 401 is responsible for managing the bus and general processing and may provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 402 may be used to store data used by processor 401 in performing operations.
A fifth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.
That is, as can be understood by those skilled in the art, all or part of the steps in the method for implementing the embodiments described above may be implemented by a program instructing related hardware, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.

Claims (10)

1. A method for determining a shape near word, comprising:
acquiring stroke similarity of a first character and a second character;
if the stroke similarity is larger than the preset similarity, extracting a common stroke sequence of the first character and the second character; wherein the common sequence of strokes includes a same number of strokes of the first character and the second character, the number of strokes being consecutive in both the first character and the second character;
respectively acquiring a first relative position of the common stroke sequence in the first character and a second relative position of the common stroke sequence in the second character;
and determining whether the first character and the second character are similar characters or not according to the first relative position and the second relative position.
2. The method of claim 1, wherein the obtaining a first relative position of the common stroke sequence in the first character and a second relative position in the second character comprises:
respectively calculating a first position characteristic value corresponding to each stroke in the common stroke sequence in the first character;
respectively calculating a second position characteristic value corresponding to each stroke in the common stroke sequence in the second character;
the determining whether the first character and the second character are similar characters according to the first relative position and the second relative position comprises:
and determining whether the first character and the second character are similar characters or not according to the first position characteristic value and the second position characteristic value.
3. The method for determining a near-word shape according to claim 2, wherein said separately calculating a corresponding first position feature value of each stroke in the common stroke sequence in the first character comprises:
acquiring the endpoint coordinates of each stroke in the common stroke sequence in the first character;
calculating the first position characteristic value according to the endpoint coordinates in the first character;
the calculating a second position characteristic value corresponding to each stroke in the common stroke sequence in the second character respectively comprises:
acquiring the endpoint coordinates of each stroke in the common stroke sequence in the second character;
and calculating the second position characteristic value according to the endpoint coordinates in the second character.
4. The method according to claim 3, wherein the determining whether the first character and the second character are similar characters according to the first position characteristic value and the second position characteristic value comprises:
acquiring a comparison relation between the first position characteristic value and the second position characteristic value;
and determining whether the first character and the second character are similar characters or not according to the comparison relationship.
5. The method according to claim 4, wherein the comparison relationship comprises: a first comparison relationship and/or a second comparison relationship;
the first comparison relationship is a comparison relationship between a first included angle and a second included angle, the first position characteristic value comprises the first included angle and the second position characteristic value comprises the second included angle, and the first included angle and the second included angle are obtained in the following manner:
according to the endpoint coordinates in the first character, starting from the second stroke in the common stroke sequence in the first character, sequentially calculating a first included angle between a connecting line of the first endpoint of each stroke and the last endpoint of the previous stroke and a horizontal line;
according to the endpoint coordinates in the second character, starting from the second stroke in the common stroke sequence in the second character, sequentially calculating a second included angle between a connecting line of the first endpoint of each stroke and the last endpoint of the previous stroke and a horizontal line;
the second comparison relationship is a comparison relationship between a third included angle and a fourth included angle; the first position characteristic value comprises the third included angle and the second position characteristic value comprises the fourth included angle, and the third included angle and the fourth included angle are obtained in the following manner:
according to the endpoint coordinates in the first character, starting from the second stroke in the common stroke sequence in the first character, sequentially calculating a third included angle between the first line of each stroke and the last line of the previous stroke;
and according to the endpoint coordinates in the second character, starting from the second stroke in the common stroke sequence in the second character, sequentially calculating a fourth included angle between the first line of each stroke and the last line of the previous stroke.
6. The method according to claim 5, wherein the comparison relationship comprises the first comparison relationship and the second comparison relationship, and the determining whether the first character and the second character are similar characters according to the comparison relationship comprises:
obtaining n-1 first difference values according to the first comparison relationship; wherein the n-1 first difference values are respectively the difference values of the first included angle and the second included angle corresponding to each stroke, calculated in sequence from the second stroke in the common stroke sequence, and n is the number of strokes included in the common stroke sequence;
obtaining n-1 second difference values according to the second comparison relationship; wherein the n-1 second difference values are respectively the difference values of the third included angle and the fourth included angle corresponding to each stroke, calculated in sequence from the second stroke in the common stroke sequence;
and determining whether the first character and the second character are similar characters or not according to the n-1 first difference values and the n-1 second difference values.
7. The method for determining a shape near word according to claim 6, wherein the determining whether the first character and the second character are shape near words according to the n-1 first difference values and the n-1 second difference values comprises:
and if the n-1 first difference values and the n-1 second difference values are smaller than preset difference values, determining that the first character and the second character are similar characters.
8. A font near word determination apparatus, comprising:
the first acquisition module is used for acquiring the stroke similarity of the first character and the second character;
the extraction module is used for extracting a common stroke sequence of the first character and the second character if the stroke similarity is greater than a preset similarity; wherein the common sequence of strokes includes a same number of strokes of the first character and the second character, the number of strokes being consecutive in both the first character and the second character;
a second obtaining module, configured to obtain a first relative position of the common stroke sequence in the first character and a second relative position of the common stroke sequence in the second character, respectively;
and the determining module is used for determining whether the first character and the second character are similar characters or not according to the first relative position and the second relative position.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of determining a near word according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method for determining a near-word according to any one of claims 1 to 7.
CN201911412330.0A 2019-12-31 2019-12-31 Shape-near-word determining method, electronic device, and computer-readable storage medium Active CN111222590B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911412330.0A CN111222590B (en) 2019-12-31 2019-12-31 Shape-near-word determining method, electronic device, and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN111222590A (en) 2020-06-02
CN111222590B (en) 2024-04-12

Family

ID=70808214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911412330.0A Active CN111222590B (en) 2019-12-31 2019-12-31 Shape-near-word determining method, electronic device, and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN111222590B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766236A (en) * 2021-03-10 2021-05-07 拉扎斯网络科技(上海)有限公司 Text generation method and device, computer equipment and computer readable storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS59128680A (en) * 1983-01-14 1984-07-24 Hitachi Ltd On-line recognizing system of handwritten character
CN1488120A (en) * 2001-01-15 2004-04-07 �¿���˹�����ɷ����޹�˾ Method, device and computer program for recognition of a handwritten character
JP2004185264A (en) * 2002-12-03 2004-07-02 Canon Inc Character recognition method
US20070040707A1 (en) * 2005-08-16 2007-02-22 Lai Jenny H Separation of Components and Characters in Chinese Text Input
CN103810506A (en) * 2014-01-03 2014-05-21 南京师范大学 Method for identifying strokes of handwritten Chinese characters
CN103927535A (en) * 2014-05-08 2014-07-16 北京汉仪科印信息技术有限公司 Recognition method and device for Chinese character writing
CN104461337A (en) * 2013-09-24 2015-03-25 中央研究院 Method for improving handwriting input efficiency
WO2015139497A1 (en) * 2014-03-19 2015-09-24 北京奇虎科技有限公司 Method and apparatus for determining similar characters in search engine
CN105608462A (en) * 2015-12-10 2016-05-25 小米科技有限责任公司 Character similarity judgment method and device
CN106598920A (en) * 2016-11-28 2017-04-26 昆明理工大学 Similar Chinese character classification method combining stroke codes with Chinese character dot matrixes
CN110097002A (en) * 2019-04-30 2019-08-06 北京达佳互联信息技术有限公司 Nearly word form determines method, apparatus, computer equipment and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS59128680A (en) * 1983-01-14 1984-07-24 Hitachi Ltd On-line recognizing system of handwritten character
CN1488120A (en) * 2001-01-15 2004-04-07 �¿���˹�����ɷ����޹�˾ Method, device and computer program for recognition of a handwritten character
JP2004185264A (en) * 2002-12-03 2004-07-02 Canon Inc Character recognition method
US20070040707A1 (en) * 2005-08-16 2007-02-22 Lai Jenny H Separation of Components and Characters in Chinese Text Input
CN104461337A (en) * 2013-09-24 2015-03-25 中央研究院 Method for improving handwriting input efficiency
CN103810506A (en) * 2014-01-03 2014-05-21 南京师范大学 Method for identifying strokes of handwritten Chinese characters
WO2015139497A1 (en) * 2014-03-19 2015-09-24 北京奇虎科技有限公司 Method and apparatus for determining similar characters in search engine
CN103927535A (en) * 2014-05-08 2014-07-16 北京汉仪科印信息技术有限公司 Recognition method and device for Chinese character writing
CN105608462A (en) * 2015-12-10 2016-05-25 小米科技有限责任公司 Character similarity judgment method and device
CN106598920A (en) * 2016-11-28 2017-04-26 昆明理工大学 Similar Chinese character classification method combining stroke codes with Chinese character dot matrixes
CN110097002A (en) * 2019-04-30 2019-08-06 北京达佳互联信息技术有限公司 Nearly word form determines method, apparatus, computer equipment and storage medium

Non-Patent Citations (1)

Title
王晓; 吕肖庆; 汤帜: "Chinese character font recognition based on stroke-end shape similarity" (基于笔端形状相似性的汉字字体识别) *

Cited By (1)

Publication number Priority date Publication date Assignee Title
CN112766236A (en) * 2021-03-10 2021-05-07 拉扎斯网络科技(上海)有限公司 Text generation method and device, computer equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN111222590B (en) 2024-04-12

Similar Documents

Publication Publication Date Title
CN107688789B (en) Document chart extraction method, electronic device and computer readable storage medium
CN108170806B (en) Sensitive word detection and filtering method and device and computer equipment
CN110033515B (en) Graph conversion method, graph conversion device, computer equipment and storage medium
CN104579360A (en) Method and equipment for data processing
CN115331213A (en) Character recognition method, chip, electronic device, and storage medium
JP6460926B2 (en) System and method for searching for an object in a captured image
CN111222590A (en) Font-near word determining method, electronic device and computer-readable storage medium
JP5551660B2 (en) Computer-implemented method for encoding text into matrix code symbols, computer-implemented method for decoding matrix code symbols, encoder for encoding text into matrix code symbols, and decoder for decoding matrix code symbols
JP6569734B2 (en) Image processing apparatus, image processing method, and program
CN111275049A (en) Method and device for acquiring character image skeleton feature descriptors
CN113963197A (en) Image recognition method and device, electronic equipment and readable storage medium
CN116739022A (en) Decoding method and device for bar code image and electronic equipment
CN109101973B (en) Character recognition method, electronic device and storage medium
US20130332824A1 (en) Embedded font processing method and device
CN115563058A (en) Similar case retrieval method based on element extraction
CN112699634B (en) Typesetting processing method of electronic book, electronic equipment and storage medium
CN110414496B (en) Similar word recognition method and device, computer equipment and storage medium
CN109492068B (en) Method and device for positioning object in predetermined area and electronic equipment
CN114708580A (en) Text recognition method, model training method, device, apparatus, storage medium, and program
CN110826488B (en) Image identification method and device for electronic document and storage equipment
CN109409370B (en) Remote desktop character recognition method and device
CN113780265B (en) Space recognition method and device for English words, storage medium and computer equipment
CN109710607A (en) A kind of hash query method solved based on weight towards higher-dimension big data
JP7105500B2 (en) Computer-implemented Automatic Acquisition Method for Element Nouns in Chinese Patent Documents for Patent Documents Without Intercharacter Spaces
CN114283420A (en) Shape and proximity character distinguishing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant