CN111222590B - Shape-near-word determining method, electronic device, and computer-readable storage medium - Google Patents

Info

Publication number
CN111222590B
Authority
CN
China
Prior art keywords
character
stroke
sequence
relative position
common
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911412330.0A
Other languages
Chinese (zh)
Other versions
CN111222590A (en)
Inventor
高岩峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Original Assignee
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Migu Cultural Technology Co Ltd and China Mobile Communications Group Co Ltd
Priority to CN201911412330.0A
Publication of CN111222590A
Application granted
Publication of CN111222590B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/24 Character recognition characterised by the processing or recognition method
    • G06V30/242 Division of the character sequences into groups prior to recognition; Selection of dictionaries
    • G06V30/244 Division of the character sequences into groups prior to recognition; Selection of dictionaries using graphical properties, e.g. alphabet type or font
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Character Discrimination (AREA)

Abstract

The embodiments of the invention relate to the technical field of natural language processing and disclose a shape-near-word determining method, an electronic device, and a computer-readable storage medium. The method comprises: acquiring the stroke similarity of a first character and a second character; if the stroke similarity is greater than a preset similarity, extracting a common stroke sequence of the first character and the second character, where the common stroke sequence comprises a plurality of strokes that are the same in the first character and the second character and that are consecutive in both characters; acquiring a first relative position of the common stroke sequence in the first character and a second relative position of the common stroke sequence in the second character; and determining, according to the first relative position and the second relative position, whether the first character and the second character are shape-near words (characters of similar shape). The accuracy of the shape-near-word judgment result can thereby be improved.

Description

Shape-near-word determining method, electronic device, and computer-readable storage medium
Technical Field
The embodiments of the present invention relate to the technical field of natural language processing, and in particular to a shape-near-word determining method, an electronic device, and a computer-readable storage medium.
Background
With the development of network technology, shape-near-word recognition is required in many scenarios, for example identifying variant characters in web comments, recognizing user handwriting in text scenarios, and recognizing text in images. In the related art, shape-near words are recognized as follows: the glyph-based input method code of each Chinese character in a Chinese character set is obtained; the coding distance between each Chinese character and every other Chinese character in the set is computed from those codes; and whether each Chinese character is a shape-near word of another character in the set is judged according to the coding distance, yielding a shape-near-word judgment result.
However, the inventors found at least the following problem in the related art: characters that are not shape-near words may be identified as shape-near words. For example, the stroke code of "" is 25112112 and the stroke code of "国" ("country") is 25112141. Only the last two digits differ, so the pair satisfies the threshold and is recognized as shape-near words, although the two characters are in fact not shape-near words. Such cases are fairly common, so the related art suffers from inaccurate shape-near-word judgment results.
Disclosure of Invention
An object of the embodiments of the present invention is to provide a shape-near-word determining method, an electronic device, and a computer-readable storage medium that improve the accuracy of the shape-near-word judgment result.
To solve the above technical problem, an embodiment of the present invention provides a shape-near-word determining method comprising: acquiring the stroke similarity of a first character and a second character; if the stroke similarity is greater than a preset similarity, extracting a common stroke sequence of the first character and the second character, where the common stroke sequence comprises a plurality of strokes that are the same in the first character and the second character and that are consecutive in both characters; acquiring a first relative position of the common stroke sequence in the first character and a second relative position of the common stroke sequence in the second character; and determining whether the first character and the second character are shape-near words according to the first relative position and the second relative position.
An embodiment of the present invention also provides a shape-near-word determining device comprising: a first acquisition module for acquiring the stroke similarity of a first character and a second character; an extraction module for extracting a common stroke sequence of the first character and the second character if the stroke similarity is greater than a preset similarity, where the common stroke sequence comprises a plurality of strokes that are the same in the first character and the second character and that are consecutive in both characters; a second acquisition module for acquiring a first relative position of the common stroke sequence in the first character and a second relative position of the common stroke sequence in the second character; and a determining module for determining whether the first character and the second character are shape-near words according to the first relative position and the second relative position.
An embodiment of the present invention also provides an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor, where the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the shape-near-word determining method described above.
An embodiment of the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the shape-near-word determining method described above.
Compared with the prior art, the embodiments of the invention acquire the stroke similarity of the first character and the second character; if the stroke similarity is greater than the preset similarity, extract a common stroke sequence of the first character and the second character, where the common stroke sequence comprises a plurality of strokes that are the same in both characters and consecutive in both characters; acquire a first relative position of the common stroke sequence in the first character and a second relative position of the common stroke sequence in the second character; and determine whether the first character and the second character are shape-near words according to the first relative position and the second relative position. A stroke similarity greater than the preset similarity indicates that the two characters are more likely to be shape-near words. Because the common stroke sequence is extracted only after the stroke similarity has been found to exceed the preset similarity, the extracted sequence is guaranteed to contain a large number of strokes, so the first relative position and the second relative position of that sequence can reflect the structural difference between the first character and the second character. By combining the stroke similarity with the structural difference of the common stroke sequence in the two characters, the embodiments avoid the misjudgments that arise when shape-near words are determined from stroke similarity alone, and thus improve the accuracy of shape-near-word judgment.
In addition, acquiring the first relative position of the common stroke sequence in the first character and the second relative position of the common stroke sequence in the second character includes: calculating a first position feature value corresponding to each stroke of the common stroke sequence in the first character; and calculating a second position feature value corresponding to each stroke of the common stroke sequence in the second character. Determining whether the first character and the second character are shape-near words according to the first relative position and the second relative position includes: determining whether the first character and the second character are shape-near words according to the first position feature values and the second position feature values. That is, each stroke in the common stroke sequence corresponds to a first position feature value in the first character and a second position feature value in the second character. Combining the first and second position feature values of every stroke in the common stroke sequence helps further improve the accuracy of shape-near-word judgment.
In addition, calculating the first position feature value corresponding to each stroke of the common stroke sequence in the first character includes: acquiring the endpoint coordinates of each stroke of the common stroke sequence in the first character; and calculating the first position feature value from the endpoint coordinates in the first character. Calculating the second position feature value corresponding to each stroke of the common stroke sequence in the second character includes: acquiring the endpoint coordinates of each stroke of the common stroke sequence in the second character; and calculating the second position feature value from the endpoint coordinates in the second character. This gives a concrete way of calculating the two position feature values: obtaining the endpoint coordinates of every stroke of the common stroke sequence in each character allows the feature values to be calculated conveniently and accurately, which further improves the accuracy of the shape-near-word judgment result.
In addition, determining whether the first character and the second character are shape-near words according to the n-1 first differences and the n-1 second differences includes: if the n-1 first differences and the n-1 second differences are all smaller than a preset difference, determining that the first character and the second character are shape-near words. When all n-1 first differences and all n-1 second differences are smaller than the preset difference, the structural difference of every stroke, starting from the second stroke of the common stroke sequence, is small between the first character and the second character, so the two characters are very likely shape-near words. Determining that the characters are shape-near words only under this condition further improves the accuracy of shape-near-word judgment.
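The decision rule in the preceding paragraph can be sketched as follows. This is only an illustration under stated assumptions: this passage does not define how the differences are computed, so the sketch assumes each difference is the absolute difference between corresponding per-stroke feature values of the two characters. The function names and the sample numbers are hypothetical.

```python
def feature_differences(values_a, values_b):
    """Assumed definition: element-wise absolute differences between the
    per-stroke feature values of the first and second character."""
    if len(values_a) != len(values_b):
        raise ValueError("both characters must supply n-1 feature values")
    return [abs(a - b) for a, b in zip(values_a, values_b)]


def all_below_preset(first_diffs, second_diffs, preset_difference):
    """True only if every first and every second difference is smaller
    than the preset difference, as the paragraph above requires."""
    return (all(d < preset_difference for d in first_diffs)
            and all(d < preset_difference for d in second_diffs))


# Made-up feature values for a common stroke sequence of n = 3 strokes.
first_diffs = feature_differences([45.0, 90.0], [50.0, 85.0])
second_diffs = feature_differences([30.0, 60.0], [30.0, 100.0])
print(all_below_preset(first_diffs, second_diffs, preset_difference=10.0))
```

With a stricter preset difference a single outlying stroke is enough to reject the pair, which is the behaviour the paragraph describes.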
Drawings
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings.
Fig. 1 is a flowchart of a shape-near-word determining method according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a shape-near-word determining method according to a second embodiment of the present invention;
Fig. 3 is a schematic diagram of a shape-near-word determining device according to a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will understand that numerous technical details are set forth in the embodiments to help the reader better understand the present application; the technical solutions claimed in the present application can nevertheless be implemented without these details and with various changes and modifications based on the following embodiments. The embodiments are divided for convenience of description only; the division should not be construed as limiting the specific implementation of the present invention, and the embodiments may be combined with and refer to one another where no contradiction arises.
A first embodiment of the present invention relates to a shape-near-word determining method applied to an electronic device, which may be a terminal or a server; this embodiment does not specifically limit it. Implementation details of the shape-near-word determining method of this embodiment are described below. They are provided only to aid understanding and are not required to implement this embodiment.
Fig. 1 shows a flowchart of the shape-near-word determining method of this embodiment, which specifically includes:
Step 101: The stroke similarity of the first character and the second character is acquired.
The first character and the second character may be Chinese characters. In one example, the first character may be a character for which shape-near words are currently being searched, and the second character may be any character in a pre-stored character information base. Specifically, the stroke similarity of the first character and the second character may be obtained as follows:
First, the stroke codes of the first character and the second character are queried. The stroke codes can be looked up in a preset Chinese character stroke code table, which records the strokes of each Chinese character in stroke order. For example, the basic strokes may be the five basic strokes specified in the modern Chinese general character table: horizontal, vertical, left-falling, dot, and fold. The stroke order is the order in which the one or more basic strokes making up a character are written. Each basic stroke may correspond to one stroke identifier, so the stroke order of a character can be represented by a combination of stroke identifiers. The stroke identifier of each basic stroke may be a stroke code; for example, the stroke codes of the horizontal, vertical, left-falling, dot, and fold strokes may be the digits 1, 2, 3, 4, and 5. For example, the first character is "昆" ("kun"), whose stroke code is found to be "25111535", and the second character is "", whose stroke code is found to be "25111525".
Next, the edit distance d between the stroke code of the first character and the stroke code of the second character is calculated. The edit distance may be calculated by counting the operations, such as deletion, insertion, and replacement, required to convert the stroke code of the first character into the stroke code of the second character. For example, the minimum number of such operations may be used as the edit distance.
Then, the similarity between the first character and the second character may be calculated from the edit distance d, the length La of the first character, and the length Lb of the second character. La and Lb may be the numbers of strokes of the first character and the second character, or the numbers of digits in their stroke codes. For example, the first character "昆" ("kun") has length 8 and the second character "" has length 8. In one example, the similarity s may be calculated by the following formula:
L = min(La, Lb)
s = (L - d) / L
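The similarity calculation above can be sketched in a few lines. This is a minimal illustration, assuming stroke codes are plain digit strings and using the standard Levenshtein edit distance with unit costs; the function names are hypothetical and not taken from the patent.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of deletions, insertions and replacements needed
    to turn stroke code a into stroke code b (Levenshtein distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace or keep
        prev = curr
    return prev[-1]


def stroke_similarity(code_a: str, code_b: str) -> float:
    """s = (L - d) / L with L = min(La, Lb), as in the formula above."""
    if not code_a or not code_b:
        raise ValueError("stroke codes must be non-empty")
    L = min(len(code_a), len(code_b))
    d = edit_distance(code_a, code_b)
    return (L - d) / L


# Stroke codes quoted in the text: one digit differs, so d = 1 and L = 8.
print(stroke_similarity("25111535", "25111525"))  # 0.875
```

For the pair quoted in the Background section (25112112 and 25112141) the same function gives d = 2 and s = 0.75, which shows why a similarity threshold alone can accept characters that are not shape-near words.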
Step 102: It is determined whether the stroke similarity is greater than the preset similarity; if yes, go to step 103, otherwise go to step 106.
The preset similarity may be set according to actual needs, which is not specifically limited in this embodiment.
Step 103: a common stroke sequence of the first character and the second character is extracted.
The common stroke sequence includes a plurality of strokes that are the same in the first character and the second character, and these strokes are consecutive in both the first character and the second character. Specifically, the strokes of the first character and of the second character can each be traversed to select the common stroke sequence of the two characters.
In one example, the common stroke sequence may be extracted by a longest common subsequence (LCS) algorithm, with the extracted common stroke sequence consisting of several consecutive strokes. The common stroke sequence extracted in this way is the longest common stroke sequence between the first character and the second character. For example, if the stroke sequence of the first character is ABCDE and that of the second character is ABCFE, the extracted common stroke sequence is ABC. As another example, if the stroke sequence of the first character is ABCDE and that of the second character is FEABC, the extracted common stroke sequence is again ABC. That is, the common stroke sequence is one continuous whole and may appear at any position in the first character and the second character. This embodiment takes extraction by the LCS algorithm only as an example; the way the common stroke sequence is extracted in a specific implementation is not limited to it.
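The extraction step can be sketched as below. Because the text requires the common strokes to be consecutive in both characters, this sketch computes the longest contiguous common run (what is often called the longest common substring) with dynamic programming. That reading, the function name, and the tie-breaking rule (the first longest run found in the first character wins) are assumptions made for illustration.

```python
def longest_common_run(a: str, b: str) -> str:
    """Longest run of strokes that appears contiguously in both a and b."""
    best_len = 0
    best_end = 0  # exclusive end index of the best run inside a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


print(longest_common_run("ABCDE", "ABCFE"))  # ABC
print(longest_common_run("ABCDE", "FEABC"))  # ABC
```

Both calls reproduce the examples in the text: the run ABC is found whether it sits at the start of both sequences or at different positions in each.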
Step 104: a first relative position of the common stroke sequence in the first character and a second relative position in the second character are respectively acquired.
Specifically, each stroke of a character has a relative position. For example, the character may be regarded as written inside a 田-shaped ("field") writing grid, in which every stroke of the character has a corresponding position. The first relative position of the common stroke sequence in the first character may then be understood as the position that the common stroke sequence of the first character occupies in the grid, such as the upper left, lower left, upper right, lower right, or middle of the grid. The second relative position is defined in the same way and is not described again here.
In one example, the first relative position of the common stroke sequence in the first character may include the position of each stroke of the common stroke sequence in the first character relative to the previous stroke. The second relative position of the common stroke sequence in the second character may include the position of each stroke of the common stroke sequence in the second character relative to the previous stroke. The position of each stroke relative to the previous stroke may be: upper, lower, left, right, upper left, lower left, upper right, lower right, crossing, and so on.
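One possible way to label the position of a stroke relative to the previous stroke is sketched below. Everything here is an assumption made for illustration: strokes are given as lists of (x, y) points in a unit square with y growing downward, positions are compared through stroke centers, `tol` is a hypothetical tolerance, and the "crossing" case is not handled.

```python
Point = tuple[float, float]


def center(stroke: list[Point]) -> Point:
    """Mean of the stroke's points, used as a rough stroke position."""
    xs = [p[0] for p in stroke]
    ys = [p[1] for p in stroke]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def relative_position(prev_stroke: list[Point], stroke: list[Point],
                      tol: float = 0.05) -> str:
    """Coarse label for where `stroke` lies relative to `prev_stroke`."""
    (px, py), (cx, cy) = center(prev_stroke), center(stroke)
    dx, dy = cx - px, cy - py
    horiz = "right" if dx > tol else "left" if dx < -tol else ""
    vert = "lower" if dy > tol else "upper" if dy < -tol else ""
    if horiz and vert:
        return f"{vert} {horiz}"
    return vert or horiz or "overlapping"


top_bar = [(0.1, 0.2), (0.9, 0.2)]     # made-up horizontal stroke near the top
bottom_bar = [(0.1, 0.8), (0.9, 0.8)]  # made-up horizontal stroke near the bottom
print(relative_position(top_bar, bottom_bar))  # lower
```

Applying the function to each consecutive pair of strokes in the common stroke sequence yields one label sequence per character, and the two label sequences can then be compared.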
Step 105: Whether the first character and the second character are shape-near words is determined according to the first relative position and the second relative position.
For example, the closer the first relative position and the second relative position are, that is, the more alike the structure of the common stroke sequence is in the first character and the second character, the more likely the two characters are shape-near words. As another example, the first character and the second character may be placed in the same grid with their center points coinciding with the center point of the grid, and the overlap rate of the first relative position and the second relative position is obtained; if the overlap rate is high, the first character and the second character are determined to be shape-near words.
Step 106: It is determined that the first character and the second character are not shape-near words.
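The control flow of steps 101 to 106 can be summarized in a short sketch. It fixes only the order of the steps; the four callables are placeholders to be supplied by the caller, and their names and signatures are hypothetical rather than taken from the patent.

```python
from typing import Any, Callable


def is_shape_near(
    code_a: str,
    code_b: str,
    *,
    similarity: Callable[[str, str], float],
    common_sequence: Callable[[str, str], str],
    relative_position: Callable[[str, str], Any],
    positions_match: Callable[[Any, Any], bool],
    preset_similarity: float,
) -> bool:
    # Steps 101-102: compare stroke similarity with the preset similarity.
    if similarity(code_a, code_b) <= preset_similarity:
        return False  # step 106
    # Step 103: extract the common stroke sequence.
    common = common_sequence(code_a, code_b)
    # Step 104: relative position of that sequence in each character.
    pos_a = relative_position(common, code_a)
    pos_b = relative_position(common, code_b)
    # Step 105: decide from the two relative positions.
    return positions_match(pos_a, pos_b)
```

A caller would plug in its own similarity measure, extraction routine, and position comparison. For a quick smoke test, stand-ins such as `lambda seq, code: code.index(seq)` for the position and equality for the comparison are enough to exercise both branches.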
The examples given in this embodiment are provided only for ease of understanding and do not limit the technical solution of the present invention.
Compared with the prior art, in this embodiment a stroke similarity greater than the preset similarity indicates that the first character and the second character are more likely to be shape-near words. Because the common stroke sequence is extracted only after the stroke similarity has been found to exceed the preset similarity, the extracted sequence is guaranteed to contain a large number of strokes, so the first relative position of the common stroke sequence in the first character and its second relative position in the second character can reflect the structural difference between the two characters. By combining the stroke similarity with the structural difference of the common stroke sequence in the two characters, this embodiment avoids the misjudgments that arise when shape-near words are determined from stroke similarity alone, and thus improves the accuracy of shape-near-word judgment.
A second embodiment of the present invention relates to a shape-near-word determining method. Implementation details of the shape-near-word determining method of this embodiment are described below. They are provided only to aid understanding and are not required to implement this embodiment.
Fig. 2 shows a flowchart of the shape-near-word determining method of this embodiment, which specifically includes:
Step 201: The stroke similarity of the first character and the second character is acquired.
Step 202: It is determined whether the stroke similarity is greater than the preset similarity; if yes, go to step 203, otherwise go to step 207.
Step 203: a common stroke sequence of the first character and the second character is extracted.
Steps 201 to 203 are substantially the same as steps 101 to 103 in the first embodiment, and are not repeated here.
Step 204: A first position feature value corresponding to each stroke of the common stroke sequence in the first character is calculated.
Specifically, the endpoint coordinates of each stroke of the common stroke sequence in the first character may be obtained first, for example by querying a preset Chinese character stroke endpoint coordinate code table. This table can be prepared in advance according to actual needs and is used to look up the endpoint coordinates of every stroke of a Chinese character. The endpoint coordinates may include a start point coordinate, an end point coordinate, and, when the stroke has turning points, the turning point coordinates.
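One possible in-memory shape for such an endpoint table is sketched below. The layout, the names `StrokeEndpoints` and `ENDPOINT_TABLE`, and every coordinate value are made-up placeholders for illustration; the text does not specify how the table is encoded.

```python
from dataclasses import dataclass

Point = tuple[float, float]


@dataclass(frozen=True)
class StrokeEndpoints:
    """Endpoint coordinates of one stroke: start, end, optional turns."""
    start: Point
    end: Point
    turning_points: tuple[Point, ...] = ()

    def points(self) -> list[Point]:
        """All endpoints in writing order: start, turning points, end."""
        return [self.start, *self.turning_points, self.end]


# Hypothetical entry: the three strokes of "口" in a unit square with
# invented coordinates (vertical, fold with one turning point, horizontal).
ENDPOINT_TABLE: dict[str, list[StrokeEndpoints]] = {
    "口": [
        StrokeEndpoints(start=(0.2, 0.2), end=(0.2, 0.8)),
        StrokeEndpoints(start=(0.2, 0.2), end=(0.8, 0.8),
                        turning_points=((0.8, 0.2),)),
        StrokeEndpoints(start=(0.2, 0.8), end=(0.8, 0.8)),
    ],
}

print(ENDPOINT_TABLE["口"][1].points())
```

Keeping the turning points separate from the start and end makes it easy to recover the first and last line segment of a stroke, which the angle features described next rely on.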
Further, the first position feature value may be calculated from the endpoint coordinates in the first character.
In one example, the first position feature value may include a first included angle, obtained as follows: according to the endpoint coordinates in the first character, and starting from the second stroke of the common stroke sequence in the first character, the first included angle between the horizontal line and the line connecting the first endpoint of each stroke to the last endpoint of the previous stroke is calculated in turn. In other words, starting from the second stroke of the common stroke sequence, each stroke corresponds to one first included angle. In a specific implementation, the first included angles of the strokes may form a first included angle sequence, denoted (ad2, ad3, ad4, …, adn), where ad2 is the first included angle of the second stroke in the common stroke sequence, ad3 and ad4 are defined likewise, n is the number of strokes in the common stroke sequence, and adn is the first included angle of the last stroke in the common stroke sequence.
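The first included angle sequence can be computed from endpoint coordinates as sketched below. The sketch assumes each stroke is an ordered list of (x, y) points (start, turning points, end) and reports a signed angle in degrees from `math.atan2`. The text does not fix the angle range or sign convention, so both are assumptions, as is the function name.

```python
import math

Point = tuple[float, float]
Stroke = list[Point]  # ordered points: start, turning points, end


def connection_angles(strokes: list[Stroke]) -> list[float]:
    """For each stroke from the second one on, the angle (in degrees)
    between the horizontal and the line from the previous stroke's last
    endpoint to this stroke's first endpoint: (ad2, ..., adn)."""
    angles = []
    for prev, cur in zip(strokes, strokes[1:]):
        (x0, y0), (x1, y1) = prev[-1], cur[0]
        angles.append(math.degrees(math.atan2(y1 - y0, x1 - x0)))
    return angles


# Made-up coordinates for a common stroke sequence of three strokes.
strokes = [[(0, 0), (1, 0)], [(2, 1), (2, 3)], [(2, 4), (3, 4)]]
print(connection_angles(strokes))
```

A sequence of n strokes yields n-1 angles, matching the indexing from ad2 to adn in the text.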
In one example, the first position feature value may include a third included angle, obtained as follows: according to the endpoint coordinates in the first character, and starting from the second stroke of the common stroke sequence in the first character, the third included angle between the first line segment of each stroke and the last line segment of the previous stroke is calculated in turn. In other words, starting from the second stroke of the common stroke sequence, each stroke corresponds to one third included angle. In a specific implementation, the third included angles of the strokes may form a third included angle sequence, denoted (ax2, ax3, ax4, …, axn), where ax2 is the third included angle of the second stroke in the common stroke sequence, ax3 and ax4 are defined likewise, n is the number of strokes in the common stroke sequence, and axn is the third included angle of the last stroke in the common stroke sequence.
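The third included angle sequence can be sketched in the same style. The sketch assumes each stroke is an ordered list of at least two (x, y) points, takes the "first line" and "last line" of a stroke to be its first and last line segments, and reports an unsigned angle between 0 and 180 degrees. These conventions and the function names are assumptions, not details given in the text.

```python
import math

Point = tuple[float, float]
Stroke = list[Point]  # ordered points: start, turning points, end


def vector_angle(u: Point, v: Point) -> float:
    """Unsigned angle in degrees between direction vectors u and v."""
    dot = u[0] * v[0] + u[1] * v[1]
    cross = u[0] * v[1] - u[1] * v[0]
    return math.degrees(math.atan2(abs(cross), dot))


def segment_angles(strokes: list[Stroke]) -> list[float]:
    """For each stroke from the second one on, the angle between its first
    segment and the previous stroke's last segment: (ax2, ..., axn)."""
    angles = []
    for prev, cur in zip(strokes, strokes[1:]):
        last_seg = (prev[-1][0] - prev[-2][0], prev[-1][1] - prev[-2][1])
        first_seg = (cur[1][0] - cur[0][0], cur[1][1] - cur[0][1])
        angles.append(vector_angle(last_seg, first_seg))
    return angles


# Made-up coordinates: a horizontal stroke, a fold stroke, a horizontal stroke.
strokes = [[(0, 0), (1, 0)], [(0, 0), (1, 1), (1, 2)], [(1, 2), (2, 2)]]
print(segment_angles(strokes))
```

Pairing each value with the corresponding first included angle gives the per-stroke two-tuples that a combined feature would use.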
In one example, the first position feature value may include both the first included angle and the third included angle described above. That is, starting from the second stroke of the common stroke sequence, each stroke corresponds to a two-tuple consisting of a first included angle and a third included angle, and the first and third included angle sequences together form a sequence of two-tuples, denoted (ad2, ax2), (ad3, ax3), (ad4, ax4), …, (adn, axn).
It should be noted that, in a specific implementation, the first position feature value may also include other angles; this embodiment takes the above two types of included angle only as examples, and the specific implementation is not limited to them.
Step 205: A second position feature value corresponding to each stroke of the common stroke sequence in the second character is calculated.
Specifically, the endpoint coordinates of each stroke of the common stroke sequence in the second character may be obtained first, for example by querying the preset Chinese character stroke endpoint coordinate code table.
Further, a second location feature value may be calculated based on the endpoint coordinates in the second character.
In one example, the second position feature value may include a second included angle, and the second included angle may be obtained by: and calculating a second included angle between a connecting line of a first endpoint of each stroke and a last endpoint of the previous stroke and a horizontal line in sequence from a second stroke in a common stroke sequence in the second character according to the endpoint coordinates in the second character. I.e. it can be understood that each stroke corresponds to a second included angle starting from the second stroke in the sequence of common strokes. In a specific implementation, the second included angle corresponding to each stroke may form a second included angle sequence, denoted as (bd 2, bd3, bd4 … … bdn); where bd2 represents a second included angle corresponding to a second stroke in the common stroke sequence, bd3, bd4 are similar, and n represents the number of strokes in the common stroke sequence, bdn is a second included angle corresponding to a last stroke in the common stroke sequence.
In one example, the second position feature value may include a fourth included angle, which may be obtained as follows: according to the endpoint coordinates in the second character, and starting from the second stroke of the common stroke sequence in the second character, sequentially calculate the fourth included angle between the first line of each stroke and the last line of the previous stroke. That is, starting from the second stroke in the common stroke sequence, each stroke corresponds to one fourth included angle. In a specific implementation, the fourth included angles corresponding to the strokes may form a fourth included angle sequence, denoted (bx2, bx3, bx4 … bxn), where bx2 is the fourth included angle corresponding to the second stroke in the common stroke sequence, bx3 and bx4 are defined similarly, n is the number of strokes in the common stroke sequence, and bxn is the fourth included angle corresponding to the last stroke in the common stroke sequence.
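The angle between the first line of a stroke and the last line of the previous stroke could be computed along the following lines. This sketch assumes that each stroke is stored as at least two ordered endpoints, that the "first line" is the segment between the first two endpoints and the "last line" the segment between the last two, and that the angle is reported in degrees in [0, 180]; the name `turn_angles` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _direction(p: Point, q: Point) -> float:
    """Direction of the segment p -> q in radians."""
    return math.atan2(q[1] - p[1], q[0] - p[0])


def turn_angles(strokes: Sequence[Sequence[Point]]) -> List[float]:
    """For strokes 2..n, return the angle between the first segment of the
    stroke and the last segment of the previous stroke, in degrees [0, 180]."""
    if any(len(stroke) < 2 for stroke in strokes):
        raise ValueError("each stroke needs at least two endpoints")
    angles = []
    for prev, cur in zip(strokes, strokes[1:]):
        last_of_prev = _direction(prev[-2], prev[-1])
        first_of_cur = _direction(cur[0], cur[1])
        diff = abs(math.degrees(first_of_cur - last_of_prev)) % 360.0
        angles.append(min(diff, 360.0 - diff))  # take the smaller of the two angles
    return angles
```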
In one example, the second position feature value may include both the second included angle and the fourth included angle. That is, starting from the second stroke in the common stroke sequence, each stroke corresponds to a two-tuple comprising a second included angle and a fourth included angle. In other words, the second included angle sequence and the fourth included angle sequence described above may form a sequence of two-tuples, denoted (bd2, bx2), (bd3, bx3), (bd4, bx4) … (bdn, bxn).
It should be noted that, in a specific implementation, the second position feature value may also include other angles; this embodiment merely takes the two types of included angle above as examples, and the specific implementation is not limited thereto.
In addition, the execution order of step 204 and step 205 is not limited to executing step 204 first and then step 205; in a specific implementation, step 205 may be executed first and then step 204, or the two steps may be executed simultaneously.
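Because steps 204 and 205 do not depend on each other, simultaneous execution can be illustrated with a small thread-pool sketch. The helper name `compute_features_concurrently` and the idea of passing the feature calculation in as `feature_fn` are assumptions made for this example only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple, TypeVar

S = TypeVar("S")
F = TypeVar("F")


def compute_features_concurrently(
    feature_fn: Callable[[S], F], strokes_in_first: S, strokes_in_second: S
) -> Tuple[F, F]:
    """Run the same feature calculation for both characters in parallel
    (step 204 on the first character, step 205 on the second)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(feature_fn, strokes_in_first)
        second = pool.submit(feature_fn, strokes_in_second)
        return first.result(), second.result()
```

The result is the same as running the two steps one after the other in either order, which is why the order is left open.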
Step 206: Determine whether the first character and the second character are shape-near words according to the first position feature value and the second position feature value.
Specifically, a comparison relationship between the first position feature value and the second position feature value may be obtained, and whether the first character and the second character are shape-near words may be determined according to the comparison relationship.
In one example, the first position feature value includes a first included angle and the second position feature value includes a second included angle. The comparison relationship between the first position feature value and the second position feature value may be a first comparison relationship, which may be a difference relationship or a ratio relationship. Taking the difference relationship as an example, n-1 first differences can be obtained according to the first comparison relationship, where the n-1 first differences are the differences between the first included angle and the second included angle corresponding to each stroke, calculated in sequence starting from the second stroke in the common stroke sequence, and n is the number of strokes in the common stroke sequence. That is, starting from the second stroke, each stroke in the common stroke sequence corresponds to one first difference: the first difference for the second stroke is |ad2-bd2|, the first difference for the third stroke is |ad3-bd3|, and the first difference for the nth stroke is |adn-bdn|. The n-1 first differences are expressed as (|ad2-bd2|, |ad3-bd3|, |ad4-bd4| … |adn-bdn|); that is, each first difference may be the absolute value of the difference between the two angles. It will be appreciated that if the first comparison relationship is a ratio relationship, n-1 first ratios, denoted (ad2/bd2, ad3/bd3, ad4/bd4 … adn/bdn), may be obtained according to the first comparison relationship.
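The two forms of the first comparison relationship can be sketched as below. The function names are hypothetical, and the handling of a zero denominator in the ratio form (1.0 when both angles are zero, infinity otherwise) is an assumption added for this sketch, since the text does not address that case.

```python
import math
from typing import List, Sequence


def _check_lengths(first: Sequence[float], second: Sequence[float]) -> None:
    if len(first) != len(second):
        raise ValueError("both angle sequences must have n-1 entries")


def angle_differences(first: Sequence[float], second: Sequence[float]) -> List[float]:
    """(|ad2-bd2|, ..., |adn-bdn|): absolute differences, stroke by stroke."""
    _check_lengths(first, second)
    return [abs(a - b) for a, b in zip(first, second)]


def angle_ratios(first: Sequence[float], second: Sequence[float]) -> List[float]:
    """(ad2/bd2, ..., adn/bdn): ratios, stroke by stroke."""
    _check_lengths(first, second)
    ratios = []
    for a, b in zip(first, second):
        if b == 0:
            # Assumed convention, not taken from the text.
            ratios.append(1.0 if a == 0 else math.inf)
        else:
            ratios.append(a / b)
    return ratios
```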
In a specific implementation, whether the first character and the second character are shape-near words may be determined according to the first comparison relationship, specifically as follows:
In one example, the first comparison relationship is a difference relationship, and if the n-1 first differences are all smaller than a preset difference, it may be determined that the first character and the second character are shape-near words. The preset difference may be set according to actual needs and is not specifically limited in this embodiment. In a specific implementation, if most of the n-1 first differences are smaller than the preset difference, that is, only a few individual first differences are larger than the preset difference, it may also be determined that the first character and the second character are shape-near words. However, this embodiment is not specifically limited thereto.
In another example, the first comparison relationship is a ratio relationship, and if the n-1 first ratios are all greater than a preset ratio, it may be determined that the first character and the second character are shape-near words. The preset ratio may be set according to actual needs and is not specifically limited in this embodiment. In a specific implementation, if most of the n-1 first ratios are greater than the preset ratio, it may also be determined that the first character and the second character are shape-near words. However, this embodiment is not specifically limited thereto.
Optionally, if the first comparison relationship includes both a difference relationship and a ratio relationship, the determination manners in the two examples above may be combined; for example, if the n-1 first ratios are all greater than the preset ratio and the n-1 first differences are all smaller than the preset difference, it may be determined that the first character and the second character are shape-near words. However, the specific implementation is not limited thereto.
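The threshold tests described above might look like the following sketch. The parameter `min_fraction` is a hypothetical way of expressing "all" (1.0) versus "most" (a value below 1.0), because the text leaves "most" undefined; treating an empty list as "not shape-near" is likewise an assumption.

```python
from typing import Sequence


def near_by_differences(
    diffs: Sequence[float], preset_difference: float, min_fraction: float = 1.0
) -> bool:
    """True if at least `min_fraction` of the differences are below the preset difference."""
    if not diffs:
        return False  # assumed: nothing to compare means no positive decision
    within = sum(1 for d in diffs if d < preset_difference)
    return within / len(diffs) >= min_fraction


def near_by_ratios(
    ratios: Sequence[float], preset_ratio: float, min_fraction: float = 1.0
) -> bool:
    """True if at least `min_fraction` of the ratios exceed the preset ratio."""
    if not ratios:
        return False
    above = sum(1 for r in ratios if r > preset_ratio)
    return above / len(ratios) >= min_fraction


def near_by_both(diffs, ratios, preset_difference, preset_ratio) -> bool:
    """Combined manner: every ratio above and every difference below its preset."""
    return near_by_differences(diffs, preset_difference) and near_by_ratios(ratios, preset_ratio)
```

With `min_fraction=1.0` the functions implement the strict "all" variants; lowering it tolerates a few outlying strokes.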
In another example, the first position feature value includes a third included angle and the second position feature value includes a fourth included angle. The comparison relationship between the first position feature value and the second position feature value may be a second comparison relationship, which may also be a difference relationship or a ratio relationship. Taking the difference relationship as an example, n-1 second differences can be obtained according to the second comparison relationship, where the n-1 second differences are the differences between the third included angle and the fourth included angle corresponding to each stroke, calculated in sequence starting from the second stroke in the common stroke sequence. That is, starting from the second stroke, each stroke in the common stroke sequence corresponds to one second difference, and the n-1 second differences are expressed as (|ax2-bx2|, |ax3-bx3|, |ax4-bx4| … |axn-bxn|). It will be appreciated that if the second comparison relationship is a ratio relationship, n-1 second ratios, expressed as (ax2/bx2, ax3/bx3, ax4/bx4 … axn/bxn), can be obtained according to the second comparison relationship.
In a specific implementation, whether the first character and the second character are shape-near words may be determined according to the second comparison relationship, specifically as follows:
In one example, the second comparison relationship is a difference relationship, and if the n-1 second differences are all smaller than the preset difference, it may be determined that the first character and the second character are shape-near words. In a specific implementation, if most of the n-1 second differences are smaller than the preset difference, that is, only a few individual second differences are larger than the preset difference, it may also be determined that the first character and the second character are shape-near words. However, this embodiment is not specifically limited thereto.
In another example, the second comparison relationship is a ratio relationship, and if the n-1 second ratios are all greater than the preset ratio, it may be determined that the first character and the second character are shape-near words. In a specific implementation, if most of the n-1 second ratios are greater than the preset ratio, it may also be determined that the first character and the second character are shape-near words. However, this embodiment is not specifically limited thereto.
Optionally, if the second comparison relationship includes both a difference relationship and a ratio relationship, the determination manners in the two examples above may be combined; for example, if the n-1 second ratios are all greater than the preset ratio and the n-1 second differences are all smaller than the preset difference, it may be determined that the first character and the second character are shape-near words. However, the specific implementation is not limited thereto.
In one example, the first position feature value may include both a first included angle and a third included angle, which may be denoted as a two-tuple (ad, ax), and the second position feature value may include both a second included angle and a fourth included angle, which may be denoted as a two-tuple (bd, bx). The comparison relationship between the first position feature value and the second position feature value may then include both the first comparison relationship and the second comparison relationship. Taking the difference relationship as an example, each stroke, starting from the second stroke in the common stroke sequence, corresponds to one two-tuple of differences, giving n-1 two-tuples of differences in total, expressed as: (|ad2-bd2|, |ax2-bx2|), (|ad3-bd3|, |ax3-bx3|), (|ad4-bd4|, |ax4-bx4|) … (|adn-bdn|, |axn-bxn|). Taking the ratio relationship as an example, the corresponding n-1 two-tuples of ratios are expressed as: (ad2/bd2, ax2/bx2), (ad3/bd3, ax3/bx3), (ad4/bd4, ax4/bx4) … (adn/bdn, axn/bxn).
In a specific implementation, whether the first character and the second character are shape-near words may be determined according to the first comparison relationship and the second comparison relationship, specifically as follows:
In one example, if the first comparison relationship and the second comparison relationship are both difference relationships, n-1 first differences may be obtained according to the first comparison relationship, n-1 second differences may be obtained according to the second comparison relationship, and whether the first character and the second character are shape-near words may be determined according to the n-1 first differences and the n-1 second differences, namely (|ad2-bd2|, |ax2-bx2|), (|ad3-bd3|, |ax3-bx3|), (|ad4-bd4|, |ax4-bx4|) … (|adn-bdn|, |axn-bxn|). If the n-1 first differences and the n-1 second differences are all smaller than the preset difference, it may be determined that the first character and the second character are shape-near words.
In another example, if the first comparison relationship and the second comparison relationship are both ratio relationships, n-1 first ratios may be obtained according to the first comparison relationship, n-1 second ratios may be obtained according to the second comparison relationship, and whether the first character and the second character are shape-near words may be determined according to the n-1 first ratios and the n-1 second ratios, namely (ad2/bd2, ax2/bx2), (ad3/bd3, ax3/bx3), (ad4/bd4, ax4/bx4) … (adn/bdn, axn/bxn). If the n-1 first ratios and the n-1 second ratios are all greater than the preset ratio, it may be determined that the first character and the second character are shape-near words.
Optionally, the first comparison relationship and the second comparison relationship may each include both a difference relationship and a ratio relationship, that is, the determination manners in the two examples above may be combined. If the n-1 first ratios and the n-1 second ratios are all greater than the preset ratio, and the n-1 first differences and the n-1 second differences are all smaller than the preset difference, it may be determined that the first character and the second character are shape-near words. However, the specific implementation is not limited thereto.
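As a rough illustration of the two-tuple difference test, the sketch below pairs the two angles for each stroke and applies a single preset difference to both components. The names `pair_differences` and `is_shape_near` are hypothetical, and returning False when there are no two-tuples (a common stroke sequence with fewer than two strokes) is an assumption.

```python
from typing import List, Sequence, Tuple

AnglePair = Tuple[float, float]  # (ad, ax) for the first character, (bd, bx) for the second


def pair_differences(
    first: Sequence[AnglePair], second: Sequence[AnglePair]
) -> List[AnglePair]:
    """Return the n-1 two-tuples (|ad-bd|, |ax-bx|), one per stroke from the second stroke on."""
    if len(first) != len(second):
        raise ValueError("both characters must supply n-1 angle two-tuples")
    return [(abs(ad - bd), abs(ax - bx)) for (ad, ax), (bd, bx) in zip(first, second)]


def is_shape_near(
    first: Sequence[AnglePair], second: Sequence[AnglePair], preset_difference: float
) -> bool:
    """True if every first difference and every second difference is below the preset."""
    diffs = pair_differences(first, second)
    if not diffs:
        return False  # assumed behaviour when there is nothing to compare
    return all(d1 < preset_difference and d2 < preset_difference for d1, d2 in diffs)
```

For example, with `first = [(90.0, 90.0), (45.0, 30.0)]` and `second = [(85.0, 88.0), (50.0, 33.0)]`, a preset difference of 10 accepts the pair, while a preset difference of 4 rejects it because of the first differences of 5.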
Step 207: Determine that the first character and the second character are not shape-near words.
The above examples in this embodiment are provided only for ease of understanding and do not limit the technical solution of the present invention.
Compared with the prior art, this embodiment combines the first position feature value and the second position feature value corresponding to each stroke in the common stroke sequence to determine whether the first character and the second character are shape-near words. A specific way of calculating the first position feature value and the second position feature value is provided: the endpoint coordinates of the strokes of the common stroke sequence in the first character and in the second character are queried respectively, so that the two position feature values can be calculated conveniently and accurately. This improves the accuracy and convenience of obtaining the degree of structural difference, and thereby the accuracy of the shape-near-word determination result.
The steps of the above methods are divided as they are only for clarity of description. In implementation, several steps may be combined into one step, or one step may be split into multiple steps; as long as the same logical relationship is included, they all fall within the protection scope of this patent. Adding insignificant modifications to an algorithm or flow, or introducing insignificant designs, without altering the core design of the algorithm and flow also falls within the protection scope of this patent.
A third embodiment of the present invention relates to a shape-near-word determining apparatus, as shown in FIG. 3, including: a first obtaining module 301, configured to obtain a stroke similarity of a first character and a second character; an extracting module 302, configured to extract a common stroke sequence of the first character and the second character if the stroke similarity is greater than a preset similarity, where the common stroke sequence comprises a plurality of strokes that are the same in the first character and the second character, and the plurality of strokes are continuous in both the first character and the second character; a second obtaining module 303, configured to obtain a first relative position of the common stroke sequence in the first character and a second relative position of the common stroke sequence in the second character; and a determining module 304, configured to determine whether the first character and the second character are shape-near words according to the first relative position and the second relative position.
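The task of the extracting module can be illustrated with a longest-common-run search over stroke codes. This is only a sketch under assumptions: each character is represented as a string with one code letter per stroke (the letters used here are invented), and the common stroke sequence is taken to be the longest contiguous run shared by both strings, which the text does not require. The run is found with Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher
from typing import Tuple


def longest_common_stroke_run(first_codes: str, second_codes: str) -> Tuple[int, int, int]:
    """Return (start in first, start in second, length) of the longest run of
    identical, consecutive stroke codes shared by the two characters."""
    matcher = SequenceMatcher(None, first_codes, second_codes, autojunk=False)
    match = matcher.find_longest_match(0, len(first_codes), 0, len(second_codes))
    # A length below 2 would not be "a plurality of strokes"; callers can check this.
    return match.a, match.b, match.size
```

For the invented code strings `"hhspn"` and `"shspnz"`, the shared run is `"hspn"`, starting at index 1 in both strings with length 4.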
It should be noted that this embodiment is an apparatus embodiment corresponding to the first or second embodiment, and can be implemented in cooperation with the first or second embodiment. The related technical details mentioned in the first or second embodiment remain valid in this embodiment, and the technical effects achieved in the first or second embodiment can also be achieved in this embodiment; to avoid repetition, they are not described again here. Accordingly, the related technical details mentioned in this embodiment can also be applied to the first or second embodiment.
It should be noted that each module in this embodiment is a logic module. In practical applications, one logic unit may be one physical unit, part of one physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, units that are not closely related to solving the technical problem presented by the present invention are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
A fourth embodiment of the present invention relates to an electronic device, as shown in FIG. 4, comprising at least one processor 401 and a memory 402 communicatively coupled to the at least one processor 401. The memory 402 stores instructions executable by the at least one processor 401, and the instructions are executed by the at least one processor 401 so that the at least one processor 401 can execute the shape-near-word determining method of the first embodiment or the second embodiment.
The memory 402 and the processor 401 are connected by a bus. The bus may comprise any number of interconnected buses and bridges, and connects the various circuits of the one or more processors 401 and the memory 402 together. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore are not described further herein. A bus interface provides an interface between the bus and a transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. Data processed by the processor 401 is transmitted over a wireless medium via an antenna; the antenna also receives data and passes it to the processor 401.
The processor 401 is responsible for managing the bus and for general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 402 may be used to store data used by the processor 401 in performing operations.
A fifth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program implements the above-described method embodiments when executed by a processor.
That is, it will be understood by those skilled in the art that all or part of the steps of the methods in the embodiments described above may be implemented by a program stored in a storage medium, where the program includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods in the embodiments described herein. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of carrying out the invention and that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (9)

1. A method for determining a shape-near word, comprising:
acquiring a stroke similarity of a first character and a second character;
if the stroke similarity is greater than a preset similarity, extracting a common stroke sequence of the first character and the second character; wherein the common stroke sequence comprises a plurality of strokes that are the same in the first character and the second character, and the plurality of strokes are continuous in both the first character and the second character;
respectively acquiring a first relative position of the common stroke sequence in the first character and a second relative position of the common stroke sequence in the second character;
wherein the first relative position is: the relative position, in a field grid, of the common stroke sequence in the first character, or the position of each stroke of the common stroke sequence in the first character relative to the previous stroke, wherein the first relative position is an endpoint coordinate position or an angle position; and the second relative position is: the relative position, in a field grid, of the common stroke sequence in the second character, or the position of each stroke of the common stroke sequence in the second character relative to the previous stroke, wherein the second relative position is an endpoint coordinate position or an angle position;
determining whether the first character and the second character are shape-near words according to the first relative position and the second relative position, comprising: acquiring a comparison relation according to the first relative position and the second relative position; and determining whether the first character and the second character are shape-near words according to the comparison relation.
2. The shape-near-word determining method according to claim 1, wherein the respectively acquiring a first relative position of the common stroke sequence in the first character and a second relative position of the common stroke sequence in the second character comprises:
respectively calculating a first position characteristic value corresponding to each stroke of the common stroke sequence in the first character;
respectively calculating a second position characteristic value corresponding to each stroke of the common stroke sequence in the second character;
and the determining whether the first character and the second character are shape-near words according to the first relative position and the second relative position comprises:
determining whether the first character and the second character are shape-near words according to the first position characteristic value and the second position characteristic value.
3. The shape-near-word determining method according to claim 2, wherein the respectively calculating a first position characteristic value corresponding to each stroke of the common stroke sequence in the first character comprises:
acquiring endpoint coordinates of each stroke of the common stroke sequence in the first character;
calculating the first position characteristic value according to the endpoint coordinates in the first character;
and the respectively calculating a second position characteristic value corresponding to each stroke of the common stroke sequence in the second character comprises:
acquiring endpoint coordinates of each stroke of the common stroke sequence in the second character;
calculating the second position characteristic value according to the endpoint coordinates in the second character.
4. The shape-near-word determining method according to claim 2, wherein the comparison relation includes: a first comparison relation and/or a second comparison relation;
the first comparison relation is a comparison relation between a first included angle and a second included angle, the first position characteristic value comprises the first included angle and the second position characteristic value comprises the second included angle, and the first included angle and the second included angle are obtained as follows:
calculating, according to the endpoint coordinates in the first character and in sequence from the second stroke of the common stroke sequence in the first character, a first included angle between a horizontal line and the line connecting the first endpoint of each stroke to the last endpoint of the previous stroke;
calculating, according to the endpoint coordinates in the second character and in sequence from the second stroke of the common stroke sequence in the second character, a second included angle between a horizontal line and the line connecting the first endpoint of each stroke to the last endpoint of the previous stroke;
the second comparison relation is a comparison relation between a third included angle and a fourth included angle, the first position characteristic value comprises the third included angle and the second position characteristic value comprises the fourth included angle, and the third included angle and the fourth included angle are obtained as follows:
calculating, according to the endpoint coordinates in the first character and in sequence from the second stroke of the common stroke sequence in the first character, a third included angle between the first line of each stroke and the last line of the previous stroke;
calculating, according to the endpoint coordinates in the second character and in sequence from the second stroke of the common stroke sequence in the second character, a fourth included angle between the first line of each stroke and the last line of the previous stroke.
5. The shape-near-word determining method according to claim 4, wherein the comparison relation includes the first comparison relation and the second comparison relation, and the determining whether the first character and the second character are shape-near words according to the comparison relation comprises:
acquiring n-1 first differences according to the first comparison relation; wherein the n-1 first differences are the differences between the first included angle and the second included angle corresponding to each stroke, calculated in sequence starting from the second stroke in the common stroke sequence, and n is the number of strokes included in the common stroke sequence;
acquiring n-1 second differences according to the second comparison relation; wherein the n-1 second differences are the differences between the third included angle and the fourth included angle corresponding to each stroke, calculated in sequence starting from the second stroke in the common stroke sequence;
determining whether the first character and the second character are shape-near words according to the n-1 first differences and the n-1 second differences.
6. The shape-near-word determining method according to claim 5, wherein the determining whether the first character and the second character are shape-near words according to the n-1 first differences and the n-1 second differences comprises:
if the n-1 first differences and the n-1 second differences are all smaller than a preset difference, determining that the first character and the second character are shape-near words.
7. A shape-near-word determining apparatus, comprising:
a first acquisition module, configured to acquire a stroke similarity of a first character and a second character;
an extraction module, configured to extract a common stroke sequence of the first character and the second character if the stroke similarity is greater than a preset similarity; wherein the common stroke sequence comprises a plurality of strokes that are the same in the first character and the second character, and the plurality of strokes are continuous in both the first character and the second character;
a second acquisition module, configured to respectively acquire a first relative position of the common stroke sequence in the first character and a second relative position of the common stroke sequence in the second character;
wherein the first relative position is: the relative position, in a field grid, of the common stroke sequence in the first character, or the position of each stroke of the common stroke sequence in the first character relative to the previous stroke, wherein the first relative position is an endpoint coordinate position or an angle position; and the second relative position is: the relative position, in a field grid, of the common stroke sequence in the second character, or the position of each stroke of the common stroke sequence in the second character relative to the previous stroke, wherein the second relative position is an endpoint coordinate position or an angle position;
a determining module, configured to determine whether the first character and the second character are shape-near words according to the first relative position and the second relative position, including: acquiring a comparison relation according to the first relative position and the second relative position; and determining whether the first character and the second character are shape-near words according to the comparison relation.
8. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the shape-near-word determining method of any one of claims 1 to 6.
9. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the shape-near-word determining method of any one of claims 1 to 6.
CN201911412330.0A 2019-12-31 2019-12-31 Shape-near-word determining method, electronic device, and computer-readable storage medium Active CN111222590B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911412330.0A CN111222590B (en) 2019-12-31 2019-12-31 Shape-near-word determining method, electronic device, and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN111222590A CN111222590A (en) 2020-06-02
CN111222590B true CN111222590B (en) 2024-04-12

Family

ID=70808214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911412330.0A Active CN111222590B (en) 2019-12-31 2019-12-31 Shape-near-word determining method, electronic device, and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN111222590B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766236B (en) * 2021-03-10 2023-04-07 拉扎斯网络科技(上海)有限公司 Text generation method and device, computer equipment and computer readable storage medium
CN114283420A (en) * 2021-12-21 2022-04-05 中国联合网络通信集团有限公司 Shape and proximity character distinguishing method, device, equipment and medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS59128680A (en) * 1983-01-14 1984-07-24 Hitachi Ltd On-line recognizing system of handwritten character
CN1488120A (en) * 2001-01-15 2004-04-07 �¿���˹�����ɷ����޹�˾ Method, device and computer program for recognition of a handwritten character
JP2004185264A (en) * 2002-12-03 2004-07-02 Canon Inc Character recognition method
CN103810506A (en) * 2014-01-03 2014-05-21 南京师范大学 Method for identifying strokes of handwritten Chinese characters
CN103927535A (en) * 2014-05-08 2014-07-16 北京汉仪科印信息技术有限公司 Recognition method and device for Chinese character writing
CN104461337A (en) * 2013-09-24 2015-03-25 中央研究院 Method for improving handwriting input efficiency
WO2015139497A1 (en) * 2014-03-19 2015-09-24 北京奇虎科技有限公司 Method and apparatus for determining similar characters in search engine
CN105608462A (en) * 2015-12-10 2016-05-25 小米科技有限责任公司 Character similarity judgment method and device
CN106598920A (en) * 2016-11-28 2017-04-26 昆明理工大学 Similar Chinese character classification method combining stroke codes with Chinese character dot matrixes
CN110097002A (en) * 2019-04-30 2019-08-06 北京达佳互联信息技术有限公司 Nearly word form determines method, apparatus, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070040707A1 (en) * 2005-08-16 2007-02-22 Lai Jenny H Separation of Components and Characters in Chinese Text Input

Non-Patent Citations (1)

Title
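Wang Xiao; Lü Xiaoqing; Tang Zhi. Chinese character font recognition based on stroke-end shape similarity. Journal of Peking University (Natural Science Edition). 2012, (No. 01), 57-63. *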

Also Published As

Publication number Publication date
CN111222590A (en) 2020-06-02

Similar Documents

Publication Publication Date Title
WO2018040899A1 (en) Error correction method and device for search term
CN103514201B (en) Method and device for querying data in non-relational database
CN111222590B (en) Shape-near-word determining method, electronic device, and computer-readable storage medium
US20170331492A1 (en) Data Processing Method and Device
CN110033515B (en) Graph conversion method, graph conversion device, computer equipment and storage medium
JP5788047B2 (en) Encoder for encoding text into matrix code symbols and decoder for decoding matrix code symbols
CN103927535A (en) Recognition method and device for Chinese character writing
CN114782970A (en) Table extraction method, system and readable medium
CN115331213A (en) Character recognition method, chip, electronic device, and storage medium
CN112307138B (en) Method, system and medium for storing and inquiring regional information
CN115563058A (en) Similar case retrieval method based on element extraction
CN113806601B (en) Peripheral interest point retrieval method and storage medium
CN114445808A (en) Swin transform-based handwritten character recognition method and system
CN111507430B (en) Feature coding method, device, equipment and medium based on matrix multiplication
CN111310450B (en) Character string word segmentation method, device, equipment and storage medium
US8976048B2 (en) Efficient processing of Huffman encoded data
CN109492068B (en) Method and device for positioning object in predetermined area and electronic equipment
CN111090737A (en) Word stock updating method and device, electronic equipment and readable storage medium
CN111104484B (en) Text similarity detection method and device and electronic equipment
CN102566770A (en) Five-stroke input method based on fuzzy stroke orders
KR101322193B1 (en) Circular pattern code, system for analyzing circular pattern code using video input unit and computer-readable recording medium for the same
CN108132924B (en) EXCEL-based chip port mapping management method
CN109710607A (en) A kind of hash query method solved based on weight towards higher-dimension big data
CN116362202B (en) Font generation method, storage medium and electronic device
CN117152458B (en) Method and system for rapidly extracting connected domain based on travel coding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant