CN110414496B - Similar word recognition method and device, computer equipment and storage medium - Google Patents

Similar word recognition method and device, computer equipment and storage medium

Info

Publication number
CN110414496B
CN110414496B (application CN201810386017.3A)
Authority
CN
China
Prior art keywords
character
characters
value
picture
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810386017.3A
Other languages
Chinese (zh)
Other versions
CN110414496A (en)
Inventor
余淼
刘晓波
郑杰鹏
吴家林
邵英杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Original Assignee
Baidu Online Network Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baidu Online Network Technology Beijing Co Ltd filed Critical Baidu Online Network Technology Beijing Co Ltd
Priority to CN201810386017.3A priority Critical patent/CN110414496B/en
Publication of CN110414496A publication Critical patent/CN110414496A/en
Application granted granted Critical
Publication of CN110414496B publication Critical patent/CN110414496B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/23Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition based on positionally close patterns or neighbourhood relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a similar word recognition method and apparatus, a computer device and a storage medium. The method comprises: for each character to be processed, provided in picture form, respectively acquiring the picture feature, the character element feature and the character structure feature of the character; and determining the similar characters among the characters to be processed according to the acquired features. Applying the solution of the invention can improve the accuracy of the recognition result.

Description

Similar word recognition method and device, computer equipment and storage medium
[ technical field ]
The present invention relates to computer application technologies, and in particular, to a method and an apparatus for recognizing similar words, a computer device, and a storage medium.
[ background of the invention ]
Many Chinese characters are similar in shape and easily confused, and correctly identifying and distinguishing such confusable similar characters is important in many respects.
For example: providing a similar-character retrieval function for learners of Chinese, so that confusable characters can be studied side by side and memorized more firmly; providing a similar-character candidate list for existing Optical Character Recognition (OCR) technology to support OCR error correction; and providing similar-character lists for Chinese handwriting recognition teams, so that recognition models can be trained in a targeted manner and recognition accuracy can be improved.
In the prior art, similar characters are generally recognized in one of the following ways: 1) manual labeling, in which similar characters are annotated by hand; this consumes a large amount of labor. 2) A similarity algorithm at the glyph-picture level, which greatly reduces labor cost but in many cases cannot give an ideal result, for example for deformation-type characters: a Chinese-character picture is converted into a feature representation and the similarity between characters is defined by a vector distance, yet two visually similar characters may share only a limited intersection of pixel points at the picture level, so the expected similarity result cannot be given effectively and the accuracy of the recognition result is reduced.
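For illustration only, here is a minimal sketch of the picture-layer similarity described above, flattening a binarized glyph into a vector and comparing by cosine similarity; the use of NumPy and raw pixel flattening is an assumption made for this sketch, not a detail of the cited prior art.

```python
import numpy as np

def picture_level_similarity(glyph_a: np.ndarray, glyph_b: np.ndarray) -> float:
    """Cosine similarity between two same-sized binarized character pictures (0/1 matrices)."""
    a = glyph_a.astype(float).ravel()
    b = glyph_b.astype(float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Two visually similar characters whose strokes barely overlap pixel-wise still get a low
# score here, which is exactly the limitation the present invention aims to overcome.
```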
[ summary of the invention ]
In view of the above, the invention provides a similar word recognition method, a similar word recognition device, a computer device and a storage medium.
The specific technical scheme is as follows:
a method of similar word recognition, comprising:
respectively acquiring picture characteristics, character element characteristics and character structure characteristics of each character to be processed in a picture form;
and determining similar characters in the characters to be processed according to the acquired characteristics.
According to a preferred embodiment of the present invention, the acquiring the picture feature of the text includes:
converting the picture of the characters into a matrix format consisting of 0 and 1 according to the gray value of each pixel point in the picture of the characters;
and performing feature extraction through a convolutional neural network based on the conversion result to obtain the picture features.
According to a preferred embodiment of the present invention, the obtaining the character element feature of the character includes:
carrying out character splitting processing on the characters;
and generating the character element characteristics according to the character splitting result.
According to a preferred embodiment of the present invention, the generating the character element feature according to the word-splitting result includes:
assigning a value to a first sequence with a preset length, wherein each bit in the first sequence corresponds to a preset basic unit, the number of the basic units is more than one, and the value of the bit corresponding to the basic unit contained in the word splitting result is set to be 1, otherwise, the value is set to be 0;
and taking the first sequence of assignment completion as the character element characteristic.
According to a preferred embodiment of the present invention, the acquiring the text structure characteristics of the text includes:
determining a font structure to which the characters belong;
and generating the character structure characteristics according to the determination result.
According to a preferred embodiment of the present invention, the generating the character structure feature according to the determination result includes:
assigning a value to a second sequence with a preset length, wherein each bit in the second sequence corresponds to a preset font structure, the number of the font structures is more than one, and the value of the bit corresponding to the font structure to which the character belongs is set to be 1, otherwise, the value is set to be 0;
and taking the second sequence with the assigned value as the character structure characteristic.
According to a preferred embodiment of the present invention, the font structures include: an upper three-enclosure structure, an upper-lower structure, an upper-middle-lower structure, an upper enclosure structure, a lower three-enclosure structure, a full enclosure structure, a single structure, an upper-right enclosure structure, a 品-shaped structure, a left three-enclosure structure, an upper-left enclosure structure, a lower-left enclosure structure, a left-middle-right structure, a left enclosure structure, a left-right structure and an embedded structure.
According to a preferred embodiment of the present invention, the determining, according to the obtained characteristics, a similar word in the characters to be processed includes:
for each character, respectively carrying out coding processing on the acquired character characteristics through a depth coding layer;
and determining similar characters in the characters to be processed according to the coding processing results of all the characters in the characters to be processed.
According to a preferred embodiment of the present invention, the determining, according to the encoding processing result of all the characters to be processed, similar characters in the characters to be processed includes:
and carrying out hierarchical clustering according to the coding processing results of all the characters in the characters to be processed to form a tree structure, and obtaining a set of similar characters according to a preset cutting threshold value.
A similar word recognition apparatus comprising: an acquisition unit and an identification unit;
the acquiring unit is used for respectively acquiring the picture characteristics, the character element characteristics and the character structure characteristics of each character to be processed in the form of a picture;
and the identification unit is used for determining similar characters in the characters to be processed according to the acquired characteristics.
According to a preferred embodiment of the present invention, the obtaining unit converts the image of the text into a matrix format formed by 0 and 1 according to the gray value of each pixel point in the image of the text, and performs feature extraction through a convolutional neural network based on the conversion result to obtain the image feature.
According to a preferred embodiment of the present invention, the obtaining unit performs a word splitting process on the word, and generates the character element feature according to a word splitting result.
According to a preferred embodiment of the present invention, the obtaining unit assigns a value to a first sequence with a predetermined length, each bit in the first sequence corresponds to a preset basic unit, the number of the basic units is greater than one, the value of the bit corresponding to the basic unit included in the word splitting result is set to 1, otherwise, the value is set to 0, and the first sequence with the assigned value is used as the character element feature.
According to a preferred embodiment of the present invention, the obtaining unit determines a font structure to which the text belongs, and generates the text structure feature according to a determination result.
According to a preferred embodiment of the present invention, the obtaining unit assigns a value to a second sequence with a predetermined length, each bit in the second sequence corresponds to a preset font structure, the number of the font structures is greater than one, the value of the bit corresponding to the font structure to which the text belongs is set to 1, otherwise, the value is set to 0, and the second sequence with the assigned value is used as the text structure feature.
According to a preferred embodiment of the present invention, the font structures include: an upper three-enclosure structure, an upper-lower structure, an upper-middle-lower structure, an upper enclosure structure, a lower three-enclosure structure, a full enclosure structure, a single structure, an upper-right enclosure structure, a 品-shaped structure, a left three-enclosure structure, an upper-left enclosure structure, a lower-left enclosure structure, a left-middle-right structure, a left enclosure structure, a left-right structure and an embedded structure.
According to a preferred embodiment of the present invention, the recognition unit performs coding processing on the features of the obtained characters through a depth coding layer for each character, and determines similar characters in the characters to be processed according to coding processing results of all characters in the characters to be processed.
According to a preferred embodiment of the present invention, the recognition unit performs hierarchical clustering according to the encoding processing results of all the characters in the characters to be processed to form a tree structure, and obtains a set of similar characters according to a preset clipping threshold.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described above when executing the program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method as set forth above.
As can be seen from the above description, with the solution of the present invention, the picture feature, the character element feature and the character structure feature of each character to be processed, provided in picture form, are acquired respectively, and the similar characters among the characters to be processed are then determined according to the acquired features, so that, compared with the prior-art approaches described above, no manual labeling is required and the accuracy of the recognition result can be improved.
[ description of the drawings ]
FIG. 1 is a flowchart of an embodiment of a similar word recognition method according to the present invention.
Fig. 2 is a schematic diagram of a similar word recognition process according to the present invention.
Fig. 3 is a schematic structural diagram of a similar word recognition apparatus according to an embodiment of the present invention.
FIG. 4 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention.
[ detailed description of the embodiments ]
To address the problems in the prior art, the present invention provides a similar-character recognition approach: for each character to be processed, provided in picture form, the picture feature, the character element feature and the character structure feature of the character are acquired respectively, and the similar characters among the characters to be processed are then determined according to the acquired features. The characters may be of various kinds, such as Chinese characters, Korean, Japanese and the like.
In order to make the technical solution of the present invention clearer and more obvious, the solution of the present invention is further described below by referring to the drawings and examples.
It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
FIG. 1 is a flowchart of an embodiment of the similar word recognition method according to the present invention. This embodiment takes Chinese characters as an example. As shown in FIG. 1, the method includes the following steps.
In 101, for each Chinese character to be processed in a picture form, picture features, character element features and character structure features of the Chinese character are respectively obtained.
In 102, similar characters in the Chinese characters to be processed are determined according to the obtained characteristics.
The Chinese characters to be processed in this embodiment are in picture form. The picture size is not limited, but larger pictures carry more information; in addition, the pictures of different Chinese characters usually have the same size. Each picture contains one Chinese character.
For each Chinese character to be processed, the picture characteristics, character element characteristics and character structure characteristics of the Chinese character can be respectively obtained, and the obtaining mode of the characteristics is described below.
One) picture characteristics
The picture of the Chinese character is first converted into a matrix consisting of 0s and 1s (a 0/1 matrix) according to the gray value of each pixel in the picture, and feature extraction is then performed on the conversion result through a Convolutional Neural Network (CNN) to obtain the picture feature.
Converting the picture of the Chinese character into a 0/1 matrix according to the gray value of each pixel is, in other words, binarizing the picture of the Chinese character.
At present, commonly used binarization processing methods mainly include a global threshold method, a local threshold method, a dynamic threshold method and the like, and which method is specifically adopted can be determined according to actual needs.
Global thresholding: a single threshold is selected for binarizing the whole picture; the gray value of each pixel is compared with the threshold, and the pixel is set to 1 if its gray value is greater than the threshold, and to 0 otherwise.
Local thresholding: the picture is divided into several sub-regions and the mean gray value of the pixels in each sub-region is used as that sub-region's threshold, each sub-region being binarized separately; alternatively, a threshold is set according to the gray-value variation in the neighborhood of each pixel, and the picture is then binarized pixel by pixel.
Dynamic thresholding: the threshold depends not only on the gray values of a pixel and its surrounding pixels but also on the pixel's coordinate position; for example, the gray-value distribution of a local region can be analyzed statistically and different local thresholds determined from the statistics.
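As a minimal sketch of the simplest of these, global thresholding, assuming NumPy and Pillow and an illustrative threshold of 128 (neither the libraries nor the value is specified in the patent):

```python
import numpy as np
from PIL import Image

def binarize_global(path: str, threshold: int = 128) -> np.ndarray:
    """Convert a character picture into a 0/1 matrix with a single global threshold."""
    gray = np.asarray(Image.open(path).convert("L"), dtype=np.uint8)  # gray values in 0-255
    # As described above: pixels whose gray value exceeds the threshold become 1, others 0.
    return (gray > threshold).astype(np.uint8)
```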
After the Chinese-character picture has been converted into a 0/1 matrix, feature extraction can be performed on the conversion result through a convolutional neural network to obtain the required picture feature.
A convolutional neural network is a feed-forward neural network that performs very well on picture processing. It contains convolutional and pooling layers; through convolution and pooling it can effectively recognize two-dimensional patterns that are invariant to displacement, scaling and other forms of distortion, and can therefore also handle characters that differ only by slight glyph deformation.
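A minimal sketch of such a CNN feature extractor for the 0/1 matrix, written in PyTorch purely for illustration; the 64x64 input size, layer widths and 128-dimensional output are assumptions rather than values given in the patent.

```python
import torch
import torch.nn as nn

class GlyphCNN(nn.Module):
    """Extracts a fixed-length picture feature from a binarized 64x64 character picture."""

    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 64 -> 32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
        )
        self.proj = nn.Linear(32 * 16 * 16, feature_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, 64, 64) tensor built from the 0/1 matrix
        return self.proj(self.features(x).flatten(start_dim=1))

# Usage sketch: picture_feature = GlyphCNN()(torch.from_numpy(matrix).float()[None, None])
```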
Two) character element characteristics
The Chinese character can first be split into its components, and the character element feature can then be generated from the splitting result.
Specifically, a first sequence with a predetermined length may be assigned values: each bit in the first sequence corresponds to a preset basic unit, the number of basic units being greater than one; the bit corresponding to each basic unit contained in the splitting result is set to 1 and the other bits are set to 0; the assigned first sequence is then used as the required character element feature.
A plurality of basic units can be predefined, and each Chinese character to be processed can be split according to the defined basic units. For example, the Chinese character 香 (xiang) can be split into 禾 (he) and 日 (ri), both of which are basic units. The number and the specific content of the basic units can be determined according to actual needs.
After the word splitting is completed, a value can be assigned to a first sequence with a preset length, the length of the first sequence is the same as the number of basic units, each bit in the first sequence corresponds to one basic unit, if a word splitting result contains a certain basic unit, the value of the bit corresponding to the basic unit can be set to be 1, and otherwise, the value can be set to be 0.
By way of example:
assuming a total of 70 basic units, the length of the first sequence is 70;
assuming that the 70 basic units are numbered 1-70, respectively, the basic unit 1 corresponds to the first bit in the first sequence, the basic unit 2 corresponds to the second bit in the first sequence, and so on;
assuming that the splitting result of the Chinese character 香 (xiang) is 禾 ('standing grain') and 日 ('day'), which are basic unit 10 and basic unit 20 respectively, then the 10th and 20th bits in the first sequence are set to 1 and all other bits are set to 0;
the first sequence after this assignment is the character element feature of the Chinese character 香.
The order in which the basic units are numbered is not limited, but the same numbering must be used uniformly for all Chinese characters to be processed.
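A minimal sketch of building the first sequence as described above; the tiny decomposition table below is a hypothetical stand-in for the predefined basic units (about 70 in the example), and the splitting of 香 into 禾 and 日 follows the example in the text.

```python
# Hypothetical, much-shortened tables; a real system would define ~70 basic units.
BASIC_UNITS = ["禾", "日", "口", "木"]
DECOMPOSITION = {"香": ["禾", "日"], "杳": ["木", "日"]}

def char_element_feature(ch: str) -> list[int]:
    """First sequence: one bit per basic unit, set to 1 if the unit appears in the split result."""
    parts = DECOMPOSITION.get(ch, [ch])  # a non-compound character maps to itself
    return [1 if unit in parts else 0 for unit in BASIC_UNITS]

# char_element_feature("香") -> [1, 1, 0, 0]
```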
III) character structural features
The font structure of the Chinese character can be determined firstly, and then the character structure characteristics can be generated according to the determination result.
Specifically, a second sequence with a predetermined length may be assigned, each bit in the second sequence corresponds to a preset font structure, the number of the font structures is greater than one, the value of the bit corresponding to the font structure to which the chinese character belongs is set to 1, otherwise, the value is set to 0, and the assigned second sequence is used as the required character structure characteristic.
A plurality of font structures may be predefined; preferably, 16 font structures may be defined, namely: an upper three-enclosure structure, an upper-lower structure, an upper-middle-lower structure, an upper enclosure structure, a lower three-enclosure structure, a full enclosure structure, a single structure, an upper-right enclosure structure, a 品-shaped structure, a left three-enclosure structure, an upper-left enclosure structure, a lower-left enclosure structure, a left-middle-right structure, a left enclosure structure, a left-right structure and an embedded structure.
For example, Chinese characters with the upper three-enclosure structure may include: 'inquiring' and 'harmonizing';
Chinese characters with the upper-lower structure may include: 'abuse' and 'class';
Chinese characters with the upper-middle-lower structure may include: 'mane' and 'zhong';
Chinese characters with the lower three-enclosure structure may include: 'inlaying' and 'drawing';
Chinese characters with the full enclosure structure may include: 'fixing' and 'shaping';
Chinese characters with the single structure may include: 'five' and 'second';
Chinese characters with the upper-right enclosure structure may include: 'ten days' and 'department';
Chinese characters with the 品-shaped structure may include: 'article' and 'epitaxy';
Chinese characters with the left three-enclosure structure may include: 'rectifying' and 'boxing';
Chinese characters with the upper-left enclosure structure may include: 'early' and 'pressure';
Chinese characters with the lower-left enclosure structure may include: 'this' and 'delay';
Chinese characters with the left-middle-right structure may include: 'weighing scale' and 'spot';
Chinese characters with the left-right structure may include: 'materials' and 'materials';
Chinese characters with the embedded structure may include: 'city' and 'Wu';
Chinese characters with the left three-enclosure, upper-left enclosure or lower-left enclosure structure can all also be regarded as belonging to the left enclosure structure;
Chinese characters with the upper-right enclosure, upper three-enclosure or upper-left enclosure structure all belong to the upper enclosure structure.
The font structure to which each Chinese character belongs can be predefined.
For each Chinese character to be processed, after the font structure to which the Chinese character belongs is determined, a second sequence with a preset length can be assigned, the length of the second sequence is the same as the number of the font structures, each bit in the second sequence corresponds to one font structure, if the Chinese character to be processed belongs to one font structure, the value of the bit corresponding to the font structure can be set to be 1, and if not, the value can be set to be 0.
Since a Chinese character may belong to multiple font structures at the same time, for example both the upper-left enclosure structure and the upper enclosure structure, the bits corresponding to both font structures in the second sequence need to be set to 1.
By way of example:
assuming a total of 16 font structures, the length of the second sequence is 16;
if the 16 font structures are numbered 1-16 respectively, the font structure 1 corresponds to the first bit in the second sequence, the font structure 2 corresponds to the second bit in the second sequence, and so on;
assuming that a certain Chinese character belongs to the font structure 2 and the font structure 5 at the same time, setting the values of the 2 nd bit and the 5 th bit in the second sequence as 1, and setting the values of other bits as 0;
assuming that a certain Chinese character belongs to the font structure 6, setting the value of the 6 th bit in the second sequence as 1, and setting the values of other bits as 0;
the second sequence after the setting is the required character structure characteristic.
The order in which the font structures are numbered is not limited, but the same numbering must be used uniformly for all Chinese characters to be processed.
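A minimal sketch of the second sequence, mirroring the character element feature above; the 16 structure names follow the list given earlier, while the character-to-structure lookup table is a hypothetical stand-in for the predefined assignment mentioned in the text.

```python
FONT_STRUCTURES = [
    "upper three-enclosure", "upper-lower", "upper-middle-lower", "upper enclosure",
    "lower three-enclosure", "full enclosure", "single", "upper-right enclosure",
    "pin-shaped", "left three-enclosure", "upper-left enclosure", "lower-left enclosure",
    "left-middle-right", "left enclosure", "left-right", "embedded",
]

# Hypothetical predefined mapping; a character may belong to more than one structure.
CHAR_STRUCTURES = {"香": ["upper-lower"], "固": ["full enclosure"]}

def char_structure_feature(ch: str) -> list[int]:
    """Second sequence: one bit per font structure, 1 for every structure the character belongs to."""
    belongs = set(CHAR_STRUCTURES.get(ch, []))
    return [1 if s in belongs else 0 for s in FONT_STRUCTURES]
```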
After the picture characteristics, the character element characteristics and the character structure characteristics of each Chinese character to be processed are respectively obtained, the obtained characteristics can be further coded through a depth coding layer.
The depth coding layer can effectively integrate and extract high-dimensional features among multi-dimensional sparse data, realize the dimension reduction and coding of the data, and express the multi-dimensional information of each Chinese character in a coding form.
Then, similar characters among the Chinese characters to be processed can be determined according to the encoding results of all the Chinese characters to be processed. Specifically, hierarchical clustering can be performed on the encoding results of all the Chinese characters to be processed to form a tree structure, and sets of similar characters can be obtained according to a preset cutting threshold.
Hierarchical clustering is a clustering algorithm that builds a hierarchically nested clustering tree by computing the similarity between data points of different classes; in other words, hierarchical clustering yields a tree structure. A cutting threshold can then be defined according to different product requirements, which yields a number of similar-character sets, the Chinese characters within each set being mutually similar.
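A minimal sketch of this clustering step using SciPy; the Euclidean metric, average linkage and the cutting threshold of 1.5 are illustrative assumptions rather than values specified in the patent.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def similar_character_sets(codes: np.ndarray, chars: list[str], cut_threshold: float = 1.5):
    """codes: (n_chars, code_dim) encodings produced by the depth coding layer."""
    tree = linkage(codes, method="average", metric="euclidean")      # hierarchical clustering tree
    labels = fcluster(tree, t=cut_threshold, criterion="distance")   # cut the tree at the threshold
    groups: dict[int, list[str]] = {}
    for ch, label in zip(chars, labels):
        groups.setdefault(int(label), []).append(ch)
    # Each returned set contains characters that the clustering regards as mutually similar.
    return [group for group in groups.values() if len(group) > 1]
```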
Based on the above description, FIG. 2 is a schematic diagram of the similar word recognition process according to the present invention. As shown in FIG. 2, the depth coding layer may be an auto-encoding (Auto-Encoder) depth coding layer.
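A minimal sketch of such an auto-encoding depth coding layer operating on the concatenation of the three features, again in PyTorch for illustration; the dimensions (128-dimensional picture feature, 70 basic units, 16 font structures, 32-dimensional code) are assumptions, and the patent only requires that the layer reduce and encode the multi-dimensional feature information.

```python
import torch
import torch.nn as nn

class DepthCodingLayer(nn.Module):
    """Auto-encoder that compresses the concatenated picture / element / structure features."""

    def __init__(self, input_dim: int = 128 + 70 + 16, code_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU(), nn.Linear(64, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 64), nn.ReLU(), nn.Linear(64, input_dim))

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        code = self.encoder(x)            # low-dimensional encoding used for clustering
        return code, self.decoder(code)   # reconstruction used for the training loss

# Training sketch: minimize the reconstruction error, then feed each character's `code`
# to the hierarchical clustering step above.
```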
For simplicity of explanation, the foregoing method embodiments are presented as a series of interrelated acts, but those skilled in the art will appreciate that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the described embodiments are preferred embodiments, and that the acts and modules involved are not necessarily required by the invention.
In short, with the solution of the invention no manual labeling is needed, which saves labor cost, and similar characters are recognized by combining the picture feature, the character element feature and the character structure feature, which improves the accuracy of the recognition result.
In addition, although the method embodiments take Chinese characters as an example, the solution of the present invention is also applicable to characters other than Chinese, such as Korean and Japanese; corresponding basic units and font structures can be defined for such characters, and the specific implementation flow is similar to that of the method embodiments and is not described again.
Furthermore, the solution of the invention is applicable to many different types of characters and therefore has wide applicability; it applies to both compound characters and non-compound characters. For example, 香 ('incense') is a compound character while 禾 ('cereal') is a non-compound character, and the splitting result of a non-compound character may be the character itself. The solution of the invention is particularly effective for compound characters, especially complex compound characters.
The above is a description of method embodiments, and the embodiments of the present invention are further described below by way of apparatus embodiments.
Fig. 3 is a schematic structural diagram of a similar word recognition apparatus according to an embodiment of the present invention. As shown in fig. 3, includes: an acquisition unit 301 and a recognition unit 302.
An obtaining unit 301, configured to obtain, for each character to be processed in a picture format, a picture feature, a character element feature, and a character structure feature of the character respectively.
An identifying unit 302 is configured to determine similar words in the characters to be processed according to the obtained features.
The characters to be processed in this embodiment are in picture form; the picture size is not limited, but larger pictures carry more information. Each picture contains one character.
For each word to be processed, the obtaining unit 301 can obtain the picture feature, the word element feature and the word structure feature of the word.
Specifically, the obtaining unit 301 may convert the text image into a matrix format (i.e., 0/1 matrix) composed of 0 and 1 according to the gray value of each pixel point in the text image, and perform feature extraction through a convolutional neural network based on the conversion result to obtain the image feature.
The method comprises the steps of converting a picture of characters into an 0/1 matrix according to the gray value of each pixel point in the picture of the characters, namely, performing binarization processing on the picture of the characters.
After the text picture has been converted into a 0/1 matrix, feature extraction can be performed on the conversion result through a convolutional neural network to obtain the required picture feature. Through convolution and pooling, the convolutional neural network can effectively recognize two-dimensional patterns that are invariant to displacement, scaling and other forms of distortion, and can therefore also handle characters that differ only by slight glyph deformation.
The obtaining unit 301 may perform a word splitting process on the word, and generate a character element feature according to the word splitting result.
Specifically, the obtaining unit 301 may assign a value to a first sequence with a predetermined length, where each bit in the first sequence corresponds to a preset basic unit, the number of the basic units is greater than one, the value of the bit corresponding to the basic unit included in the word splitting result is set to 1, otherwise, the value is set to 0, and the first sequence with the assigned value is used as the character element feature.
A plurality of basic units can be predefined, and each character to be processed can be split according to the defined basic units. For example, the character 香 ('incense') can be split into 禾 ('standing grain') and 日 ('day'), both of which are basic units.
After the word splitting is completed, the assignment can be performed on the first sequence with the preset length, the length of the first sequence is the same as the number of the basic units, each bit in the first sequence corresponds to one basic unit, if a word splitting result contains a certain basic unit, the value of the bit corresponding to the basic unit can be set to 1, otherwise, the value can be set to 0.
The obtaining unit 301 may determine a font structure to which the text belongs, and generate a text structure feature according to the determination result.
Specifically, the obtaining unit 301 may assign a value to a second sequence with a predetermined length, where each bit in the second sequence corresponds to a preset font structure, the number of the font structures is greater than one, and set the value of the bit corresponding to the font structure to which the text belongs to 1, otherwise, set the value to 0, and use the second sequence with the assigned value as the text structure characteristic.
A plurality of font structures can be predefined. Taking Chinese characters as an example, 16 font structures are preferably defined, namely: an upper three-enclosure structure, an upper-lower structure, an upper-middle-lower structure, an upper enclosure structure, a lower three-enclosure structure, a full enclosure structure, a single structure, an upper-right enclosure structure, a 品-shaped structure, a left three-enclosure structure, an upper-left enclosure structure, a lower-left enclosure structure, a left-middle-right structure, a left enclosure structure, a left-right structure and an embedded structure.
And for each character to be processed, after the font structure to which the character belongs is determined, assigning a value to a second sequence with a preset length, wherein the length of the second sequence is the same as the number of the font structures, each bit in the second sequence corresponds to one font structure, if the character to be processed belongs to a certain font structure, the value of the bit corresponding to the font structure can be set to be 1, and otherwise, the value can be set to be 0.
Since a character may belong to multiple font structures at the same time, for example both the upper-left enclosure structure and the upper enclosure structure, the values of the bits corresponding to both font structures in the second sequence need to be set to 1.
For each character to be processed, after the picture feature, the character element feature, and the character structure feature of the character are respectively obtained, the identifying unit 302 may further perform encoding processing on the obtained features through a depth coding layer, and then may determine similar characters in the character to be processed according to the encoding processing results of all characters in the character to be processed.
Preferably, the recognition unit 302 performs hierarchical clustering according to the encoding processing results of all the characters to be processed to form a tree structure, and obtains a set of similar characters according to a preset clipping threshold.
For a specific work flow of the apparatus embodiment shown in fig. 3, reference is made to the related description in the foregoing method embodiment, and details are not repeated.
FIG. 4 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention. The computer system/server 12 shown in FIG. 4 is only one example and should not be taken to limit the scope of use or functionality of embodiments of the present invention.
As shown in fig. 4, computer system/server 12 is in the form of a general purpose computing device. The components of computer system/server 12 may include, but are not limited to: one or more processors (processing units) 16, a memory 28, and a bus 18 that connects the various system components, including the memory 28 and the processors 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, micro-channel architecture (MAC) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)30 and/or cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, and commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
The computer system/server 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any devices (e.g., network card, modem, etc.) that enable the computer system/server 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the computer system/server 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 20. As shown in FIG. 4, network adapter 20 communicates with the other modules of computer system/server 12 via bus 18. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computer system/server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processor 16 executes various functional applications and data processing, such as implementing the method in the embodiment shown in fig. 1, by executing programs stored in the memory 28.
The invention also discloses a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, will carry out the method as in the embodiment shown in fig. 1.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method, etc., can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other division manners may be available in actual implementation.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (16)

1. A method for identifying similar words, comprising:
respectively acquiring picture characteristics, character element characteristics and character structure characteristics of each character to be processed in a picture form, wherein the character element characteristics are characteristics generated according to a character splitting result of the character; wherein, obtaining the character structure characteristics of the characters comprises: determining a font structure to which the characters belong, and generating the character structure characteristics according to a determination result;
determining similar characters in the characters to be processed according to the acquired characteristics, wherein the determining comprises the following steps: respectively carrying out coding processing on the acquired characteristics of each character, performing hierarchical clustering according to the coding processing results of all characters in the characters to be processed to form a tree structure, and obtaining a set of similar characters according to a preset cutting threshold value.
2. The method of claim 1,
the acquiring of the picture features of the characters comprises:
converting the picture of the characters into a matrix format consisting of 0 and 1 according to the gray value of each pixel point in the picture of the characters;
and performing feature extraction through a convolutional neural network based on the conversion result to obtain the picture features.
3. The method of claim 1,
the character element characteristic of the character is acquired by the method comprising the following steps:
carrying out character splitting processing on the characters;
and generating the character element characteristics according to the character splitting result.
4. The method of claim 3,
the generating the character element characteristics according to the character splitting result comprises the following steps:
assigning a value to a first sequence with a preset length, wherein each bit in the first sequence corresponds to a preset basic unit, the number of the basic units is more than one, and the value of the bit corresponding to the basic unit contained in the word splitting result is set to be 1, otherwise, the value is set to be 0;
and taking the first sequence of assignment completion as the character element characteristic.
5. The method of claim 1,
the generating the character structure feature according to the determination result includes:
assigning a value to a second sequence with a preset length, wherein each bit in the second sequence corresponds to a preset font structure, the number of the font structures is more than one, and the value of the bit corresponding to the font structure to which the character belongs is set to be 1, otherwise, the value is set to be 0;
and taking the second sequence with the assigned value as the character structure characteristic.
6. The method of claim 1,
the font structures include: an upper three-enclosure structure, an upper-lower structure, an upper-middle-lower structure, an upper enclosure structure, a lower three-enclosure structure, a full enclosure structure, a single structure, an upper-right enclosure structure, a 品-shaped structure, a left three-enclosure structure, an upper-left enclosure structure, a lower-left enclosure structure, a left-middle-right structure, a left enclosure structure, a left-right structure and an embedded structure.
7. The method of claim 1,
the respectively carrying out coding processing on the acquired characteristics of each character comprises:
for each character, respectively carrying out coding processing on the acquired characteristics of the character through a depth coding layer.
8. An apparatus for recognizing similar words, comprising: an acquisition unit and an identification unit;
the acquiring unit is used for respectively acquiring the picture characteristics, the character element characteristics and the character structure characteristics of each character to be processed in a picture form, wherein the character element characteristics are generated according to the character splitting result of the character; wherein, obtaining the character structure characteristics of the characters comprises: determining a font structure to which the characters belong, and generating the character structure characteristics according to a determination result;
the identification unit is configured to determine similar characters in the characters to be processed according to the obtained features, wherein the determining comprises: respectively carrying out coding processing on the acquired features of each character, performing hierarchical clustering according to the coding processing results of all characters in the characters to be processed to form a tree structure, and obtaining a set of similar characters according to a preset cutting threshold value.
9. The apparatus of claim 8,
the obtaining unit converts the picture of the characters into a matrix format consisting of 0 and 1 according to the gray value of each pixel point in the picture of the characters, and performs feature extraction through a convolutional neural network based on the conversion result to obtain the picture features.
10. The apparatus of claim 8,
the acquisition unit carries out character splitting processing on the characters and generates character element characteristics according to character splitting results.
11. The apparatus of claim 10,
and the obtaining unit assigns a first sequence with a preset length, each bit in the first sequence corresponds to a preset basic unit, the number of the basic units is more than one, the value of the bit corresponding to the basic unit contained in the word splitting result is set to be 1, otherwise, the value is set to be 0, and the first sequence with the assigned value is used as the character element characteristic.
12. The apparatus of claim 8,
the obtaining unit assigns a second sequence with a preset length, each bit in the second sequence corresponds to a preset font structure, the number of the font structures is more than one, the value of the bit corresponding to the font structure to which the character belongs is set to be 1, otherwise, the value is set to be 0, and the second sequence with assigned value is used as the character structure characteristic.
13. The apparatus of claim 8,
the font structures include: an upper three-enclosure structure, an upper-lower structure, an upper-middle-lower structure, an upper enclosure structure, a lower three-enclosure structure, a full enclosure structure, a single structure, an upper-right enclosure structure, a 品-shaped structure, a left three-enclosure structure, an upper-left enclosure structure, a lower-left enclosure structure, a left-middle-right structure, a left enclosure structure, a left-right structure and an embedded structure.
14. The apparatus of claim 8,
and the identification unit is used for respectively carrying out coding processing on the acquired features of each character through a depth coding layer.
15. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the method of any one of claims 1 to 7.
16. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN201810386017.3A 2018-04-26 2018-04-26 Similar word recognition method and device, computer equipment and storage medium Active CN110414496B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810386017.3A CN110414496B (en) 2018-04-26 2018-04-26 Similar word recognition method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810386017.3A CN110414496B (en) 2018-04-26 2018-04-26 Similar word recognition method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110414496A CN110414496A (en) 2019-11-05
CN110414496B true CN110414496B (en) 2022-05-27

Family

ID=68345698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810386017.3A Active CN110414496B (en) 2018-04-26 2018-04-26 Similar word recognition method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110414496B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111275046B (en) * 2020-01-10 2024-04-16 鼎富智能科技有限公司 Character image recognition method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101546379A (en) * 2008-03-28 2009-09-30 富士通株式会社 Computer-readable recording medium having character recognition program recorded thereon, character recognition device, and character recognition method
CN101551711A (en) * 2009-05-21 2009-10-07 华南理工大学 Chinese character coding input method based on structure and primitive
CN102509112A (en) * 2011-11-02 2012-06-20 珠海逸迩科技有限公司 Number plate identification method and identification system thereof
WO2015008732A1 (en) * 2013-07-16 2015-01-22 株式会社湯山製作所 Optical character recognition device
CN106874947A (en) * 2017-02-07 2017-06-20 第四范式(北京)技术有限公司 Method and apparatus for determining word shape recency

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3790736B2 (en) * 2002-10-15 2006-06-28 松下電器産業株式会社 Dictionary creation device for character recognition and character recognition device
EP2074558A4 (en) * 2006-09-08 2013-07-31 Google Inc Shape clustering in post optical character recognition processing

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101546379A (en) * 2008-03-28 2009-09-30 富士通株式会社 Computer-readable recording medium having character recognition program recorded thereon, character recognition device, and character recognition method
CN101551711A (en) * 2009-05-21 2009-10-07 华南理工大学 Chinese character coding input method based on structure and primitive
CN102509112A (en) * 2011-11-02 2012-06-20 珠海逸迩科技有限公司 Number plate identification method and identification system thereof
WO2015008732A1 (en) * 2013-07-16 2015-01-22 株式会社湯山製作所 Optical character recognition device
CN106874947A (en) * 2017-02-07 2017-06-20 第四范式(北京)技术有限公司 Method and apparatus for determining word shape recency

Also Published As

Publication number Publication date
CN110414496A (en) 2019-11-05

Similar Documents

Publication Publication Date Title
CN110765996B (en) Text information processing method and device
US10133965B2 (en) Method for text recognition and computer program product
CN106980856B (en) Formula identification method and system and symbolic reasoning calculation method and system
CN111858843B (en) Text classification method and device
CN110084172B (en) Character recognition method and device and electronic equipment
CN113254654B (en) Model training method, text recognition method, device, equipment and medium
CN112507806B (en) Intelligent classroom information interaction method and device and electronic equipment
CN115063875B (en) Model training method, image processing method and device and electronic equipment
CN111488732B (en) Method, system and related equipment for detecting deformed keywords
CN112380825A (en) PDF document page-crossing table merging method and device, electronic equipment and storage medium
CN113657274A (en) Table generation method and device, electronic equipment, storage medium and product
CN114730241A (en) Gesture stroke recognition in touch user interface input
US8934716B2 (en) Method and apparatus for sequencing off-line character from natural scene
CN112486338A (en) Medical information processing method and device and electronic equipment
US20150139547A1 (en) Feature calculation device and method and computer program product
CN110414496B (en) Similar word recognition method and device, computer equipment and storage medium
CN112749639B (en) Model training method and device, computer equipment and storage medium
CN111783787B (en) Method and device for recognizing image characters and electronic equipment
CN113468972B (en) Handwriting track segmentation method for handwriting recognition of complex scene and computer product
CN115311674A (en) Handwriting processing method and device, electronic equipment and readable storage medium
CN115578736A (en) Certificate information extraction method, device, storage medium and equipment
CN115294581A (en) Method and device for identifying error characters, electronic equipment and storage medium
CN113128496B (en) Method, device and equipment for extracting structured data from image
CN111476090B (en) Watermark identification method and device
CN110826488B (en) Image identification method and device for electronic document and storage equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant