CN111401099B - Text recognition method, device and storage medium - Google Patents

Text recognition method, device and storage medium

Info

Publication number
CN111401099B
Authority
CN
China
Prior art keywords
character
word
recognition
word vector
unrecognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811616238.1A
Other languages
Chinese (zh)
Other versions
CN111401099A (en)
Inventor
邱芸
沈雷
刘孝颂
王懿
Current Assignee
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN201811616238.1A
Publication of CN111401099A
Application granted
Publication of CN111401099B
Legal status: Active
Anticipated expiration

Classifications

    • G06V 30/40 — Document-oriented image-based pattern recognition
    • G06F 18/22 — Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06V 30/153 — Segmentation of character regions using recognition of characters or words
    • G06V 30/10 — Character recognition

Abstract

The disclosure provides a text recognition method, a text recognition device, and a storage medium, relating to the field of computer technology. The method comprises the following steps: performing character recognition on an original image containing characters to be recognized to obtain a character sequence; judging whether the recognition confidence of a character in the character sequence is smaller than a preset confidence threshold; if so, determining the character to be an unrecognized character, and generating a word vector to be recognized based on the character sequence and the unrecognized character; and obtaining a similar word vector matching the word vector to be recognized from a word vector library, and determining the unrecognized character based on the similar word vector. For characters that optical character recognition cannot identify, the method, device, and storage medium obtain the corresponding word by querying the word vector library; the library can also be expanded and perfected, which improves the recognition effect of optical character recognition, raises working efficiency, and enhances competitiveness.

Description

Text recognition method, device and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a text recognition method and apparatus, and a storage medium.
Background
Pictures containing text, such as scans of original documents, cannot be directly copied or edited: the text must first be recognized, and whatever cannot be recognized must be entered manually, which is expensive, time-consuming, and labor-intensive. Character recognition technology, for example Optical Character Recognition (OCR), is widely applied in fields such as virtual reality, human-computer interaction, bill recognition, and industrial automation, but the recognition rate of conventional character recognition technology is not high. A new technical solution for text recognition is therefore needed.
Disclosure of Invention
In view of the above, one technical problem to be solved by the present disclosure is to provide a text recognition method, apparatus and storage medium.
According to an aspect of the present disclosure, there is provided a text recognition method including: carrying out character recognition on an original image containing characters to be recognized to obtain a character sequence; judging whether the recognition confidence of the characters in the character sequence is smaller than a preset confidence threshold; if so, determining that the character is an unrecognized character, and generating a word vector to be recognized based on the character sequence and the unrecognized character; and obtaining similar word vectors matched with the word vectors to be recognized in a word vector library, and determining the unrecognized characters based on the similar word vectors.
Optionally, preprocessing the original image, and performing character recognition on the preprocessed original image; the preprocessed original image comprises: high contrast images.
Optionally, the preprocessing the original image includes: carrying out graying processing on the original image; carrying out binarization processing on the original image subjected to graying processing, and carrying out optical superposition processing on the original image and a copy image of the original image to obtain the high-contrast image; wherein the binarization processing includes: gaussian blurring, inverse color and opacity.
Optionally, generating a plurality of pixel matrices corresponding to the preprocessed original image; extracting the features of the pixel matrixes to obtain a plurality of feature maps; and performing character recognition on the plurality of feature maps to obtain the character sequence.
Optionally, performing character recognition on each feature map, determining a character corresponding to the current feature of the feature map, and determining a recognition confidence of the character; generating the sequence of characters based on characters corresponding to a current feature of the feature map.
Optionally, if the current feature corresponding to the unrecognized character corresponds to a plurality of similar characters, performing word segmentation processing on the character sequence to obtain a recognition target word including the unrecognized character; converting the recognition target word into the word vector to be recognized, and obtaining the similarity between the word vector to be recognized and the word vector in the word vector library; obtaining a word vector with the highest similarity in the word vector library as the similar word vector; determining the unrecognized character based on a recognition reference word corresponding to the similar word vector.
Optionally, the performing word segmentation processing on the character sequence to obtain a recognition target word including the unrecognized character includes: determining the previously recognized character, which corresponds to the feature preceding the current feature of the unrecognized character in the feature map; and generating, according to the word segmentation result, the recognition target word comprising the recognized character and the unrecognized character.
Optionally, the determining the unrecognized character based on the recognition reference word corresponding to the similar word vector comprises: obtaining a context character of the recognition character from the recognition reference word; determining the unrecognized character as a similar character of the plurality of similar characters that matches the contextual character.
Optionally, after determining the unrecognized character, replacing the unrecognized character in the recognition target word with the similar character matched with the context character, and adding the word vector to be recognized and the recognition target word in the word vector library.
Optionally, calculating a distance between the word vector to be recognized and a word vector in the word vector library, and determining the similarity according to the distance; wherein the distance comprises: the euclidean distance.
According to another aspect of the present disclosure, there is provided a text recognition apparatus including: a character recognition module for performing character recognition on an original image containing characters to be recognized to obtain a character sequence; a correction judging module for judging whether the recognition confidence of a character in the character sequence is smaller than a preset confidence threshold; and a character correction module for, if so, determining the character to be an unrecognized character, generating a word vector to be recognized based on the character sequence and the unrecognized character, obtaining a similar word vector matching the word vector to be recognized from a word vector library, and determining the unrecognized character based on the similar word vector.
Optionally, the preprocessing module is configured to preprocess the original image, and perform character recognition on the preprocessed original image; the preprocessed original image comprises: high contrast images.
Optionally, the preprocessing module is configured to perform graying processing on the original image; carrying out binarization processing on the original image subjected to graying processing, and carrying out optical superposition processing on the original image and a copy image of the original image to obtain the high-contrast image; wherein the binarization processing includes: gaussian blurring, inverse color and opacity.
Optionally, the character recognition module is configured to generate a plurality of pixel matrices corresponding to the preprocessed original image; performing feature extraction on the pixel matrixes to obtain a plurality of feature maps; and performing character recognition on the plurality of feature maps to obtain the character sequence.
Optionally, the character recognition module is further configured to perform character recognition on each feature map, determine a character corresponding to the current feature of the feature map, and determine a recognition confidence of the character; generating the character sequence based on a character corresponding to a current feature of the feature map.
Optionally, the text correction module includes: the word segmentation unit is used for performing word segmentation processing on the character sequence to obtain a recognition target word comprising the unrecognized character if the current feature corresponding to the unrecognized character corresponds to a plurality of similar characters; the vector conversion unit is used for converting the recognition target words into the word vectors to be recognized; the vector matching unit is used for obtaining the similarity between the word vector to be recognized and the word vector in the word vector library; obtaining a word vector with the highest similarity in the word vector library as the similar word vector; a character determination unit for determining the unrecognized character based on the recognition reference word corresponding to the similar word vector.
Optionally, the word segmentation unit is configured to determine the previously recognized character, which corresponds to the feature preceding the current feature of the unrecognized character in the feature map, and to generate, according to the word segmentation result, the recognition target word comprising the recognized character and the unrecognized character.
Optionally, the character determining unit is configured to obtain a context character of the recognition character from the recognition reference word; determining the unrecognized character as a similar character of the plurality of similar characters that matches the contextual character.
Optionally, the word bank adding module is configured to, after determining the unrecognized character, replace the unrecognized character in the recognition target word with the similar character matched with the context character, and add the to-be-recognized word vector and the recognition target word to the word vector bank.
Optionally, the vector matching unit is configured to calculate a distance between the word vector to be recognized and a word vector in the word vector library, and determine the similarity according to the distance; wherein the distance comprises: the euclidean distance.
According to still another aspect of the present disclosure, there is provided a text recognition apparatus including: a memory; and a processor coupled to the memory, the processor configured to perform any of the methods above based on instructions stored in the memory.
According to yet another aspect of the present disclosure, a computer-readable storage medium is provided, which stores computer instructions for execution by a processor to perform the method as described above.
The text recognition method, device, and storage medium perform character recognition on an original image to obtain a character sequence; if the recognition confidence of a character in the sequence is below a threshold, a word vector to be recognized is generated from the character sequence and the unrecognized character, a matching similar word vector is retrieved from a word vector library, and the unrecognized character is determined from it. For characters that optical character recognition cannot identify, the corresponding word is obtained by querying the word vector library; the library can be expanded and perfected, the recognition effect of optical character recognition improved, working efficiency raised, and competitiveness enhanced.
Drawings
To illustrate the embodiments of the present disclosure or the prior-art technical solutions more clearly, the drawings needed for describing them are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present disclosure; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart diagram illustrating one embodiment of a text recognition method according to the present disclosure;
FIG. 2 is a schematic flow chart diagram illustrating another embodiment of a text recognition method according to the present disclosure;
FIG. 3 is a block diagram of one embodiment of a text recognition device according to the present disclosure;
FIG. 4 is a block diagram of a word correction module in one embodiment of a text recognition device according to the present disclosure;
FIG. 5 is a block diagram of yet another embodiment of a text recognition device according to the present disclosure.
Detailed Description
The present disclosure will now be described more fully with reference to the accompanying drawings, in which exemplary embodiments are shown. The technical solutions in the embodiments are described clearly and completely below with reference to the drawings; the described embodiments are obviously only some, not all, of the embodiments of the present disclosure. All other embodiments that a person skilled in the art can derive from them without creative effort fall within the protection scope of the present disclosure. The technical solution of the present disclosure is described in various aspects below with reference to the figures and embodiments.
Existing Optical Character Recognition (OCR) technology is mainly directed at good-quality document images: input images use standard print fonts, have clean backgrounds, and are of high resolution. In natural scenes, however, imaging is degraded by illumination, viewing angle, camera shake, and the like, and OCR recognition suffers further from varied image backgrounds and font styles.
A dictionary is one optional information source: for an unrecognizable character or word, a static dictionary can be queried for candidate words containing the surrounding context characters, and the final target character confirmed from among them. This works, however, only for common words that an ordinary dictionary contains; a general dictionary cannot handle rarer characters or words.
For example, the name of the protagonist of a literary novel may not appear in a general dictionary at all, making it difficult to infer the correct characters from context; the same holds for professional vocabulary in medical literature, the communications industry, and so on. Moreover, word banks are updated slowly, and in general are not updated at all, so the improvement a static dictionary brings to the character recognition rate is limited.
In the text recognition method of the present disclosure, characters and words are converted into word vectors (word embeddings) so that similar characters or words can be found: words that optical character recognition cannot identify are recovered by querying a vector dictionary, which effectively improves the recognition rate of the technology. Fig. 1 is a schematic flow chart of an embodiment of a text recognition method according to the present disclosure, as shown in Fig. 1:
step 101, performing character recognition on an original image containing characters to be recognized to obtain a character sequence.
Step 102, judging whether the recognition confidence of the characters in the character sequence is smaller than a preset confidence threshold.
And 103, if yes, determining the character as an unrecognized character, and generating a word vector to be recognized based on the character sequence and the unrecognized character.
The word vectors may be generated by a variety of existing methods. A word vector is a vector representation of a word: each word can be represented as a one-hot vector composed of 0s and a single 1, or by a distributed (dense) representation. Word vectors can also be obtained by machine training, for example with the word2vec deep learning algorithm or other neural-network training algorithms.
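The two representations mentioned above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the toy vocabulary is hypothetical, and a fixed random matrix stands in for embeddings that word2vec or a neural network would actually learn.

```python
import numpy as np

# Hypothetical toy vocabulary; real systems use tens of thousands of words.
vocab = ["society", "meeting", "water", "big"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Sparse 0/1 representation: a single 1 at the word's vocabulary index."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# Distributed (dense) representation: in practice this matrix is learned,
# e.g. by word2vec; a seeded random matrix stands in for it here.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 8))

def word_vector(word):
    """Dense embedding lookup for a word in the toy vocabulary."""
    return embeddings[word_to_idx[word]]
```

One-hot vectors grow with the vocabulary and carry no similarity information; the dense vectors are what the word vector library in this method stores and compares.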
And 104, obtaining similar word vectors matched with the word vectors to be recognized in the word vector library, and determining unrecognized characters based on the similar word vectors. The word vector library may include a word vector dictionary or the like.
In one embodiment, the original image is preprocessed, and the preprocessed original image is subjected to character recognition, and the preprocessed original image can be a high-contrast image or the like. The preprocessing of the original image can take a variety of methods.
For example, the original image is converted to grayscale, the grayscale image is binarized, and the result is composited with a copy of the original image by light superposition to obtain the high-contrast image. The binarization processing includes Gaussian blur, color inversion, and opacity adjustment.
Concretely, the image obtained by applying Gaussian blur, color inversion, and opacity adjustment to the original image in sequence is composited with the copy image by linear-light superposition to obtain the high-contrast image. Because Gaussian blur and linear-light superposition together act as a smoothing filter on weak illumination changes in the image, a good binarization effect is achieved.
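The preprocessing chain above can be sketched with plain numpy. This is an interpretation of the described steps, not the patent's exact pipeline: the kernel size, sigma, 50% opacity, and the linear-light formula (base + 2·blend − 1 on images scaled to [0, 1]) are assumptions.

```python
import numpy as np

def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable 1-D Gaussian kernel convolved along rows, then columns."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    k /= k.sum()
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def high_contrast(gray):
    """gray: 2-D float array in [0, 1].
    Blur -> invert -> 50% opacity over the original, then a linear-light
    composite (base + 2*blend - 1, clipped) with a copy of the original."""
    blend = 0.5 * (1.0 - gaussian_blur(gray)) + 0.5 * gray
    return np.clip(gray + 2.0 * blend - 1.0, 0.0, 1.0)
```

Production code would typically use an image library's blur and blend modes instead; the point is only the order of operations: blur, invert, reduce opacity, composite.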
In one embodiment, an original image of a natural scene is collected and its pixels form a pixel matrix. To avoid the meaningless computation a color image would add, and to increase contrast and sharpen the image, the original image is preprocessed: the background can be removed and the foreground characters highlighted. The accuracy of the characters recognized by optical character recognition is then judged; characters of low accuracy, or considered erroneous, are converted into word vectors, and the word vector library is queried to complete their recognition and output the correct character or word.
The word vector library is built dynamically, with professional lexicons classified by target category and with sample training and word segmentation performed in advance. Vector lexicons can be trained and classified separately for specialties such as literature, science and technology, medicine, and communications, with about 5,000 input words fed into each lexicon to generate a professional vector dictionary.
More than 50,000 images containing text, shot at different angles, under different illumination, and against different backgrounds, can be used for training and testing in advance, with lexicons classified by type, such as medical, literature, and communications lexicons. During recognition, the word vectors of newly recognized characters or words are added to the word vector library, enlarging its content and scale and effectively improving the character recognition rate.
Fig. 2 is a schematic flow chart of another embodiment of a text recognition method according to the present disclosure, as shown in fig. 2:
step 201, collecting an original image containing characters of a natural scene.
Step 202, the original image is preprocessed.
Step 203, generating a plurality of pixel matrixes corresponding to the preprocessed original image, and performing feature extraction on the plurality of pixel matrixes to obtain a plurality of feature maps.
And step 204, performing character recognition on the plurality of feature maps to obtain a character sequence.
Layout analysis is performed on the preprocessed original image to obtain the plurality of pixel matrices. Layout analysis divides the original image into parts such as body text, titles, and image regions, and any of various existing layout analysis methods may be used.
Feature extraction then yields the feature maps: an existing convolutional neural network model or the like extracts features from each pixel matrix. Character recognition is performed on the feature maps to obtain the character sequence; an existing long short-term memory (LSTM) model or the like can recognize the characters in each feature map, and the character sequence is generated from the characters corresponding to the current features of the feature maps.
Character recognition is performed on each feature map: the character corresponding to the current feature of the feature map is determined along with its recognition confidence, which can be computed with any of various existing threshold algorithms. If a character's recognition confidence is below the confidence threshold, the character is considered erroneous and requires word-vector correction.
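The confidence check is straightforward; a minimal sketch follows. The threshold value 0.5 matches the worked example later in this description, but the patent leaves it as a configurable parameter.

```python
CONFIDENCE_THRESHOLD = 0.5  # assumed default; the threshold is configurable

def flag_unrecognized(chars, confidences, threshold=CONFIDENCE_THRESHOLD):
    """Return the indices of characters whose recognition confidence falls
    below the threshold; these are treated as unrecognized characters and
    routed to word-vector correction."""
    return [i for i, conf in enumerate(confidences) if conf < threshold]
```

For example, a character recognized with confidence 0.3 against a 0.5 threshold is flagged, while its neighbors with confidence 0.9 pass through unchanged.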
Step 205, if the current features corresponding to the unrecognized characters correspond to a plurality of similar characters, performing word segmentation processing on the character sequence to obtain a recognition target word including the unrecognized characters.
Step 206, converting the recognition target word into a word vector to be recognized, and obtaining the similarity between the word vector to be recognized and the word vector in the word vector library.
Step 207, the word vector with the highest similarity is obtained in the word vector library as the similar word vector.
In step 208, unrecognized characters are determined based on the recognition reference words corresponding to the similar word vectors.
And step 209, after the recognition processing, expanding the word vector library and expanding and perfecting the dictionary.
There are various ways to segment the character sequence into a recognition target word containing the unrecognized character. For example, the previously recognized character, corresponding to the feature preceding the current feature of the unrecognized character in the feature map, is determined, and the recognition target word comprising that recognized character and the unrecognized character is generated from the word segmentation result.
The context character of the recognized character is obtained from the recognition reference word, and the unrecognized character is determined to be the similar character, among the plurality of similar characters, that matches the context character. After the unrecognized character is determined, it is replaced in the recognition target word by the matching similar character, and the word vector to be recognized and the corrected recognition target word are added to the word vector library.
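The replacement-and-extension step can be sketched as follows. This is a simplified reading of the description: the function names and the dict-based library are illustrative choices, and the context character is assumed to sit at the same position in the reference word as the unrecognized character does in the target word.

```python
def correct_word(target_word, unrec_idx, similar_chars, reference_word):
    """Pick the candidate similar character that matches the context character
    at the same position in the recognition reference word, and patch it into
    the recognition target word."""
    context_char = reference_word[unrec_idx]
    if context_char not in similar_chars:
        raise ValueError("no similar character matches the context character")
    return target_word[:unrec_idx] + context_char + target_word[unrec_idx + 1:]

def extend_library(library, word, to_vector):
    """Add the corrected word's vector to the word-vector library (a dict
    here) so that later queries can match it directly."""
    library[word] = to_vector(word)
```

In the English gloss of the patent's example, the target word "society-?" with candidates including "meeting" would be patched to "society-meeting" and the new vector added to the library.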
The distance between the word vector to be recognized and each word vector in the library is calculated, and the similarity is determined from the distance; the distance may be, for example, the Euclidean distance. Similarity is the degree of resemblance between two word vectors, expressed by the distance between them: the shorter the distance, the greater the similarity of the corresponding words. The distance can be described by the Euclidean distance, the cosine angle, and the like.
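The nearest-vector query can be sketched in a few lines. The dict-based library and the function names are illustrative; cosine distance could be substituted for Euclidean distance without changing the interface.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def most_similar(query, library):
    """library: word -> vector. A smaller distance means a higher similarity,
    so the word whose vector is nearest the query is the best match."""
    return min(library, key=lambda w: euclidean(query, library[w]))
```

A real word vector library with tens of thousands of entries would use an approximate nearest-neighbor index rather than a linear scan, but the selection criterion is the same.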
In one embodiment, an original image A of a natural scene containing text is acquired and preprocessed to generate a high contrast image. And performing layout analysis on the preprocessed original image to obtain a plurality of pixel matrixes. And respectively extracting the features of the pixel matrixes by adopting the conventional convolutional neural network model to obtain a plurality of feature maps. And respectively carrying out character recognition on the plurality of characteristic graphs by adopting the existing LSTM model and the like to obtain a character sequence.
Character recognition is performed on each feature map, the character corresponding to the current feature is determined, and its recognition confidence is computed with an existing threshold algorithm. Suppose the recognition confidence of the character E corresponding to the current feature C of feature map B is 0.3, below the confidence threshold of 0.5; the character E corresponding to the current feature C is then an unrecognized character.
Suppose the recognition result of the current feature C is a plurality of similar candidate characters, such as "will", "water", and "big". The previously recognized character corresponding to feature D, the feature preceding C, is obtained; it is "society". Word segmentation is performed on the character sequence using any of various existing segmentation methods, and a recognition target word comprising the recognized character "society" and the unrecognized character E is generated; in the target word, E can be set to one of the candidate characters or to a placeholder character.
The recognition target word is converted into the word vector to be recognized by an existing conversion algorithm, the Euclidean distance between it and each word vector in the library is calculated, and the library vector with the minimum Euclidean distance is selected; the word corresponding to this vector is the recognition reference word. The context character "meeting" of the recognized character "society" is obtained from the reference word, and the unrecognized character E is determined to be the similar character "meeting" that matches the context character.
After the unrecognized character E is determined, it is replaced in the recognition target word by the matching similar character "meeting"; the word vector to be recognized and the corrected recognition target word are added to the word vector library, expanding it.
In one embodiment, as shown in fig. 3, the present disclosure provides a text recognition apparatus 30, including: a character recognition module 31, a correction judgment module 32, a character correction module 33, a preprocessing module 34 and a lexicon addition module 35. The character recognition module 31 performs character recognition on an original image containing characters to be recognized to obtain a character sequence. The correction judging module 32 judges whether the recognition confidence of the characters in the character sequence is smaller than a preset confidence threshold.
If yes, the character correction module 33 determines the character as an unrecognized character, and generates a to-be-recognized word vector based on the character sequence and the unrecognized character. The character correction module 33 obtains a similar word vector matching the word vector to be recognized in the word vector library, and determines an unrecognized character based on the similar word vector.
The preprocessing module 34 preprocesses the original image, and character recognition is then performed on the preprocessed original image; the preprocessed original image includes a high-contrast image and the like. The preprocessing module 34 may perform graying processing on the original image, perform binarization processing on the grayed original image, and perform optical superposition processing with a copy of the original image to obtain the high-contrast image. The binarization processing includes Gaussian blur processing, inverse-color processing, opacity processing, and the like.
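The preprocessing chain described above can be approximated as follows. This is an illustrative sketch only: the Gaussian blur step is omitted for brevity, and the luminance weights, threshold, and opacity value are assumptions rather than values specified in the disclosure.

```python
import numpy as np

def to_high_contrast(rgb, alpha=0.5, thresh=128.0):
    """Grayscale -> inverse color -> binarize -> weighted superposition
    with a copy of the grayscale image (the 'opacity' blend)."""
    gray = rgb @ np.array([0.299, 0.587, 0.114])   # graying processing
    inverted = 255.0 - gray                        # inverse-color processing
    binary = np.where(inverted >= thresh, 255.0, 0.0)  # binarization
    # Optical superposition: blend the binarized image with the copy.
    return alpha * binary + (1.0 - alpha) * gray

img = np.full((2, 2, 3), 200.0)  # a uniform light-background patch
out = to_high_contrast(img)      # background suppressed, text would stay dark
```

A production pipeline would typically use an image library's blur and thresholding primitives rather than these hand-rolled array operations.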
In one embodiment, the character recognition module 31 generates a plurality of pixel matrices corresponding to the preprocessed original image, and performs feature extraction on the plurality of pixel matrices to obtain a plurality of feature maps. The character recognition module 31 performs character recognition on the plurality of feature maps to obtain a character sequence.
The character recognition module 31 may perform character recognition on each feature map, determine a character corresponding to the current feature of the feature map, and determine a recognition confidence of the character. The character recognition module 31 generates a character sequence based on the characters corresponding to the current features of the feature map.
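To illustrate how per-character recognition confidences are compared against the preset threshold, here is a minimal sketch; the scores, threshold value, and names are hypothetical, not values from the disclosure.

```python
# Hypothetical per-character OCR output: (character, recognition confidence).
raw = [("s", 0.98), ("o", 0.95), ("?", 0.41), ("e", 0.90)]

CONF_THRESHOLD = 0.60  # the preset confidence threshold (illustrative value)

def flag_unrecognized(sequence, threshold=CONF_THRESHOLD):
    """Replace any character whose confidence is below the threshold
    with None, marking it as unrecognized for the correction step."""
    return [(ch if conf >= threshold else None, conf) for ch, conf in sequence]

flagged = flag_unrecognized(raw)  # only the third character is flagged
```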
As shown in fig. 4, the character correction module 33 includes: a word segmentation unit 331, a vector conversion unit 332, a vector matching unit 333, and a character determination unit 334. If the current feature corresponding to the unrecognized character corresponds to a plurality of similar characters, the word segmentation unit 331 performs word segmentation processing on the character sequence to obtain a recognition target word including the unrecognized character. The vector conversion unit 332 converts the recognition target word into a word vector to be recognized.
The vector matching unit 333 obtains the similarity between the word vector to be recognized and the word vectors in the word vector library, and obtains the word vector having the highest similarity in the word vector library as the similar word vector. The character determining unit 334 determines an unrecognized character based on the recognition reference word corresponding to the similar word vector. The vector matching unit 333 calculates the distance between the word vector to be recognized and the word vector in the word vector library, and determines the similarity according to the distance; the distance includes a euclidean distance, etc.
The word segmentation unit 331 determines the previously recognized character corresponding to the feature that precedes, in the feature map, the current feature corresponding to the unrecognized character. The word segmentation unit 331 then generates, according to the word segmentation result, a recognition target word comprising the recognized character and the unrecognized character. The character determining unit 334 obtains the context character of the recognized character from the recognition reference word, and determines the unrecognized character to be the similar character that matches the context character among the plurality of similar characters.
After the unrecognized character is determined, the lexicon addition module 35 replaces the unrecognized character in the recognition target word with the similar character matching the context character, and adds the word vector to be recognized and the recognition target word to the word vector library.
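The substitution-and-expansion step can be sketched as follows, using a plain set of words as a stand-in for the word vector library; the English example word, the '?' placeholder, and the helper names are hypothetical illustrations, not part of the disclosed embodiment.

```python
def resolve_and_expand(target_word, unknown_idx, similar_chars,
                       reference_word, lexicon):
    """Choose the similar character equal to the context character at the
    same position in the reference word, substitute it into the target
    word, and add the corrected word to the lexicon."""
    context_char = reference_word[unknown_idx]
    chosen = next((c for c in similar_chars if c == context_char), None)
    if chosen is None:
        return None  # no candidate matches the context character
    corrected = (target_word[:unknown_idx] + chosen
                 + target_word[unknown_idx + 1:])
    lexicon.add(corrected)  # expand the word library
    return corrected

lexicon = {"society"}
# '?' marks the unrecognized slot; candidates are look-alike characters.
word = resolve_and_expand("me?ting", 2, ["c", "e", "o"], "meeting", lexicon)
```

In the disclosed system the lexicon would store word vectors alongside the words, so the newly corrected word's vector would be added as well.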
FIG. 5 is a block diagram of another embodiment of a text recognition device according to the present disclosure. As shown in fig. 5, the apparatus may include a memory 51, a processor 52, a communication interface 53, and a bus 54. The memory 51 is used for storing instructions, the processor 52 is coupled to the memory 51, and the processor 52 is configured to execute the text recognition method described above based on the instructions stored in the memory 51.
The memory 51 may be a high-speed RAM memory, a non-volatile memory, or the like, and the memory 51 may be a memory array. The memory 51 may also be partitioned, and the partitions may be combined into virtual volumes according to certain rules. The processor 52 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the text recognition methods of the present disclosure.
In one embodiment, the present disclosure provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement a text recognition method as in any of the above embodiments.
In the text recognition method, device, and storage medium of the above embodiments, character recognition is performed on the original image to obtain a character sequence; if the recognition confidence of a character in the character sequence is smaller than the threshold, a word vector to be recognized is generated based on the character sequence and the unrecognized character, a similar word vector matching the word vector to be recognized is obtained from the word vector library, and the unrecognized character is determined. By querying the word vector library for words that optical character recognition cannot identify, the corresponding words are obtained and the word vector library can be expanded and refined, which improves the recognition performance of optical character recognition, increases working efficiency, and enhances competitiveness.
The method and system of the present disclosure may be implemented in a number of ways. For example, the methods and systems of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
The description of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiments were chosen and described in order to best explain the principles of the disclosure and its practical application, and to enable others of ordinary skill in the art to understand the disclosure in various embodiments with various modifications as are suited to the particular use contemplated.

Claims (16)

1. A text recognition method, comprising:
carrying out character recognition on an original image containing characters to be recognized to obtain a character sequence;
wherein a plurality of pixel matrices corresponding to the original image that has been preprocessed are generated; extracting the features of the pixel matrixes to obtain a plurality of feature maps; performing character recognition on the feature maps, determining a character corresponding to the current feature of the feature maps, and determining the recognition confidence coefficient of the character; generating the character sequence based on a character corresponding to a current feature of the feature map;
judging whether the recognition confidence of the characters in the character sequence is smaller than a preset confidence threshold;
if so, determining that the character is an unrecognized character, and generating a word vector to be recognized based on the character sequence and the unrecognized character;
obtaining similar word vectors matched with the word vectors to be recognized in a word vector library, and determining the unrecognized characters based on the similar word vectors;
if the current features corresponding to the unrecognized characters correspond to a plurality of similar characters, performing word segmentation processing on the character sequence to obtain a recognition target word comprising the unrecognized characters; converting the recognition target word into the word vector to be recognized, and obtaining the similarity between the word vector to be recognized and the word vector in the word vector library; obtaining a word vector with the highest similarity in the word vector library as the similar word vector; determining the unrecognized character based on a recognition reference word corresponding to the similar word vector.
2. The method of claim 1, further comprising:
preprocessing the original image, and performing character recognition on the preprocessed original image; the preprocessed original image comprises: high contrast images.
3. The method of claim 2, the pre-processing the original image comprising:
carrying out graying processing on the original image;
carrying out binarization processing on the original image subjected to graying processing, and carrying out optical superposition processing on the original image and a copy image of the original image to obtain the high-contrast image;
wherein the binarization processing includes: gaussian blurring, inverse color and opacity.
4. The method of claim 3, wherein the word segmentation processing on the character sequence to obtain the recognition target word including the unrecognized character comprises:
determining a last recognized character corresponding to a last feature of a current feature corresponding to the unrecognized character in the feature map;
and generating the recognition target word comprising the recognition characters and the unrecognized characters according to the word segmentation result.
5. The method of claim 4, wherein the determining the unrecognized character based on the recognized reference word to which the similar word vector corresponds comprises:
obtaining a context character of the recognition character from the recognition reference word;
determining the unrecognized character as a similar character of the plurality of similar characters that matches the contextual character.
6. The method of claim 5, further comprising:
after the unrecognized character is determined, replacing the unrecognized character in the recognition target word with the similar character matched with the context character, and adding the word vector to be recognized and the recognition target word in the word vector library.
7. The method of claim 1, further comprising:
calculating the distance between the word vector to be recognized and the word vector in the word vector library, and determining the similarity according to the distance; wherein the distance comprises: the euclidean distance.
8. A text recognition apparatus comprising:
the character recognition module is used for carrying out character recognition on an original image containing characters to be recognized to obtain a character sequence;
the character recognition module is used for generating a plurality of pixel matrixes corresponding to the preprocessed original image; extracting the features of the pixel matrixes to obtain a plurality of feature maps; performing character recognition on the feature maps, determining a character corresponding to the current feature of the feature maps, and determining the recognition confidence coefficient of the character; generating the character sequence based on a character corresponding to a current feature of the feature map;
the correction judging module is used for judging whether the recognition confidence coefficient of the characters in the character sequence is smaller than a preset confidence coefficient threshold value or not;
the character correction module is used for determining the character as an unrecognized character if the character is the unrecognized character, generating a word vector to be recognized based on the character sequence and the unrecognized character, obtaining a similar word vector matched with the word vector to be recognized in a word vector library, and determining the unrecognized character based on the similar word vector;
wherein, the word correction module comprises:
a word segmentation unit, configured to perform word segmentation processing on the character sequence to obtain a recognition target word including the unrecognized character if the current feature corresponding to the unrecognized character corresponds to multiple similar characters;
the vector conversion unit is used for converting the recognition target words into the word vectors to be recognized;
the vector matching unit is used for obtaining the similarity between the word vector to be recognized and the word vector in the word vector library; obtaining a word vector with the highest similarity in the word vector library as the similar word vector;
a character determination unit for determining the unrecognized character based on the recognition reference word corresponding to the similar word vector.
9. The apparatus of claim 8, further comprising:
the preprocessing module is used for preprocessing the original image and performing character recognition on the preprocessed original image; the preprocessed original image comprises: high contrast images.
10. The apparatus of claim 9, wherein,
the preprocessing module is used for carrying out graying processing on the original image; carrying out binarization processing on the original image subjected to graying processing, and carrying out optical superposition processing on the original image and a copy image of the original image to obtain the high-contrast image;
wherein the binarization processing includes: gaussian blurring, inverse color and opacity.
11. The apparatus of claim 10, wherein,
the word segmentation unit is used for determining a last recognition character corresponding to a last feature of the current feature corresponding to the unrecognized character in the feature map; and generating the recognition target word comprising the recognition character and the unrecognized character according to the word segmentation result.
12. The apparatus of claim 11, wherein,
the character determining unit is used for obtaining context characters of the recognition characters from the recognition reference words; determining the unrecognized character as a similar character of the plurality of similar characters that matches the contextual character.
13. The apparatus of claim 12, further comprising:
and the word bank adding module is used for replacing the unrecognized characters in the recognition target words with the similar characters matched with the context characters after determining the unrecognized characters, and adding the word vectors to be recognized and the recognition target words in the word vector bank.
14. The apparatus of claim 10, wherein,
the vector matching unit is used for calculating the distance between the word vector to be recognized and the word vector in the word vector library and determining the similarity according to the distance; wherein the distance comprises: the euclidean distance.
15. A text recognition apparatus comprising:
a memory; and a processor coupled to the memory, the processor configured to perform the method of any of claims 1-7 based on instructions stored in the memory.
16. A computer-readable storage medium having stored thereon computer instructions for execution by a processor of the method of any one of claims 1 to 7.
CN201811616238.1A 2018-12-28 2018-12-28 Text recognition method, device and storage medium Active CN111401099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811616238.1A CN111401099B (en) 2018-12-28 2018-12-28 Text recognition method, device and storage medium


Publications (2)

Publication Number Publication Date
CN111401099A CN111401099A (en) 2020-07-10
CN111401099B true CN111401099B (en) 2023-04-07

Family

ID=71430095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811616238.1A Active CN111401099B (en) 2018-12-28 2018-12-28 Text recognition method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111401099B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149680B (en) * 2020-09-28 2024-01-16 武汉悦学帮网络技术有限公司 Method and device for detecting and identifying wrong words, electronic equipment and storage medium
CN112699780A (en) * 2020-12-29 2021-04-23 上海臣星软件技术有限公司 Object identification method, device, equipment and storage medium
CN112784932A (en) * 2021-03-01 2021-05-11 北京百炼智能科技有限公司 Font identification method and device and storage medium
CN113077018A (en) * 2021-06-07 2021-07-06 浙江大华技术股份有限公司 Target object identification method and device, storage medium and electronic device
CN114580429A (en) * 2022-01-26 2022-06-03 云捷计算机软件(江苏)有限责任公司 Artificial intelligence-based language and image understanding integrated service system

Citations (1)

Publication number Priority date Publication date Assignee Title
CN108628830A (en) * 2018-04-24 2018-10-09 北京京东尚科信息技术有限公司 A kind of method and apparatus of semantics recognition

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
US7499588B2 (en) * 2004-05-20 2009-03-03 Microsoft Corporation Low resolution OCR for camera acquired documents
JP6342298B2 (en) * 2014-10-31 2018-06-13 株式会社東芝 Character recognition device, image display device, image search device, character recognition method and program
CN104462378B (en) * 2014-12-09 2017-11-21 北京国双科技有限公司 Data processing method and device for text identification
CN107133622B (en) * 2016-02-29 2022-08-26 阿里巴巴集团控股有限公司 Word segmentation method and device
CN106874910A (en) * 2017-01-19 2017-06-20 广州优库电子有限公司 The self-service meter reading terminal of low-power consumption instrument long-distance and method based on OCR identifications
CN106960206B (en) * 2017-02-08 2021-01-01 北京捷通华声科技股份有限公司 Character recognition method and character recognition system
CN108898137B (en) * 2018-05-25 2022-04-12 黄凯 Natural image character recognition method and system based on deep neural network

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN108628830A (en) * 2018-04-24 2018-10-09 北京京东尚科信息技术有限公司 A kind of method and apparatus of semantics recognition

Non-Patent Citations (1)

Title
Harshit Pande, "Effective search space reduction for spell correction using character neural embeddings", Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 2017, vol. 2, pp. 170-174. *

Also Published As

Publication number Publication date
CN111401099A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN111401099B (en) Text recognition method, device and storage medium
US20190180154A1 (en) Text recognition using artificial intelligence
JP3822277B2 (en) Character template set learning machine operation method
US8494273B2 (en) Adaptive optical character recognition on a document with distorted characters
US20180137350A1 (en) System and method of character recognition using fully convolutional neural networks with attention
CN110647829A (en) Bill text recognition method and system
CN110135414B (en) Corpus updating method, apparatus, storage medium and terminal
Berg-Kirkpatrick et al. Unsupervised transcription of historical documents
CN110178139B (en) System and method for character recognition using a full convolutional neural network with attention mechanisms
CN111523622B (en) Method for simulating handwriting by mechanical arm based on characteristic image self-learning
CN109389115B (en) Text recognition method, device, storage medium and computer equipment
CN111488732B (en) Method, system and related equipment for detecting deformed keywords
Zhai et al. Chinese image text recognition with BLSTM-CTC: a segmentation-free method
US8340428B2 (en) Unsupervised writer style adaptation for handwritten word spotting
Mohammad et al. Contour-based character segmentation for printed Arabic text with diacritics
CN114005127A (en) Image optical character recognition method based on deep learning, storage device and server
Ahmed et al. Printed Arabic text recognition
CN112417087A (en) Character-based tracing method and system
CN112651392A (en) Certificate information acquisition method and device, storage medium and computer equipment
Al Ghamdi A novel approach to printed Arabic optical character recognition
Smitha et al. Document image analysis using imagemagick and tesseract-ocr
CN111612045B (en) Universal method for acquiring target detection data set
Kumar et al. Line based robust script identification for indianlanguages
EP3757825A1 (en) Methods and systems for automatic text segmentation
CN116311275B (en) Text recognition method and system based on seq2seq language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant