CN112801085A - Method, device, medium and electronic equipment for recognizing characters in image - Google Patents

Method, device, medium and electronic equipment for recognizing characters in image

Info

Publication number
CN112801085A
Authority
CN
China
Prior art keywords
character
image
training
characters
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110176821.0A
Other languages
Chinese (zh)
Inventor
冯煜博
徐娇
王广普
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Linlong Technology Co ltd
Original Assignee
Shenyang Linlong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Linlong Technology Co ltd
Priority to CN202110176821.0A
Publication of CN112801085A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; extraction of features in feature space; blind source separation
    • G06F 18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00: Character recognition; recognising digital ink; document-oriented image-based pattern recognition
    • G06V 30/10: Character recognition

Abstract

An embodiment of the invention discloses a method, an apparatus, a medium and an electronic device for recognizing characters in an image. The method comprises the following steps: acquiring a character image region to be recognized; extracting character features if the character image region to be recognized contains characters; inputting the character features into a pre-trained language model so that the model predicts each character, obtaining a character prediction result, where the pre-trained language model is trained on pre-constructed masked training samples; and taking the character prediction result as the recognition result of the characters in the image. With the technical scheme provided by this application, characters can be recognized accurately even in low-quality images.

Description

Method, device, medium and electronic equipment for recognizing characters in image
Technical Field
Embodiments of the invention relate to the technical field of image recognition, and in particular to a method, an apparatus, a medium and an electronic device for recognizing characters in an image.
Background
With the development of science and technology, image processing has become part of many fields. In some scenarios, the characters in an image need to be converted into text, which requires enhancing the image and then recognizing its characters. Enhancement mainly involves image denoising, image super-resolution and image deblurring; character recognition is performed on that basis to extract the text in the image automatically. However, in some scenarios involving low-quality images, the characters are severely blurred or even damaged, so the error rate of character extraction is very high; if manual verification is then needed, the efficiency of character recognition drops greatly and its cost rises.
Disclosure of Invention
Embodiments of the invention provide a method, an apparatus, a medium and an electronic device for recognizing characters in an image, which can recognize characters accurately even in low-quality images.
In a first aspect, an embodiment of the present invention provides a method for recognizing characters in an image, where the method comprises:
acquiring a character image region to be recognized;
extracting character features if the character image region to be recognized contains characters;
inputting the character features into a pre-trained language model so that the model predicts each character, obtaining a character prediction result, where the pre-trained language model is trained on pre-constructed masked training samples;
and taking the character prediction result as the recognition result of the characters in the image.
Further, extracting character features comprises:
extracting the character features of the image to be recognized using a feature extraction layer composed of a convolutional neural network and a pooling layer.
Further, extracting the character features of the image to be recognized using a feature extraction layer composed of a convolutional layer and a pooling layer comprises the following steps:
performing feature extraction on the image to be recognized using a convolutional neural network to obtain a feature map;
performing max pooling on the extracted feature map using a pooling layer to obtain a refined feature map;
and converting the refined feature map into a feature sequence.
Further, before converting the refined feature map into the feature sequence, the method further comprises:
normalizing the refined feature map to obtain a normalization result;
correspondingly, converting the refined feature map into the feature sequence comprises:
converting the normalization result into the feature sequence.
Further, the training process of the pre-trained language model comprises:
obtaining masked training samples, where a masked training sample contains single characters that are partially and/or fully masked;
dividing the training samples into a training set and a test set;
inputting the training-set samples into an initial network model for training, so that each current character is predicted from the correlation coefficients between its context and the character being predicted;
and if the initial network model satisfies preset conditions when tested on the test-set samples, determining the initial network model to be the pre-trained language model.
In a second aspect, an embodiment of the present invention further provides an apparatus for recognizing characters in an image, comprising:
a character image region acquisition module, configured to acquire a character image region to be recognized;
a character feature extraction module, configured to extract character features if the character image region to be recognized contains characters;
a character prediction result determination module, configured to input the character features into a pre-trained language model so that the model predicts each character, obtaining a character prediction result, where the pre-trained language model is trained on pre-constructed masked training samples;
and a recognition result determination module, configured to take the character prediction result as the recognition result of the characters in the image.
Further, the character feature extraction module comprises:
a feature extraction unit, configured to extract the character features of the image to be recognized using a feature extraction layer composed of a convolutional neural network and a pooling layer.
Further, the feature extraction unit is specifically configured to:
perform feature extraction on the image to be recognized using a convolutional neural network to obtain a feature map;
perform max pooling on the extracted feature map using a pooling layer to obtain a refined feature map;
and convert the refined feature map into a feature sequence.
Further, the character feature extraction module further comprises:
a normalization unit, configured to normalize the refined feature map to obtain a normalization result;
correspondingly, converting the refined feature map into the feature sequence comprises:
converting the normalization result into the feature sequence.
Further, the training process of the pre-trained language model comprises:
obtaining masked training samples, where a masked training sample contains single characters that are partially and/or fully masked;
dividing the training samples into a training set and a test set;
inputting the training-set samples into an initial network model for training, so that each current character is predicted from the correlation coefficients between its context and the character being predicted;
and if the initial network model satisfies preset conditions when tested on the test-set samples, determining the initial network model to be the pre-trained language model.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the method for recognizing characters in an image according to the embodiments of the present application.
In a fourth aspect, an embodiment of the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable by the processor, where the processor, when executing the computer program, implements the method for recognizing characters in an image according to the embodiments of the present application.
According to the technical scheme provided by the embodiments of this application, a character image region to be recognized is acquired; character features are extracted if the region contains characters; the character features are input into a pre-trained language model so that the model predicts each character, yielding a character prediction result, where the pre-trained language model is trained on pre-constructed masked training samples; and the character prediction result is taken as the recognition result of the characters in the image. With this technical scheme, characters can be recognized accurately even in low-quality images.
Drawings
Fig. 1 is a flowchart of a method for recognizing characters in an image according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a low-quality image according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a process of recognizing characters in an image according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a model used in the recognition process according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a device for recognizing characters in an image according to a second embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Image enhancement: during acquisition, transmission and storage, complex real-world imaging factors (such as noise, blur and distortion) reduce the perceived visual quality of an image. To restore a low-quality image to a high-quality one, researchers have proposed many methods, among which image denoising, image super-resolution and image deblurring are representative.
Character recognition (OCR): optical character recognition, a technology for automatically recognizing the text in images. It has a long research history and a wide range of applications, such as document digitization, identity authentication, digital financial systems and license plate recognition. In a factory, automatically extracting the text information of products makes them easier to manage; students' offline homework or test papers can be digitized by an OCR system, making communication between teachers and students more effective.
In traditional schemes, image enhancement can only improve the quality of images containing objects or people; it cannot restore (for example, complete) incomplete characters in an image. Character recognition can only recognize characters imaged with relatively high quality; it cannot handle incomplete characters, nor text in noisy, blurred or low-resolution pictures. It is therefore difficult to recognize characters accurately in low-quality images with traditional schemes, especially in images where parts of the text are missing.
This is because language processing belongs to cognitive intelligence, whereas image processing belongs to perceptual intelligence. Merely increasing data volume and computing power cannot make a deep-learning model evolve from perceptual intelligence to cognitive intelligence; knowledge must be introduced to assist the model's learning before the model can improve.
Traditional character recognition research has not considered real industrial problems such as low-resolution images or printed characters that are damaged or occluded, so academic research on these problems is lacking.
Some language-model research has likewise stalled for lack of practical, industry-driven problems, so current mainstream research does not address the core problem posed here, namely character recognition in low-quality images.
This scheme therefore employs a pre-trained language model: a neural network pre-trained on large-scale unsupervised text, which gives the model a degree of natural-language understanding. Fine-tuning the pre-trained model on the target domain then lets it handle problems in that domain better.
Example one
Fig. 1 is a flowchart of a method for recognizing characters in an image according to an embodiment of the present invention. The embodiment is applicable to character recognition on low-quality images. The method can be executed by the apparatus for recognizing characters in an image provided by an embodiment of the present invention, which can be implemented in software and/or hardware and integrated into an electronic device of a service system.
As shown in fig. 1, the method includes:
and S110, acquiring a character image area to be recognized.
The character image region to be recognized may be the region containing text in a low-quality image as described above. Most low-quality images come from scanning, such as book scans and newspaper scans. It can be understood that in a low-quality image the text region may be partially occluded, so that one or more characters are partially or completely blocked and their features cannot be extracted.
Fig. 2 is a schematic diagram of a low-quality image according to an embodiment of the present invention. As shown in Fig. 2, the low-quality image is a scanned book page: even after image enhancement processing the characters remain so blurred that they cannot be recognized manually, and after conventional character recognition the error rate of the result is extremely high. Conventional language models, for their part, cannot process pictures.
And S120, if the character image area to be recognized contains characters, extracting character features.
First, it is determined whether the region contains characters. If it does, the character recognition scheme provided here is adopted; if not, ordinary image recognition is performed directly.
If text is present, features can be extracted from the text region. Specifically, a convolutional neural network (CNN) can extract features from the image to obtain a feature map. A basic CNN consists of three structures: convolution, activation and pooling. The CNN outputs a specific feature space for each image. In an image classification task, this feature space serves as the input of a fully connected layer or fully connected network (FCN), which completes the mapping from the input image to the label set, i.e. the classification. The most important work in the whole process is iteratively adjusting the network weights from the training data, i.e. the back-propagation algorithm. Current mainstream convolutional neural networks, such as VGG and ResNet, are built by adjusting and combining simple CNN blocks.
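As a minimal illustration of that classification pipeline (a generic PyTorch sketch; the patent names neither a framework nor concrete layer sizes, so everything here is illustrative), the CNN's output feature space feeds a fully connected layer that maps the image to the label set:

```python
import torch
import torch.nn as nn

# Generic CNN-plus-fully-connected classifier; all sizes are illustrative.
cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # convolution
    nn.ReLU(),                                  # activation
    nn.MaxPool2d(2),                            # pooling
    nn.Flatten(),                               # feature space as a vector
)
fc = nn.Linear(8 * 16 * 16, 10)                 # fully connected layer: feature space -> label set

x = torch.randn(1, 1, 32, 32)                   # one 32x32 single-channel image
logits = fc(cnn(x))                             # classification scores over 10 labels
```

During training, these weights would be adjusted iteratively from the training data by back-propagation, as noted above.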
In this scheme, optionally, extracting character features comprises:
extracting the character features of the image to be recognized using a feature extraction layer composed of a convolutional neural network and a pooling layer.
The features extracted from the image can be used directly as the subsequent input without further processing, or they can be converted into a feature sequence that serves as the subsequent input.
Specifically, extracting the character features of the image to be recognized using a feature extraction layer composed of a convolutional layer and a pooling layer comprises the following steps:
performing feature extraction on the image to be recognized using a convolutional neural network to obtain a feature map;
performing max pooling on the extracted feature map using a pooling layer to obtain a refined feature map;
and converting the refined feature map into a feature sequence.
A convolutional neural network (CNN) performs feature extraction on the image to obtain feature maps;
a pooling layer then performs max pooling on the extracted feature maps to obtain refined feature maps.
After pooling, the refined feature maps can be normalized to obtain a normalization result;
correspondingly, converting the refined feature map into the feature sequence comprises:
converting the normalization result into the feature sequence.
Applying batch normalization to the refined feature maps with a normalization layer prevents vanishing gradients in the neural network and makes the result more accurate.
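The patent gives no concrete layer configuration, so the following is a minimal PyTorch sketch (the framework and all channel and kernel sizes are assumptions) of one convolution, max-pooling, batch-normalization round as described above:

```python
import torch
import torch.nn as nn

class FeatureBlock(nn.Module):
    """One feature-extraction round: convolution -> max pooling -> batch normalization."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)           # activation, as in the CNN description above
        self.pool = nn.MaxPool2d(kernel_size=2)    # max pooling yields the refined feature map
        self.norm = nn.BatchNorm2d(out_channels)   # batch normalization against vanishing gradients

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.pool(self.act(self.conv(x))))

fmap = FeatureBlock(1, 16)(torch.randn(1, 1, 64, 256))   # -> refined feature map (1, 16, 32, 128)
```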
S130, inputting the character features into a pre-trained language model so that the model predicts each character, obtaining a character prediction result; the pre-trained language model is trained on pre-constructed masked training samples.
Once the character features are obtained, they can be input into the pre-trained language model so that it recognizes the characters one by one, predicting each character from its context to obtain the character prediction result.
In this scheme, optionally, the training process of the pre-trained language model comprises:
obtaining masked training samples, where a masked training sample contains single characters that are partially and/or fully masked;
dividing the training samples into a training set and a test set;
inputting the training-set samples into an initial network model for training, so that each current character is predicted from the correlation coefficients between its context and the character being predicted;
and if the initial network model satisfies preset conditions when tested on the test-set samples, determining the initial network model to be the pre-trained language model.
Specifically, the training samples can be text images that already contain occlusions, or clear text images in which part of the text has been masked manually. After division into a training set and a test set, the training set is used for training, and the test set is used to determine whether the trained initial model converges and whether its character prediction accuracy reaches the preset condition. The preset condition can be set, for example, to an accuracy of 99.5% or even higher.
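The patent leaves the masking procedure abstract. As a sketch of one plausible construction (all function names and parameters here are hypothetical, not from the patent), clear text-line images could have randomly chosen character boxes fully or partially painted over, with the results then divided into a training set and a test set:

```python
import random
import numpy as np

def mask_characters(image: np.ndarray, char_boxes, p_mask=0.15, p_partial=0.5):
    """Occlude some characters fully or partially; a hypothetical construction.

    image      -- H x W grayscale text-line image
    char_boxes -- list of (x0, y0, x1, y1) character bounding boxes
    """
    masked = image.copy()
    for (x0, y0, x1, y1) in char_boxes:
        if random.random() >= p_mask:
            continue                          # leave most characters untouched
        if random.random() < p_partial:
            x1 = (x0 + x1) // 2               # partial masking: cover half the glyph
        masked[y0:y1, x0:x1] = 0              # paint the region over, simulating damage
    return masked

def split_samples(samples, test_ratio=0.2, seed=0):
    """Divide the masked samples into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]
```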
In this scheme, a traditional character detection and recognition method first locates the region of the image where characters appear, and a character recognition network then processes the image in that region. While the characters are being recognized, the pre-trained language model predicts the characters in the current recognition region from their context. Finally, the output layer jointly considers the predictions of the character recognition network and the pre-trained language model, and outputs the model's character recognition result based on implicit neural-network information such as context and image information.
And S140, taking the character prediction result as the recognition result of the characters in the image.
It can be understood that once prediction is complete, the prediction result can be used directly as the final character recognition result, which completes character recognition on the low-quality image.
Fig. 3 is a schematic diagram of the process of recognizing characters in an image according to an embodiment of the present invention. As shown in Fig. 3, the process as actually executed mainly includes the following steps:
Step 1: input the character image to be recognized;
Step 2: perform feature extraction on the image with a convolutional neural network (CNN) to obtain feature maps;
Step 3: apply max pooling to the extracted feature maps with a pooling layer to obtain refined feature maps;
Step 4: apply batch normalization to the refined feature maps with a normalization layer to prevent vanishing gradients;
Step 5: execute steps 2 to 4 six times in a loop;
Step 6: convert the feature maps into a feature sequence with a map-to-sequence network;
Step 7: input the feature sequence into a BERT model for prediction to obtain the character recognition result (a minimal sketch of steps 2 to 6 is given below).
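The following self-contained PyTorch sketch wires steps 2 to 6 together. The framework, the layer widths and the column-wise map-to-sequence layout are all assumptions; the patent fixes only the six repetitions:

```python
import torch
import torch.nn as nn

def block(cin, cout):
    """One round of steps 2-4: convolution, activation, max pooling, batch normalization."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2), nn.BatchNorm2d(cout),
    )

class Backbone(nn.Module):
    """Steps 2-6: six conv/pool/norm rounds, then map-to-sequence."""
    def __init__(self):
        super().__init__()
        chans = (1, 16, 32, 64, 64, 128, 128)   # illustrative channel widths
        # Step 5: steps 2-4 are executed six times
        self.blocks = nn.Sequential(*[block(chans[i], chans[i + 1]) for i in range(6)])

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        fmap = self.blocks(image)               # (B, C, H, W); input height must be >= 64
        b, c, h, w = fmap.shape
        # Step 6, map-to-sequence: each column of the feature map becomes one sequence element
        return fmap.permute(0, 3, 1, 2).reshape(b, w, c * h)

seq = Backbone()(torch.randn(1, 1, 64, 512))    # -> (1, 8, 128): feature sequence for step 7
```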
Fig. 4 is a schematic structural diagram of the model used in the recognition process according to an embodiment of the present invention. As shown in Fig. 4, the model consists of three parts: a convolutional layer, a map-to-sequence layer and a fully connected layer.
The convolution layer is used for extracting high-dimensional latent semantic features of the image;
the map-to-sequence layer is used for converting the three-dimensional continuous tensor into a three-dimensional sequence tensor;
the fully connected layer receives the sequence features of the image and maps them to text.
Specifically, in the BERT model, [CLS] denotes the classification-task token, Fe is a feature, E is an embedding vector, C is the classification label, T is the contextual representation of a character, and O is a character predicted by the model.
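The patent does not specify how the image feature sequence enters BERT. One plausible wiring (a sketch assuming the Hugging Face transformers library and illustrative sizes) feeds the sequence in through inputs_embeds in place of the token embeddings E, with a linear head mapping each contextual representation T to a predicted character O:

```python
import torch
from transformers import BertConfig, BertModel

VOCAB_SIZE = 6000    # illustrative character-vocabulary size
HIDDEN = 128         # must match the feature-sequence width (project with a linear layer if not)

config = BertConfig(vocab_size=VOCAB_SIZE, hidden_size=HIDDEN,
                    num_hidden_layers=4, num_attention_heads=4,
                    intermediate_size=4 * HIDDEN)
bert = BertModel(config)
to_chars = torch.nn.Linear(HIDDEN, VOCAB_SIZE)

feats = torch.randn(1, 8, HIDDEN)                  # Fe: feature sequence from map-to-sequence
T = bert(inputs_embeds=feats).last_hidden_state    # T: contextual representation per position
O = to_chars(T).argmax(-1)                         # O: predicted character index per position
```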
This scheme proposes a new multimodal task combining natural language processing and computer vision: character recognition in low-quality images.
In addition, the scheme extends the character recognition task to a more specialized field, broadening the range of applications of character recognition models. As research on the task proposed here develops, artificial intelligence may assist cultural-relic protection work such as recognizing and restoring the characters in ancient books, or fields such as high-altitude satellite surveying.
Example two
Fig. 5 is a schematic structural diagram of an apparatus for recognizing characters in an image according to a second embodiment of the present invention. As shown in fig. 5, the apparatus for recognizing characters in an image includes:
a character image region acquisition module 510, configured to acquire a character image region to be recognized;
a character feature extraction module 520, configured to extract character features if the character image region to be recognized contains characters;
a character prediction result determination module 530, configured to input the character features into a pre-trained language model so that the model predicts each character, obtaining a character prediction result, where the pre-trained language model is trained on pre-constructed masked training samples;
and a recognition result determination module 540, configured to take the character prediction result as the recognition result of the characters in the image.
Further, the character feature extraction module comprises:
a feature extraction unit, configured to extract the character features of the image to be recognized using a feature extraction layer composed of a convolutional neural network and a pooling layer.
Further, the feature extraction unit is specifically configured to:
perform feature extraction on the image to be recognized using a convolutional neural network to obtain a feature map;
perform max pooling on the extracted feature map using a pooling layer to obtain a refined feature map;
and convert the refined feature map into a feature sequence.
Further, the character feature extraction module further comprises:
a normalization unit, configured to normalize the refined feature map to obtain a normalization result;
correspondingly, converting the refined feature map into the feature sequence comprises:
converting the normalization result into the feature sequence.
Further, the training process of the pre-trained language model comprises:
obtaining masked training samples, where a masked training sample contains single characters that are partially and/or fully masked;
dividing the training samples into a training set and a test set;
inputting the training-set samples into an initial network model for training, so that each current character is predicted from the correlation coefficients between its context and the character being predicted;
and if the initial network model satisfies preset conditions when tested on the test-set samples, determining the initial network model to be the pre-trained language model.
This apparatus can execute the method provided by the first embodiment of the invention, and has functional modules corresponding to the method and its beneficial effects.
EXAMPLE III
Embodiments of the present application also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a method for recognizing characters in an image, the method comprising:
acquiring a character image region to be recognized;
extracting character features if the character image region to be recognized contains characters;
inputting the character features into a pre-trained language model so that the model predicts each character, obtaining a character prediction result, where the pre-trained language model is trained on pre-constructed masked training samples;
and taking the character prediction result as the recognition result of the characters in the image.
Storage medium: any of various types of memory devices or storage devices. The term "storage medium" is intended to include: installation media such as CD-ROMs, floppy disks or tape devices; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM or Rambus RAM; non-volatile memory such as flash memory, magnetic media (e.g. a hard disk) or optical storage; and registers or other similar types of memory elements. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in the computer system in which the program is executed, or in a different, second computer system connected to that computer system through a network (such as the internet); the second computer system may then provide the program instructions to the first computer for execution. The term "storage medium" may include two or more storage media residing in different locations, for example in different computer systems connected by a network. The storage medium may store program instructions (for example, embodied as a computer program) executable by one or more processors.
Of course, the computer-executable instructions contained in the storage medium provided by the embodiments of the present application are not limited to the character recognition operations described above, and may also perform related operations in the method for recognizing characters in an image provided by any embodiment of the present application.
Example four
This embodiment of the application provides an electronic device. Fig. 6 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application. As shown in fig. 6, this embodiment provides an electronic device 600 comprising: one or more processors 620; and a storage device 610 configured to store one or more programs which, when executed by the one or more processors 620, cause the one or more processors 620 to implement the method for recognizing characters in an image provided by the embodiments of the present application, the method comprising:
acquiring a character image region to be recognized;
extracting character features if the character image region to be recognized contains characters;
inputting the character features into a pre-trained language model so that the model predicts each character, obtaining a character prediction result, where the pre-trained language model is trained on pre-constructed masked training samples;
and taking the character prediction result as the recognition result of the characters in the image.
The electronic device 600 shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 6, the electronic device 600 includes a processor 620, a storage device 610, an input device 630, and an output device 640; the number of the processors 620 in the electronic device may be one or more, and one processor 620 is taken as an example in fig. 6; the processor 620, the storage device 610, the input device 630, and the output device 640 in the electronic apparatus may be connected by a bus or other means, and are exemplified by being connected by a bus 650 in fig. 6.
The storage device 610 is a computer-readable storage medium, and can be used to store software programs, computer-executable programs, and module units, such as program instructions corresponding to the recognition method of characters in images in the embodiment of the present application.
The storage device 610 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. In addition, the storage 610 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the storage 610 may further include memory located remotely from the processor 620, which may be connected via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input means 630 may be used to receive input numbers, character information, or voice information, and to generate key signal inputs related to user settings and function control of the electronic device. The output device 640 may include a display screen, a speaker, and other electronic devices.
The electronic device provided by this embodiment of the application can recognize characters accurately even in low-quality images.
The apparatus, medium and electronic device for recognizing characters in images provided in the above embodiments can execute the method for recognizing characters in an image provided by any embodiment of the present application, and have functional modules corresponding to that method and its beneficial effects. For technical details not described in detail above, reference may be made to the method for recognizing characters in an image provided by any embodiment of the present application.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A method for recognizing characters in an image, the method comprising:
acquiring a character image region to be recognized;
extracting character features if the character image region to be recognized contains characters;
inputting the character features into a pre-trained language model so that the model predicts each character, obtaining a character prediction result, where the pre-trained language model is trained on pre-constructed masked training samples;
and taking the character prediction result as the recognition result of the characters in the image.
2. The method of claim 1, wherein extracting character features comprises:
extracting the character features of the image to be recognized using a feature extraction layer composed of a convolutional neural network and a pooling layer.
3. The method of claim 2, wherein extracting the character features of the image to be recognized using a feature extraction layer composed of a convolutional layer and a pooling layer comprises:
performing feature extraction on the image to be recognized using a convolutional neural network to obtain a feature map;
performing max pooling on the extracted feature map using a pooling layer to obtain a refined feature map;
and converting the refined feature map into a feature sequence.
4. The method of claim 3, wherein before converting the refined feature map into the feature sequence, the method further comprises:
normalizing the refined feature map to obtain a normalization result;
correspondingly, converting the refined feature map into the feature sequence comprises:
converting the normalization result into the feature sequence.
5. The method of claim 1, wherein the training process of the pre-trained language model comprises:
obtaining masked training samples, where a masked training sample contains single characters that are partially and/or fully masked;
dividing the training samples into a training set and a test set;
inputting the training-set samples into an initial network model for training, so that each current character is predicted from the correlation coefficients between its context and the character being predicted;
and if the initial network model satisfies preset conditions when tested on the test-set samples, determining the initial network model to be the pre-trained language model.
6. An apparatus for recognizing characters in an image, the apparatus comprising:
a character image region acquisition module, configured to acquire a character image region to be recognized;
a character feature extraction module, configured to extract character features if the character image region to be recognized contains characters;
a character prediction result determination module, configured to input the character features into a pre-trained language model so that the model predicts each character, obtaining a character prediction result, where the pre-trained language model is trained on pre-constructed masked training samples;
and a recognition result determination module, configured to take the character prediction result as the recognition result of the characters in the image.
7. The apparatus of claim 6, wherein the character feature extraction module comprises:
a feature extraction unit, configured to extract the character features of the image to be recognized using a feature extraction layer composed of a convolutional neural network and a pooling layer.
8. The apparatus of claim 7, wherein the feature extraction unit is specifically configured to:
perform feature extraction on the image to be recognized using a convolutional neural network to obtain a feature map;
perform max pooling on the extracted feature map using a pooling layer to obtain a refined feature map;
and convert the refined feature map into a feature sequence.
9. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for recognizing characters in an image according to any one of claims 1 to 5.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for recognizing characters in an image according to any one of claims 1 to 5.
CN202110176821.0A 2021-02-09 2021-02-09 Method, device, medium and electronic equipment for recognizing characters in image Pending CN112801085A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110176821.0A CN112801085A (en) 2021-02-09 2021-02-09 Method, device, medium and electronic equipment for recognizing characters in image

Publications (1)

Publication Number Publication Date
CN112801085A true CN112801085A (en) 2021-05-14

Family

ID=75814903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110176821.0A Pending CN112801085A (en) 2021-02-09 2021-02-09 Method, device, medium and electronic equipment for recognizing characters in image

Country Status (1)

Country Link
CN (1) CN112801085A (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654129A (en) * 2015-12-30 2016-06-08 成都数联铭品科技有限公司 Optical character sequence recognition method
WO2019232853A1 (en) * 2018-06-04 2019-12-12 平安科技(深圳)有限公司 Chinese model training method, chinese image recognition method, device, apparatus and medium
WO2019232847A1 (en) * 2018-06-04 2019-12-12 平安科技(深圳)有限公司 Handwriting model training method, handwritten character recognition method and apparatus, and device and medium
CN109447078A (en) * 2018-10-23 2019-03-08 四川大学 A kind of detection recognition method of natural scene image sensitivity text
CN109685055A (en) * 2018-12-26 2019-04-26 北京金山数字娱乐科技有限公司 Text filed detection method and device in a kind of image
CN111126068A (en) * 2019-12-25 2020-05-08 中电云脑(天津)科技有限公司 Chinese named entity recognition method and device and electronic equipment
CN111275038A (en) * 2020-01-17 2020-06-12 平安医疗健康管理股份有限公司 Image text recognition method and device, computer equipment and computer storage medium
CN111159416A (en) * 2020-04-02 2020-05-15 腾讯科技(深圳)有限公司 Language task model training method and device, electronic equipment and storage medium
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN111832564A (en) * 2020-07-20 2020-10-27 浙江诺诺网络科技有限公司 Image character recognition method and system, electronic equipment and storage medium
CN111860525A (en) * 2020-08-06 2020-10-30 宁夏宁电电力设计有限公司 Bottom-up optical character recognition method suitable for terminal block
CN112036292A (en) * 2020-08-27 2020-12-04 平安科技(深圳)有限公司 Character recognition method and device based on neural network and readable storage medium
CN112329767A (en) * 2020-10-15 2021-02-05 方正株式(武汉)科技开发有限公司 Contract text image key information extraction system and method based on joint pre-training
CN112330569A (en) * 2020-11-27 2021-02-05 上海眼控科技股份有限公司 Model training method, text denoising method, device, equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIM S et al., "Chinese text classification based on character-level CNN and SVM", The Institute of Internet, Broadcasting and Communication, pages 1-6 *
吕云翔 et al., "Python深度学习" (Python Deep Learning), Beijing: 机械工业出版社, 30 September 2020, pages 98-101 *
周成伟, "基于卷积神经网络的自然场景中数字识别" (Digit recognition in natural scenes based on convolutional neural networks), 计算机技术与发展 (Computer Technology and Development), no. 11, pages 107-111 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023093361A1 (en) * 2021-11-25 2023-06-01 北京有竹居网络技术有限公司 Image character recognition model training method, and image character recognition method and apparatus
CN115035538A (en) * 2022-03-22 2022-09-09 北京百度网讯科技有限公司 Training method of text recognition model, and text recognition method and device
CN116612466A (en) * 2023-07-20 2023-08-18 腾讯科技(深圳)有限公司 Content identification method, device, equipment and medium based on artificial intelligence
CN116612466B (en) * 2023-07-20 2023-09-29 腾讯科技(深圳)有限公司 Content identification method, device, equipment and medium based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination