CN112163508A - Character recognition method and system based on real scene and OCR terminal - Google Patents

Character recognition method and system based on real scene and OCR terminal

Info

Publication number
CN112163508A
CN112163508A
Authority
CN
China
Prior art keywords
model
training
character
densenet
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011023019.XA
Other languages
Chinese (zh)
Inventor
张昊博
杨军
王滨
周娜
乔彩丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 15 Research Institute
Priority to CN202011023019.XA
Publication of CN112163508A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/35 Categorising the entire scene, e.g. birthday party or wedding scene
    • G06V20/36 Indoor scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/28 Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet
    • G06V30/287 Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet of Kanji, Hiragana or Katakana characters

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a character recognition method and system based on a real scene, and an OCR terminal. The method comprises the following steps: acquiring picture data from a real office scene; carrying out binarization processing on the picture data; training a CTPN model with the binarized picture data; detecting character regions with the trained CTPN model; training a DenseNet + CTC model on the detected character regions; and performing character recognition with the trained DenseNet + CTC model. By acquiring picture data from a real office scene, constructing and training effective character detection and character recognition models, and building an OCR tool terminal around the trained deep learning models, the method lets users define recognition areas themselves, improving working efficiency and, at the same time, the recognition accuracy of the models.

Description

Character recognition method and system based on real scene and OCR terminal
Technical Field
The invention relates to the field of character recognition, and in particular to a character recognition method and system based on a real scene, and to an OCR terminal.
Background
With the rapid development of computers and information technology, image recognition, which simulates the way the human eye and the neurons of the brain process visual signals into signals the brain can understand, has gradually expanded into many fields, playing an increasingly important role in areas such as biometric recognition, image-text recognition and object recognition. In general, image recognition technology uses a computer to process pictures captured at the front end of a system according to a set target. In the field of artificial intelligence, neural networks are the technique most widely applied to image recognition: the vector or raster encoding of an image is converted into a feature vector that represents the characteristics of the object. A neural network model first simplifies the image and extracts its most important information, then organizes the data through feature extraction and classification, and finally decides, by classification, prediction or other algorithms, which class an image belongs to or how best to describe it. Neural network models support several major applications such as face recognition, image detection, image retrieval, target tracking and style transfer. Among these, face recognition, image classification and character recognition have achieved excellent results through long-term iterative development.
Optical Character Recognition (OCR) gave rise to many conventional algorithms before the advent of neural networks. Character recognition mainly comprises two parts, text detection and text recognition, and its accuracy has improved as deep neural networks have become widely applied. OpenCV is a computer vision toolkit that provides interfaces on all major platforms; it is dedicated to real-time applications in the real world and offers a wide application range and strong execution capability. Tesseract-OCR is a recognition engine originally developed by the HP laboratory and now maintained by Google; its character libraries cover almost all of the world's mainstream languages to date. OCR technology has been applied to various scenes in the office field. Overall, OCR technology can solve the common existing tasks, but for some specific needs (e.g. recognizing certain fixed areas in a picture) its character recognition is not yet perfect.
Disclosure of Invention
Based on this, the invention aims to provide a character recognition method and system based on a real scene, and an OCR terminal, so as to improve the level of character detection and the accuracy of character recognition.
In order to achieve the purpose, the invention provides the following scheme:
a character recognition method based on real scenes comprises the following steps:
acquiring picture data in a real office scene;
carrying out binarization processing on the picture data;
training a CTPN model through the picture data after binarization processing;
carrying out character region detection by using the trained CTPN model;
training a DenseNet + CTC model through the detected character area;
and performing character recognition by using the trained DenseNet + CTC model.
Further, the CTPN model includes a CNN model and an LSTM model.
Further, training the DenseNet + CTC model on the detected character regions specifically includes:
training a DenseNet model through the detected character area;
extracting the characteristics of the character area through a trained DenseNet model;
training a CTC model through the characteristics of the character region; the trained CTC model is used for character recognition.
The invention also provides a character recognition system based on the real scene, which comprises the following components:
the image data acquisition module is used for acquiring image data in a real office scene;
the processing module is used for carrying out binarization processing on the picture data;
the first training module is used for training a CTPN model through the picture data after binarization processing;
the character region detection module is used for detecting character regions by utilizing the trained CTPN model;
the second training module is used for training a DenseNet + CTC model through the detected character area;
and the character recognition module is used for carrying out character recognition by utilizing the trained DenseNet + CTC model.
Further, the CTPN model includes a CNN model and an LSTM model.
Further, the second training module specifically includes:
a first training unit for training a DenseNet model through the detected text region;
the feature extraction unit is used for extracting the features of the character region through the trained DenseNet model;
the second training unit is used for training the CTC model through the characteristics of the character area; the trained CTC model is used for character recognition.
The invention also provides an OCR terminal applying the above character recognition method based on a real scene, comprising:
the picture uploading module is used for uploading picture data in a real office scene;
and the area selection module is used for carrying out area selection on the picture data in the real office scene.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
according to the method, the picture data in the real office scene is acquired, an effective character detection and character recognition model is constructed for training, the trained deep learning model is used as a tool, an OCR tool terminal is built, a user can define a recognition area by himself, the working efficiency is improved, and meanwhile the recognition accuracy of the model is improved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flow chart of a text recognition method based on a real scene according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the CTPN model according to the embodiment of the present invention;
FIG. 3 is a schematic diagram of the operation of the bidirectional recurrent neural network (BiLSTM) according to the embodiment of the present invention;
FIG. 4 is a diagram of the DenseNet model structure according to an embodiment of the present invention;
FIG. 5 is a block diagram of a real scene-based text recognition system according to an embodiment of the present invention;
FIG. 6 is a flowchart of the OCR terminal.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a character recognition method and system based on a real scene, and an OCR terminal, to improve the level of character detection and the accuracy of character recognition.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, a method for recognizing a character based on a real scene includes:
step 101: and acquiring picture data in a real office scene.
Picture data generated in the office field, such as project publicity pictures, notification announcements and form bills, is acquired. In addition, because the method targets character recognition in real scenes, existing public data sets covering product descriptions, real street-view photographs and online advertisements are also used for model training.
Step 102: carrying out binarization processing on the picture data.
Area binarization is performed on the picture using the OpenCV toolkit. Binarizing an image means setting the gray value of every pixel to 0 or 255, giving the whole image an obvious black-and-white appearance. In digital image processing, lighting and other problems make the brightness of an image uneven; binarization greatly reduces the amount of data in the image, so that the contour of the target can be highlighted. Specifically, in the basic scheme a fixed threshold is set in advance and each pixel value of the picture is compared against it: pixels above the threshold are set to 255 and pixels below it are set to 0. Such a one-size-fits-all operation, however, inevitably produces errors on complex pictures. To improve on it, local binarization is used, which determines the binarization threshold at each pixel location from the distribution of pixel values in a neighborhood block around that pixel. The benefit is that the threshold at each pixel location is not fixed but determined by the distribution of its surrounding neighborhood pixels: image areas with higher brightness generally receive a higher binarization threshold, while darker areas receive a correspondingly lower one, so local regions of different brightness, contrast and texture each get an appropriate local threshold.
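As an illustration of the local binarization step, the following sketch uses OpenCV's threshold and adaptiveThreshold functions; the neighborhood block size and offset constant are illustrative assumptions, not values fixed by the method:

```python
import cv2

# Load the office-scene picture and convert it to single-channel
# grayscale, since thresholding compares pixel intensities.
img = cv2.imread("office_scene.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Global ("one-cut") binarization: one fixed threshold for the whole
# image, which produces errors under uneven lighting.
_, global_bin = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# Local binarization: the threshold at each pixel is derived from the
# mean of its neighborhood block, so brighter regions get higher
# thresholds and darker regions lower ones.
local_bin = cv2.adaptiveThreshold(
    gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY,
    blockSize=25, C=10)  # block size and offset chosen for illustration

cv2.imwrite("office_scene_bin.png", local_bin)
```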
Step 103: training a CTPN model with the binarized picture data. The CTPN model comprises a CNN model and an LSTM model.
The binarized images are fed into the deep learning model for training. CTPN combines CNN and LSTM deep networks and can effectively detect horizontally distributed text in complex scenes. It divides each text line into slices and presets several anchors of different scales to locate the position of the characters; the bidirectional LSTM layer it adopts, with its temporal characteristics, improves detection accuracy.
As shown in fig. 2, in the complete CTPN model the VGG16 network is first used to process the input picture. The VGG16 network is essentially a convolutional neural network consisting of 13 convolutional layers, 5 max pooling layers and 3 fully-connected layers. The picture content can be regarded as matrix data made up of three channels of pixel values. In the convolutional layers, two-dimensional convolution kernels w ∈ R^(3×3) extract features from the picture P to obtain the feature matrix C_n of the picture:

C_n = f( Σ_{i=1}^{m} w_i ⊗ C_{n-1} + b )

where n denotes the number of convolution operations, m the number of convolution kernels, i the index of the acquired feature matrix, f the nonlinear activation function, ⊗ the operation of the shared convolution-kernel weights on the corresponding feature matrix, w the weights of the convolution kernels, and b the bias value.
At the pooling layers, max pooling is used; the feature-extraction expression is:

p_u = Max_{2×2}[C_n]

where u denotes the number of pooling operations and Max_{2×2} denotes max pooling over 2 × 2 windows.
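In code, the two expressions above amount to a 3 × 3 convolution followed by the activation f and a 2 × 2 max pooling; a minimal PyTorch sketch (the channel counts are illustrative, not taken from the patent):

```python
import torch
import torch.nn as nn

# f( sum_i w_i ⊗ C_{n-1} + b ): m = 64 kernels of size 3x3 applied to a
# 3-channel input, followed by the nonlinear activation f (here ReLU).
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
act = nn.ReLU()

# p_u = Max_{2x2}[C_n]: max pooling over non-overlapping 2x2 windows.
pool = nn.MaxPool2d(kernel_size=2)

picture = torch.randn(1, 3, 32, 256)   # batch, channels, height, width
features = pool(act(conv(picture)))    # -> shape (1, 64, 16, 128)
print(features.shape)
```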
As shown in fig. 3, after several convolution and pooling operations, the Reshape-processed data stream is input into a bidirectional LSTM model to obtain feature vectors with a temporal attribute, expressed by the formulas:

s_t = f(U x_t + W s_{t-1})
s_t' = f(U' x_t + W' s_{t+1}')
o_t = g(V s_t + V' s_t')

where s_t denotes the output of the forward pass at time t and s_t' the output of the backward pass at time t; U x_t denotes the input of the forward pass and U' x_t the input of the backward pass; W s_{t-1} carries the forward state of the previous time step and W' s_{t+1}' carries the backward state of the next time step; and o_t denotes the output at time t. After feature extraction by this temporal model, the feature vectors, now carrying both spatial and sequential information, are input into the RPN network and passed through two branches. One branch covers text content of different heights across the whole image with a group of 10 anchors of fixed width 16 and varying heights, and classifies with a softmax activation into positive and negative to judge whether each anchor contains text; the other performs the bounding-box regression task on the anchors, computing offsets to obtain accurate proposals. Applying a loss function to the results of the two branches filters and position-shifts the anchors to determine the final proposals.
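The detection branch just described can be condensed into the following skeleton; this is a sketch under stated assumptions (a torchvision VGG16 backbone, a BiLSTM run along the width of each feature row, k = 10 anchors per position), not the complete CTPN implementation:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class CTPNSketch(nn.Module):
    """Simplified CTPN: VGG16 features -> BiLSTM -> per-anchor outputs."""
    def __init__(self, k=10, hidden=128):
        super().__init__()
        # VGG16 convolutional layers up to conv5 (512 output channels).
        self.backbone = vgg16(weights=None).features[:-1]
        self.rnn = nn.LSTM(512, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, 512)
        self.cls = nn.Linear(512, 2 * k)  # text / non-text score per anchor
        self.reg = nn.Linear(512, 2 * k)  # vertical center + height offsets

    def forward(self, x):
        feat = self.backbone(x)                      # (B, 512, H, W)
        b, c, h, w = feat.shape
        # Each feature-map row becomes a sequence along the width, so the
        # BiLSTM captures horizontal context between text slices.
        seq = feat.permute(0, 2, 3, 1).reshape(b * h, w, c)
        seq, _ = self.rnn(seq)
        seq = torch.relu(self.fc(seq))
        return self.cls(seq), self.reg(seq)          # softmax / regression heads

scores, offsets = CTPNSketch()(torch.randn(1, 3, 256, 512))
```

In training, a softmax cross-entropy loss on the classification head and a regression loss on the offset head would play the role of the loss function mentioned above.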
Step 104: detecting the character regions using the trained CTPN model.
Step 105: training the DenseNet + CTC model on the detected character regions. This specifically includes: training a DenseNet model on the detected character regions; extracting the features of the character regions with the trained DenseNet model; and training a CTC model on the features of the character regions. The trained CTC model is used for character recognition.
After the character positions in the images have been accurately located, the character images selected by the localization are sent, in the form of feature matrices, to the DenseNet + CTC model to train it to recognize the character content.
As shown in fig. 4, DenseNet is a convolutional deep neural network model that builds on the residual network and is composed of dense blocks and transition layers. The dense blocks define the connection relations between inputs and outputs, while the transition layers control the number of channels. DenseNet alleviates the discontinuous flow of information between different layers in earlier models by directly connecting all layers, on the premise of maximizing information transmission between the layers of the network. Expressed as a formula:
x_l = H_l([x_0, x_1, ..., x_{l-1}])
where x_l denotes the output of each layer and H_l denotes the combination of the ReLU activation function, a 3 × 3 convolution kernel and the regularized optimization applied at each layer. After multiple rounds of convolution operations, a reshape operation turns the two-dimensional feature matrix into the latent feature vectors of the image, which are fed into a CTC model. Given an input sequence, this model treats the network output as a probability distribution over all possible label sequences, solving the training problem for converting unsegmented sequence data:
LER(h, S) = (1/|S|) Σ_{(x,z)∈S} ED(h(x), z) / |z|
where S denotes a sample set whose individual samples are pairs (x, z). x denotes the original, unconverted sequence of the sample; it is a sequence of m-dimensional vectors, and the set X to which it belongs is called the input space. z denotes the converted sequence of the sample; it belongs to the target space L*, the set of sequences composed of a finite alphabet L, and the length of z must not exceed the length of x. CTC trains a mapping h from X to L*, and ED(h(x), z) denotes the edit distance between the predicted and target sequences; the smaller the label error rate (LER), the more accurate the mapping. The method builds on the DenseNet model and trains the network with the CTC loss, thereby achieving character recognition.
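A minimal sketch of the recognition stage, assuming torchvision's densenet121 as the stack of dense blocks and transition layers and PyTorch's built-in nn.CTCLoss; the alphabet size, input resolution and dummy targets are illustrative only:

```python
import torch
import torch.nn as nn
from torchvision.models import densenet121

NUM_CLASSES = 5000  # illustrative character-set size; index 0 = CTC blank

backbone = densenet121(weights=None).features  # dense blocks + transitions
proj = nn.Linear(1024, NUM_CLASSES)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

imgs = torch.randn(2, 3, 32, 256)       # cropped text-line images
feat = backbone(imgs)                   # (B, 1024, 1, 8)
feat = feat.mean(dim=2)                 # collapse height -> (B, 1024, T)
seq = feat.permute(2, 0, 1)             # (T, B, 1024), time-major for CTC

# Network output as a probability distribution over label sequences.
log_probs = proj(seq).log_softmax(dim=2)

targets = torch.randint(1, NUM_CLASSES, (2, 5))   # dummy label sequences z
input_lens = torch.full((2,), log_probs.size(0), dtype=torch.long)
target_lens = torch.full((2,), 5, dtype=torch.long)
loss = ctc(log_probs, targets, input_lens, target_lens)
loss.backward()  # trains the DenseNet features through the CTC loss
```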
Step 106: performing character recognition using the trained DenseNet + CTC model.
As shown in fig. 5, the present invention further provides a real scene-based text recognition system, which includes:
a picture data obtaining module 501, configured to obtain picture data in a real office scene.
A processing module 502, configured to perform binarization processing on the picture data.
The first training module 503 is configured to train a CTPN model through the binarized picture data. The CTPN model comprises a CNN model and an LSTM model.
And a text region detection module 504, configured to perform text region detection using the trained CTPN model.
And a second training module 505, configured to train the DenseNet + CTC model through the detected text region.
The second training module 505 specifically includes:
a first training unit for training a DenseNet model through the detected text region;
the feature extraction unit is used for extracting the features of the character region through the trained DenseNet model;
the second training unit is used for training the CTC model through the characteristics of the character area; the trained CTC model is used for character recognition.
And a character recognition module 506, configured to perform character recognition by using the trained DenseNet + CTC model.
The invention also provides an OCR terminal applying the character recognition method based on a real scene described above, comprising:
the picture uploading module is used for uploading picture data in a real office scene;
and the area selection module is used for carrying out area selection on the picture data in the real office scene.
The workflow of the OCR terminal is shown in fig. 6. Based on the character recognition method above and the actual needs of users, the terminal provides a character recognition tool with a customizable recognition area. The tool lets users upload pictures and offers a region-selection function built on a canvas plug-in, with which a user can select several areas of the picture for character recognition. This interactive design makes it easy for users to pinpoint exactly the characters to be recognized and also helps the model locate the character content accurately, improving both accuracy and working efficiency.
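On the server side, the selected regions only need to be cropped out of the uploaded picture before recognition; a sketch, assuming the canvas plug-in delivers the boxes as (x, y, width, height) pixel tuples and that recognize() is a hypothetical wrapper around the trained CTPN + DenseNet + CTC pipeline described above:

```python
import cv2

def recognize_regions(image_path, boxes, recognize):
    """Crop each user-selected box and run the OCR pipeline on it.

    boxes     -- list of (x, y, w, h) tuples from the canvas selection
    recognize -- callable wrapping the trained detection/recognition models
    """
    img = cv2.imread(image_path)
    results = []
    for (x, y, w, h) in boxes:
        crop = img[y:y + h, x:x + w]  # restrict recognition to this region
        results.append(recognize(crop))
    return results
```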
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (7)

1. A character recognition method based on real scenes is characterized by comprising the following steps:
acquiring picture data in a real office scene;
carrying out binarization processing on the picture data;
training a CTPN model through the picture data after binarization processing;
carrying out character region detection by using the trained CTPN model;
training a DenseNet + CTC model through the detected character area;
and performing character recognition by using the trained DenseNet + CTC model.
2. The method of real scene-based word recognition according to claim 1, wherein the CTPN model comprises a CNN model and an LSTM model.
3. The method for recognizing characters based on real scenes as claimed in claim 1, wherein said training of DenseNet + CTC model through the detected character region specifically comprises:
training a DenseNet model through the detected character area;
extracting the characteristics of the character area through a trained DenseNet model;
training a CTC model through the characteristics of the character region; the trained CTC model is used for character recognition.
4. A real scene based word recognition system, comprising:
the image data acquisition module is used for acquiring image data in a real office scene;
the processing module is used for carrying out binarization processing on the picture data;
the first training module is used for training a CTPN model through the picture data after binarization processing;
the character region detection module is used for detecting character regions by utilizing the trained CTPN model;
the second training module is used for training a DenseNet + CTC model through the detected character area;
and the character recognition module is used for carrying out character recognition by utilizing the trained DenseNet + CTC model.
5. The real scene-based word recognition system of claim 4, wherein the CTPN model comprises a CNN model and an LSTM model.
6. The real scene-based word recognition system of claim 4, wherein the second training module specifically comprises:
a first training unit for training a DenseNet model through the detected text region;
the feature extraction unit is used for extracting the features of the character region through the trained DenseNet model;
the second training unit is used for training the CTC model through the characteristics of the character area; the trained CTC model is used for character recognition.
7. An OCR terminal applying the real scene-based character recognition method according to any one of claims 1 to 3, comprising:
the picture uploading module is used for uploading picture data in a real office scene;
and the area selection module is used for carrying out area selection on the picture data in the real office scene.
CN202011023019.XA 2020-09-25 2020-09-25 Character recognition method and system based on real scene and OCR terminal Pending CN112163508A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011023019.XA CN112163508A (en) 2020-09-25 2020-09-25 Character recognition method and system based on real scene and OCR terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011023019.XA CN112163508A (en) 2020-09-25 2020-09-25 Character recognition method and system based on real scene and OCR terminal

Publications (1)

Publication Number Publication Date
CN112163508A 2021-01-01

Family

ID=73863850

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011023019.XA Pending CN112163508A (en) 2020-09-25 2020-09-25 Character recognition method and system based on real scene and OCR terminal

Country Status (1)

Country Link
CN (1) CN112163508A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710866A (en) * 2018-06-04 2018-10-26 平安科技(深圳)有限公司 Chinese mold training method, Chinese characters recognition method, device, equipment and medium
CN110110585A (en) * 2019-03-15 2019-08-09 西安电子科技大学 Intelligently reading realization method and system based on deep learning, computer program
CN110889402A (en) * 2019-11-04 2020-03-17 广州丰石科技有限公司 Business license content identification method and system based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
故沉: "[Learning and Understanding]: The CTPN Algorithm", https://blog.csdn.net/jesmine_gu/article/details/88524433 *
静悟生慧: "Text Detection: CTPN", https://www.cnblogs.com/allen-rg/p/9700095.html *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560866A (en) * 2021-02-25 2021-03-26 江苏东大集成电路系统工程技术有限公司 OCR recognition method based on background suppression
WO2023083280A1 (en) * 2021-11-12 2023-05-19 虹软科技股份有限公司 Scene text recognition method and device
CN114049641A (en) * 2022-01-13 2022-02-15 中国电子科技集团公司第十五研究所 Character recognition method and system based on deep learning
CN114049641B (en) * 2022-01-13 2022-03-15 中国电子科技集团公司第十五研究所 Character recognition method and system based on deep learning
GB2626249A (en) * 2023-01-11 2024-07-17 Skyworks Solutions Inc A circuit board processing system using local threshold value image analysis
CN116128717A (en) * 2023-04-17 2023-05-16 四川观想科技股份有限公司 Image style migration method based on neural network
CN116128717B (en) * 2023-04-17 2023-06-23 四川观想科技股份有限公司 Image style migration method based on neural network

Similar Documents

Publication Publication Date Title
CN110598610B (en) Target significance detection method based on neural selection attention
CN109241982B (en) Target detection method based on deep and shallow layer convolutional neural network
CN112163508A (en) Character recognition method and system based on real scene and OCR terminal
CN107256246B (en) printed fabric image retrieval method based on convolutional neural network
CN111401384B (en) Transformer equipment defect image matching method
CN110516536B (en) Weak supervision video behavior detection method based on time sequence class activation graph complementation
US20210118144A1 (en) Image processing method, electronic device, and storage medium
CN108062525B (en) Deep learning hand detection method based on hand region prediction
CN109800817B (en) Image classification method based on fusion semantic neural network
CN111626993A (en) Image automatic detection counting method and system based on embedded FEFnet network
CN111582095B (en) Light-weight rapid detection method for abnormal behaviors of pedestrians
CN110796018A (en) Hand motion recognition method based on depth image and color image
CN110969171A (en) Image classification model, method and application based on improved convolutional neural network
CN110956059B (en) Dynamic gesture recognition method and device and electronic equipment
CN112037239B (en) Text guidance image segmentation method based on multi-level explicit relation selection
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
Sun et al. IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes
CN114926826A (en) Scene text detection system
CN114120359A (en) Method for measuring body size of group-fed pigs based on stacked hourglass network
Zhao et al. Ocean ship detection and recognition algorithm based on aerial image
CN116912673A (en) Target detection method based on underwater optical image
Hu et al. Two-stage insulator self-explosion defect detection method based on Mask R-CNN
CN110633666A (en) Gesture track recognition method based on finger color patches
CN116469172A (en) Bone behavior recognition video frame extraction method and system under multiple time scales
CN110796650A (en) Image quality evaluation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination