CN114529904A - Scene text recognition system based on consistency regular training - Google Patents

Scene text recognition system based on consistency regular training

Info

Publication number
CN114529904A
Authority
CN
China
Prior art keywords
data
model
branch
text recognition
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210061855.XA
Other languages
Chinese (zh)
Inventor
王鹏 (Wang Peng)
郑财源 (Zheng Caiyuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo Research Institute of Northwestern Polytechnical University
Original Assignee
Ningbo Research Institute of Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo Research Institute of Northwestern Polytechnical University filed Critical Ningbo Research Institute of Northwestern Polytechnical University
Priority to CN202210061855.XA priority Critical patent/CN114529904A/en
Publication of CN114529904A publication Critical patent/CN114529904A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a scene text recognition system based on consistency regularization training, and belongs to the field of scene text recognition. The system comprises three branches: a supervised branch, an unsupervised branch and a domain adaptation branch. The invention applies a consistency regularization method to train a more robust and better-performing STR model. Specifically, the STR model receives two enhanced views of an unlabeled text image as input and is forced to output the same result. Through this training mode, the model can learn transformation-invariant features from large-scale unlabeled data. The invention adds a projection module to one path of the unsupervised branch to prevent the model from collapsing. Given the large domain gap between training data and real test data, a domain adaptation loss is applied to reduce the distance between the character-level features of the synthetic labeled data and the real unlabeled data.

Description

Scene text recognition system based on consistency regular training
Technical Field
The invention belongs to the field of scene text recognition, and particularly relates to a scene text recognition system based on consistency regular training.
Background
Scene Text Recognition (STR), which is the recognition of text in natural scenes, is a special form of Optical Character Recognition (OCR).
In the past, hand-crafted features were used for scene text recognition, such as histogram of oriented gradients descriptors, connected components, and stroke width transforms. With the rapid development of deep learning, scene text recognition has made great progress in innovation, practicality, efficiency and other respects. Scene text recognition methods fall into two main categories: segmentation-based methods and segmentation-free methods. Segmentation-free methods can be further divided into Connectionist Temporal Classification (CTC) based methods and attention-based methods.
At present, regular text recognition has achieved good performance thanks to the application of convolutional neural networks and attention mechanisms. Irregular text recognition is more difficult than the regular task because of environmental interference, varied shapes and distorted patterns.
While deep learning has had great success in many computer vision tasks, including scene text recognition, it demands large amounts of data. In addition to real labeled data, synthetic labeled data is also widely used to train STR models. Synthetic data and real data each have advantages and disadvantages. Real data is manually annotated, but annotation is typically expensive, time consuming, and yields small datasets. Synthetic data generation is automatic and efficient, but it is difficult to design a good synthesis engine for different tasks, and there is always a domain gap between synthetic and real data.
Given that unlabeled data is far easier to collect in the real world, many researchers have attempted to improve the performance of deep models using unlabeled data. Semi-supervised methods combine additional unlabeled data with labeled data during training, the most common being self-training. Consistency regularization methods such as UDA (Unsupervised Data Augmentation) are generally considered more efficient and effective than self-training methods such as pseudo-labeling. To date, however, consistency regularization has not been successfully applied to the scene text recognition task: when applied directly to an STR model, the model collapses severely and performs poorly.
Disclosure of Invention
Technical problem to be solved
In order to solve the above problems, the invention provides an asymmetric unsupervised branch structure combined with a domain adaptation branch, thereby effectively improving the stability of model training and the final performance.
Technical scheme
A scene text recognition system based on consistency regularization training, characterized by comprising a supervised branch, an unsupervised branch and a domain adaptation branch; the supervised branch receives a labeled text image X_L as input and calculates the cross entropy loss between the predicted distribution P_L of the student model and the labeled text string Y_gt; the unsupervised branch converts an unlabeled image X_U into two enhanced views X_U1 and X_U2 by weak data enhancement and strong data enhancement respectively; for input image X_U1, the teacher model outputs a prediction distribution P_U1; for input image X_U2, the student model outputs a prediction distribution P_U2; the prediction distribution P_U1 of the weakly enhanced view X_U1 is then taken as the target, and a consistency regularization loss forces the output P_U2 of the strongly enhanced view X_U2 closer to P_U1; the domain adaptation branch calculates a CORAL loss to align the character-level features of the labeled data and the unlabeled data; finally, a weighted sum of the three losses is calculated and the model is updated with the back-propagation algorithm.
The student model and the teacher model have the same structure but different parameters, and both adopt a scene text recognition model.
The scene text recognition model comprises a correction module, a feature extraction module, a sequence modeling module and a prediction module.
In the unsupervised branch a projection module is added to the student model before the classification layer.
The projection module is a position-wise feed-forward (PFF) module.
Advantageous effects
The invention provides a scene text recognition system based on consistency regularization training. The invention applies a consistency regularization method to train a more robust and better-performing STR model. Specifically, the STR model receives two enhanced views of an unlabeled text image as input and is forced to output the same result. Through this training mode, the model can learn transformation-invariant features from large-scale unlabeled data. The invention adds a projection module to one path of the unsupervised branch to prevent the model from collapsing. Given the large domain gap between training data and real test data, a domain adaptation loss is applied to reduce the distance between the character-level features of the synthetic labeled data and the real unlabeled data.
The invention can effectively utilize real unlabeled data to improve the robustness and performance of the STR model. The semi-supervised model of the invention shows a great improvement over the supervised baseline. Without using manually annotated data, the invention outperforms current state-of-the-art methods on all scene text recognition test datasets.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a general block diagram of the system;
FIG. 2 is a TRBA flow diagram;
FIG. 3 is a diagram of the domain adaptation branch structure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in FIG. 1, the system of the invention consists of three branches, called the supervised branch, the unsupervised branch and the domain adaptation branch. The system comprises two scene text recognition models, a student model and a teacher model, which have the same structure but different parameters. The supervised branch receives a labeled text image X_L as input and calculates the cross entropy loss between the predicted distribution P_L of the student model and the labeled text string Y_gt. The unsupervised branch converts an unlabeled image X_U into two enhanced views X_U1 and X_U2 by weak data enhancement and strong data enhancement respectively. For input image X_U1, the teacher model outputs a prediction distribution P_U1; for input image X_U2, the student model outputs a prediction distribution P_U2. The prediction distribution P_U1 of the weakly enhanced view X_U1 is then taken as the target, and a consistency regularization loss forces the output P_U2 of the strongly enhanced view X_U2 to be closer to P_U1. To narrow the domain gap between the synthetic labeled data and the real unlabeled data, the invention calculates a CORAL loss to align the character-level features of the labeled data and the unlabeled data. Finally, a weighted sum of the three losses is calculated and the model is updated with the back-propagation algorithm.
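As an illustration of the three-branch computation just described, the following is a minimal PyTorch-style sketch of one training step. The model interface (return_features, use_projection), the augmentation helpers weak_aug/strong_aug, and the loss helpers consistency_loss, coral_loss and ema_update (sketched in the corresponding sections below) are assumptions for illustration, not the patent's literal implementation.

```python
import torch
import torch.nn.functional as F

def train_step(student, teacher, batch, optimizer, lambda_unl=1.0, lambda_da=0.01):
    """One training step over the three branches (illustrative sketch)."""
    x_l, y_gt = batch["labeled"]       # labeled text images, label indices (B, T)
    x_u = batch["unlabeled"]           # unlabeled text images

    # Supervised branch: cross entropy between student prediction and labels.
    p_l, f_l = student(x_l, return_features=True)   # logits (B, T, K), g_t features
    loss_sup = F.cross_entropy(p_l.flatten(0, 1), y_gt.flatten())

    # Unsupervised branch: weak view feeds the teacher, strong view the student.
    x_u1, x_u2 = weak_aug(x_u), strong_aug(x_u)
    with torch.no_grad():                            # teacher gradient is stopped
        p_u1, _ = teacher(x_u1, return_features=True)
    p_u2, f_u2 = student(x_u2, return_features=True, use_projection=True)
    loss_unl = consistency_loss(p_u1, p_u2)          # see the unsupervised branch

    # Domain adaptation branch: CORAL over character-level features g_t.
    loss_da = coral_loss(f_l.flatten(0, 1), f_u2.flatten(0, 1))

    loss = loss_sup + lambda_unl * loss_unl + lambda_da * loss_da
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)                     # EMA, not back-propagation
    return loss.item()
```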
Scene text recognition model. The scene text recognition model consists of a rectification module (thin-plate spline, TPS), a feature extraction module (convolutional residual network, ResNet), a sequence modeling module (bidirectional LSTM, BiLSTM) and a prediction module (attention mechanism), as shown in FIG. 2, and is called TRBA. The rectification module uses thin-plate splines (TPS) to convert the input image X into a normalized image X', mitigating the difficulty of the subsequent recognition task. For an input image of size 32 × 100, the feature map of the last convolutional layer of the feature extraction module has size C × 1 × T (C is the channel size and T is the maximum sequence length of the decoder). The sequence modeling module uses a bidirectional long short-term memory recurrent neural network (BiLSTM) to learn better sequence features H ∈ R^{T×C}. The prediction module uses attention-based sequence prediction. At the t-th decoding time step, the prediction module uses the encoder output H, the previous hidden state s_{t-1} and the embedding f(y_{t-1}) of the previous output character to predict the output of the current time step.
e_{t,i} = w^{\top} \tanh(W s_{t-1} + V h_i + b)
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{T} \exp(e_{t,k})}
where w, W and V are trainable parameters and b is a bias term. The attention weights \alpha_{t,i} represent the importance of the sequence features H at different time steps. Using the attention weights as coefficients, the decoder computes a weighted sum g_t of the feature vectors in H, which is treated as a character-level feature in the domain adaptation branch. The recurrent unit of the prediction module takes g_t as visual input and produces an output vector s_t:
g_t = \sum_{i=1}^{T} \alpha_{t,i} h_i
s_t = \mathrm{rnn}(s_{t-1}, (g_t, f(y_{t-1})))
After the classification layer, the TRBA model outputs a sequence prediction of size T × K, where K is the number of character classes.
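The decoding step above can be illustrated with a small PyTorch sketch; the GRU cell choice, the hidden sizes and the module names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AttentionDecoderCell(nn.Module):
    """One decoding time step of the attention prediction module (sketch)."""
    def __init__(self, hidden_size, num_classes):
        super().__init__()
        self.W = nn.Linear(hidden_size, hidden_size, bias=False)  # W s_{t-1}
        self.V = nn.Linear(hidden_size, hidden_size, bias=True)   # V h_i + b
        self.w = nn.Linear(hidden_size, 1, bias=False)            # w^T tanh(.)
        self.embed = nn.Embedding(num_classes, hidden_size)       # f(y_{t-1})
        self.rnn = nn.GRUCell(2 * hidden_size, hidden_size)
        self.cls = nn.Linear(hidden_size, num_classes)            # classification layer

    def forward(self, H, s_prev, y_prev):
        # H: (B, T, C) sequence features; s_prev: (B, C); y_prev: (B,)
        e = self.w(torch.tanh(self.W(s_prev).unsqueeze(1) + self.V(H)))  # (B, T, 1)
        alpha = torch.softmax(e, dim=1)              # attention weights alpha_{t,i}
        g = (alpha * H).sum(dim=1)                   # character-level feature g_t
        s = self.rnn(torch.cat([g, self.embed(y_prev)], dim=1), s_prev)
        return self.cls(s), s, g                     # logits, new state, g_t
```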
Supervised branch. The supervised branch takes a labeled image X_L as input and obtains the output P_L through the rectification, feature extraction, sequence modeling and prediction stages of the student model. The cross entropy loss between the student model's prediction and the labeled text string Y_gt is calculated:
L_{sup} = -\sum_{t} \log p\left(y^{gt}_{t} \mid X_L; \theta_{student}\right)
where \theta_{student} denotes the model parameters of the student model.
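A minimal sketch of this supervised loss, assuming logits of shape (B, T, K), targets of shape (B, T), and a hypothetical pad index for unused positions:

```python
import torch.nn.functional as F

def supervised_loss(logits, targets, pad_id=0):
    """Sequence cross entropy L_sup (sketch; pad_id is an assumption).
    logits: (B, T, K) student predictions; targets: (B, T) label indices."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and time dimensions
        targets.reshape(-1),
        ignore_index=pad_id,                  # skip padded positions
    )
```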
Unsupervised branch. The unsupervised branch requires only an unlabeled text image X_U; no corresponding annotation string is needed. According to the strength of the data enhancement strategy, the invention defines two different strategies in advance, namely weak (data) enhancement and strong (data) enhancement. X_U is converted to X_U1 by weak enhancement and to X_U2 by strong enhancement.
\tilde{P}_{U1} = \operatorname{sharpen}(P_{U1}, \tau)
m = \mathbb{I}\left(\max P_{U1} > \beta\right)
L_{unl} = m \cdot \operatorname{CE}\left(\tilde{P}_{U1}, P_{U2}\right)
where \mathbb{I} is the indicator function, \tau is the sharpening temperature and \beta is the confidence threshold. In practice, if the student model and the teacher model share the same structure and parameters, the unsupervised branch collapses severely and performs poorly. Therefore, the invention adds a projection module to the student model before the classification layer to break the symmetry. The invention uses the position-wise feed-forward (PFF) module proposed in the Transformer as the projection module. In addition, the invention updates the teacher model using an exponential moving average method to keep the projection module in an optimal state.
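The following sketch illustrates an asymmetric projection module and a UDA-style consistency loss consistent with the description above; the 4x hidden expansion in the PFF and the exact sharpening and masking form are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PFFProjection(nn.Module):
    """Position-wise feed-forward projection placed before the student's
    classification layer to break symmetry (sketch; the 4x expansion follows
    the Transformer convention and is an assumption here)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(inplace=True), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return self.net(x)

def consistency_loss(p_u1, p_u2, tau=0.4, beta=0.3):
    """UDA-style loss: the sharpened teacher distribution on the weak view is
    the target for the student on the strong view; low-confidence predictions
    are masked out. p_u1, p_u2 are logits of the same shape (..., K)."""
    probs = F.softmax(p_u1, dim=-1)
    mask = (probs.max(dim=-1).values > beta).float()           # I(max P_U1 > beta)
    target = F.softmax(p_u1 / tau, dim=-1)                     # temperature sharpening
    ce = -(target * F.log_softmax(p_u2, dim=-1)).sum(dim=-1)   # cross entropy
    return (mask * ce).sum() / mask.sum().clamp(min=1.0)
```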
Domain adaptation branch. As shown in FIG. 3, the invention uses a domain adaptation loss to narrow the domain gap between the synthetic labeled data and the real unlabeled data. Rather than aligning the global feature space between synthetic and real data, the invention adaptively focuses on aligning their distributions in the character-level feature space. The vector g_t, which automatically attends to local features at each decoding time step, is treated as the character-level feature. CORrelation ALignment (CORAL) is used to measure the distributional distance between the two sets of character-level features F_L^{clf} and F_{U2}^{clf}. The domain adaptation loss is defined as follows:
L_{da} = \frac{1}{4 d^{2}} \left\lVert \operatorname{cov}(U_L) - \operatorname{cov}(U_{U2}) \right\rVert_{F}^{2}
where U_L is the set of features F_L^{clf}, U_{U2} is the set of features F_{U2}^{clf}, \lVert \cdot \rVert_F^2 denotes the squared matrix Frobenius norm, cov(U) is the covariance matrix, and d is the dimension of the character-level features.
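A minimal PyTorch sketch of the CORAL loss over stacked character-level features g_t, following the formula above:

```python
import torch

def coral_loss(f_l, f_u2):
    """CORAL loss between character-level features of labeled and unlabeled
    data (sketch). f_l, f_u2: (N, d) stacked g_t vectors from each domain."""
    d = f_l.size(1)
    def cov(u):
        u = u - u.mean(dim=0, keepdim=True)       # center the features
        return (u.t() @ u) / (u.size(0) - 1)      # (d, d) covariance matrix
    diff = cov(f_l) - cov(f_u2)
    return (diff * diff).sum() / (4.0 * d * d)    # squared Frobenius norm / 4d^2
```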
Training process and inference process. During training, the final loss is defined as follows:
L = L_{sup} + \lambda_{unl} L_{unl} + \lambda_{da} L_{da}
where \lambda_{unl} and \lambda_{da} are hyperparameters. Adam is chosen as the optimizer, and the parameters of the student model are updated using back propagation. The gradient of the teacher model is stopped, and its parameters are updated using an exponential moving average (EMA) mechanism, defined as follows:
\theta_{teacher} = \alpha \theta_{teacher} + (1 - \alpha) \theta_{student}
where \alpha is a hyperparameter. After training, only the student model is saved and used to predict the content of text images in the inference stage. In the inference stage, the prediction module first takes the "BOS" character, which represents the start of decoding, as input and outputs the first character. Then, at each decoding time step, the prediction module iteratively uses the previous output character as input. Once the "EOS" character is output, prediction ends.
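The EMA update and the BOS/EOS inference loop described above can be sketched as follows; the student's encode/init_state/decode_step interface is an assumption for illustration.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    """theta_teacher <- alpha * theta_teacher + (1 - alpha) * theta_student."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

@torch.no_grad()
def greedy_decode(student, image, bos_id, eos_id, max_len=25):
    """Inference: feed BOS, then iteratively feed back the previous output
    character until EOS is produced (sketch; student API is an assumption)."""
    H = student.encode(image)                 # rectify, extract, model sequence
    y = torch.tensor([bos_id])                # start-of-decoding token
    s = student.init_state(H)
    out = []
    for _ in range(max_len):
        logits, s, _ = student.decode_step(H, s, y)
        y = logits.argmax(dim=-1)             # previous output becomes next input
        if y.item() == eos_id:
            break
        out.append(y.item())
    return out
```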
The training and test datasets used by the invention and the experimental setup of the system are described below.
(1) Data set
The invention trains the system on two synthetic labeled datasets, Synth90K and SynthText, for a total of approximately 14.5M samples. Approximately 10.7M real unlabeled images are also used. The method is evaluated on six scene text recognition benchmark test datasets. The synthetic labeled data and the test datasets are public, while the real unlabeled data is private.
The details of the six test datasets are as follows. IIIT5K-Words (IIIT5K) contains 3000 cropped word images for testing. Street View Text (SVT) consists of 647 word images collected from Google Street View; many images have low resolution or are very noisy and blurred. ICDAR 2013 (IC13) contains 1015 cropped word images, while IC13_857 is a subset of IC13 that discards word images shorter than 3 characters. SVT-Perspective (SVT-P) contains 645 cropped images for testing; most images are perspective-distorted because they are selected from side-view snapshots in Google Street View. CUTE80 contains 288 high-resolution images for testing, but some images are curved. ICDAR 2015 (IC15) contains 2077 cropped word images; IC15_1811 is a subset of IC15 that discards word images with non-alphanumeric characters.
The details of the real unlabeled dataset are as follows. The unlabeled data consists of about 10.7M word images cropped from three scene image datasets (Places2, OpenImages, and ImageNet ILSVRC 2012). The invention uses a scene text detector named CRAFT (Character Region Awareness For Text detection) to detect and crop text images. The text confidence threshold is set to 0.7, and text images of low resolution (width multiplied by height less than 1000) are discarded. No other post-processing is applied, and 10.7M text images are finally obtained.
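The filtering rule described above (text confidence threshold 0.7, minimum area of width × height ≥ 1000) can be expressed as a small helper; the function name is hypothetical:

```python
def keep_crop(text_score, width, height, conf_thresh=0.7, min_area=1000):
    """Filtering rule for cropped text regions: keep only confident detections
    whose area (width * height) is at least 1000 pixels (sketch)."""
    return text_score >= conf_thresh and width * height >= min_area
```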
(2) Data enhancement
Data enhancement plays an important role in the consistency regularization training of the invention. As described above, weak (data) enhancement and strong (data) enhancement are used for the unlabeled images in the unsupervised branch. Strong enhancement is also used when training the supervised branch, to enable a fair comparison with fully supervised training. The weak enhancement method changes the brightness, contrast, saturation, and hue of an image. The strong enhancement method is inherited from RandAugment; it contains channel transformations such as auto-contrast, brightness, color, equalization, posterization and solarize-add operations, and spatial transformations such as shearing, translation, rotation, etc.
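A sketch of the two augmentation strategies using torchvision; the exact jitter magnitudes and the RandAugment (num_ops, magnitude) settings are assumptions, since the text does not specify them.

```python
from torchvision import transforms

# Weak enhancement: photometric jitter only (brightness, contrast,
# saturation, hue), as described above; magnitudes are assumptions.
weak_aug = transforms.Compose([
    transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.1),
    transforms.Resize((32, 100)),
    transforms.ToTensor(),
])

# Strong enhancement: RandAugment, covering channel and spatial transforms;
# (num_ops, magnitude) are assumptions.
strong_aug = transforms.Compose([
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.Resize((32, 100)),
    transforms.ToTensor(),
])
```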
(3) Experimental setup
The model and system of the present invention are implemented using PyTorch.
In the preprocessing stage, the original images are converted into enhanced images through the different enhancement strategies and resized to the same 32 × 100 size. To balance the ratio of labeled and unlabeled data, the batch size N_L of labeled data is set to 384 and the batch size N_U of unlabeled data to 128.
The invention trains the system using the Adam optimizer. To speed up model convergence, the OneCycleLR learning rate scheduler is employed, with the maximum learning rate set to 0.001. The OneCycleLR scheduler first increases the learning rate from the base learning rate to the maximum learning rate and then decreases it gradually until the base learning rate is reached at the end of training. The temperature coefficient \tau is set to 0.4 and the confidence threshold \beta to 0.3. The exponential moving average decay rate is set to 0.999 to keep the projection layer in an optimal state. \lambda_{unl} and \lambda_{da} are set to 1.0 and 0.01, respectively. The system is trained on the combined labeled and unlabeled data for a total of 250000 iterations.
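Putting the setup together, a minimal sketch of the optimizer and scheduler configuration; loader and train_step refer to the earlier sketch and are assumptions:

```python
import torch

# Adam optimizer and OneCycleLR scheduler with the settings described above.
optimizer = torch.optim.Adam(student.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.001, total_steps=250000
)

for step in range(250000):
    # train_step combines the three branch losses with lambda_unl and lambda_da.
    loss = train_step(student, teacher, next(loader), optimizer,
                      lambda_unl=1.0, lambda_da=0.01)
    scheduler.step()
```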
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (5)

1. A scene text recognition system based on consistency regularization training, characterized by comprising a supervised branch, an unsupervised branch and a domain adaptation branch; the supervised branch receives a labeled text image X_L as input and calculates the cross entropy loss between the predicted distribution P_L of the student model and the labeled text string Y_gt; the unsupervised branch converts an unlabeled image X_U into two enhanced views X_U1 and X_U2 by weak data enhancement and strong data enhancement respectively; for input image X_U1, the teacher model outputs a prediction distribution P_U1; for input image X_U2, the student model outputs a prediction distribution P_U2; the prediction distribution P_U1 of the weakly enhanced view X_U1 is then taken as the target, and a consistency regularization loss forces the output P_U2 of the strongly enhanced view X_U2 closer to P_U1; the domain adaptation branch calculates a CORAL loss to align the character-level features of the labeled data and the unlabeled data; finally, a weighted sum of the three losses is calculated and the model is updated with the back-propagation algorithm.
2. The system of claim 1, wherein the student model and the teacher model have the same structure but different parameters, and both adopt a scene text recognition model.
3. The system according to claim 2, wherein the scene text recognition model comprises a rectification module, a feature extraction module, a sequence modeling module and a prediction module.
4. The system of claim 1, wherein a projection module is added to the student model before the classification layer in the unsupervised branch.
5. The system of claim 4, wherein the projection module is a position-wise feed-forward (PFF) module.
CN202210061855.XA 2022-01-19 2022-01-19 Scene text recognition system based on consistency regular training Pending CN114529904A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210061855.XA CN114529904A (en) 2022-01-19 2022-01-19 Scene text recognition system based on consistency regular training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210061855.XA CN114529904A (en) 2022-01-19 2022-01-19 Scene text recognition system based on consistency regular training

Publications (1)

Publication Number Publication Date
CN114529904A true CN114529904A (en) 2022-05-24

Family

ID=81621053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210061855.XA Pending CN114529904A (en) 2022-01-19 2022-01-19 Scene text recognition system based on consistency regular training

Country Status (1)

Country Link
CN (1) CN114529904A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082800A (en) * 2022-07-21 2022-09-20 阿里巴巴达摩院(杭州)科技有限公司 Image segmentation method
CN115082800B (en) * 2022-07-21 2022-11-15 阿里巴巴达摩院(杭州)科技有限公司 Image segmentation method

Similar Documents

Publication Publication Date Title
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN109754015B (en) Neural networks for drawing multi-label recognition and related methods, media and devices
CN110334705B (en) Language identification method of scene text image combining global and local information
CN114022432B (en) Insulator defect detection method based on improved yolov5
CN111027562B (en) Optical character recognition method based on multiscale CNN and RNN combined with attention mechanism
CN109949317A (en) Based on the semi-supervised image instance dividing method for gradually fighting study
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN111444367A (en) Image title generation method based on global and local attention mechanism
Wan et al. Generative adversarial multi-task learning for face sketch synthesis and recognition
CN114881092A (en) Signal modulation identification method based on feature fusion
CN115563327A (en) Zero sample cross-modal retrieval method based on Transformer network selective distillation
CN113065549A (en) Deep learning-based document information extraction method and device
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
CN116386104A (en) Self-supervision facial expression recognition method combining contrast learning and mask image modeling
CN114529904A (en) Scene text recognition system based on consistency regular training
CN112434686B (en) End-to-end misplaced text classification identifier for OCR (optical character) pictures
CN114596477A (en) Foggy day train fault detection method based on field self-adaption and attention mechanism
CN117437426A (en) Semi-supervised semantic segmentation method for high-density representative prototype guidance
CN111242114B (en) Character recognition method and device
CN111382871A (en) Domain generalization and domain self-adaptive learning method based on data expansion consistency
CN116939320A (en) Method for generating multimode mutually-friendly enhanced video semantic communication
CN113887504B (en) Strong-generalization remote sensing image target identification method
CN115410131A (en) Method for intelligently classifying short videos
CN112348007B (en) Optical character recognition method based on neural network
CN114220145A (en) Face detection model generation method and device and fake face detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination