CN114529904A - Scene text recognition system based on consistency regular training - Google Patents

Scene text recognition system based on consistency regular training

Info

Publication number
CN114529904A
Authority
CN
China
Prior art keywords
data
model
branch
text recognition
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210061855.XA
Other languages
Chinese (zh)
Inventor
王鹏 (Wang Peng)
郑财源 (Zheng Caiyuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo Research Institute of Northwestern Polytechnical University
Original Assignee
Ningbo Research Institute of Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo Research Institute of Northwestern Polytechnical University filed Critical Ningbo Research Institute of Northwestern Polytechnical University
Priority to CN202210061855.XA priority Critical patent/CN114529904A/en
Publication of CN114529904A publication Critical patent/CN114529904A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a scene text recognition system based on consistency regularization training, and belongs to the field of scene text recognition. The system comprises three branches: a supervised branch, an unsupervised branch and a domain adaptation branch. The invention applies a consistency regularization method to train a more robust and better-performing STR model. Specifically, the STR model receives two enhanced views of an unlabeled text image as input and is forced to output the same result. Through this training mode, the model can learn transformation-invariant features from large-scale unlabeled data. The invention adds a projection module to one path of the unsupervised branch to prevent the model from collapsing. Given the large domain gap between training data and real test data, a domain adaptation loss is applied to reduce the distance between the character-level features of the synthetic labeled data and the real unlabeled data.

Description

Scene text recognition system based on consistency regular training
Technical Field
The invention belongs to the field of scene text recognition, and particularly relates to a scene text recognition system based on consistency regular training.
Background
Scene Text Recognition (STR), which is the recognition of text in natural scenes, is a special form of Optical Character Recognition (OCR).
In the past, hand-crafted features were used for scene text recognition, such as histogram of oriented gradients descriptors, connected components, and stroke width transforms. With the rapid development of deep learning, scene text recognition has made great progress in innovation, practicality, efficiency and other respects. Scene text recognition methods fall into two main categories: segmentation-based methods and segmentation-free methods. Segmentation-free methods can be further divided into Connectionist Temporal Classification (CTC) based methods and attention-based methods.
At present, regular text recognition has achieved good performance thanks to the application of convolutional neural networks and attention mechanisms. Irregular text recognition is more difficult than the regular task because of environmental interference, varied shapes and distorted patterns.
While deep learning has had great success in many computer vision tasks, including scene text recognition, it demands large amounts of data. In addition to real labeled data, synthetic labeled data is also widely used to train STR models. Synthetic data and real data each have advantages and disadvantages. Real data is manually annotated, but annotation is typically expensive, time consuming, and yields small datasets. Synthetic data generation is automatic and efficient, but it is difficult to design a good synthesis engine for different tasks, and there is always a domain gap between synthetic and real data.
Given that unlabeled data is far easier to collect in the real world, many researchers have attempted to improve the performance of deep models using unlabeled data. Semi-supervised methods combine additional unlabeled data with labeled data during training, the most common being self-training. Consistency regularization methods such as UDA (Unsupervised Data Augmentation) are generally considered more efficient and effective than self-training methods such as pseudo-labeling. To date, however, consistency regularization has not been successfully applied to the scene text recognition task: when applied directly to an STR model, the model collapses severely and performs poorly.
Disclosure of Invention
Technical problem to be solved
In order to solve the above problems, the invention provides an asymmetric unsupervised branch structure combined with a domain adaptation branch, thereby effectively improving the stability of model training and the final performance.
Technical scheme
A scene text recognition system based on consistency regularization training, characterized by comprising a supervised branch, an unsupervised branch and a domain adaptation branch; the supervised branch receives a labeled text image X_L as input and calculates the cross entropy loss between the predicted distribution P_L of the student model and the labeled text string Y_gt; the unsupervised branch converts an unlabeled image X_U into two enhanced views X_U1 and X_U2 by weak data enhancement and strong data enhancement respectively; for input image X_U1, the teacher model outputs a prediction distribution P_U1; for input image X_U2, the student model outputs a prediction distribution P_U2; the prediction distribution P_U1 of the weakly enhanced view X_U1 is then taken as the target, and a consistency regularization loss forces the output P_U2 of the strongly enhanced view X_U2 closer to P_U1; the domain adaptation branch calculates a CORAL loss to align the character-level features of the labeled data and the unlabeled data; finally, a weighted sum of the three losses is calculated and the model is updated with the back-propagation algorithm.
The student model and the teacher model have the same structure but different parameters, and both adopt a scene text recognition model.
The scene text recognition model comprises a correction module, a feature extraction module, a sequence modeling module and a prediction module.
In the unsupervised branch a projection module is added to the student model before the classification layer.
The projection module is a position-wise feed-forward (PFF) module.
Advantageous effects
The invention provides a scene text recognition system based on consistency regularization training. The invention applies a consistency regularization method to train a more robust and better-performing STR model. Specifically, the STR model receives two enhanced views of an unlabeled text image as input and is forced to output the same result. Through this training mode, the model can learn transformation-invariant features from large-scale unlabeled data. The invention adds a projection module to one path of the unsupervised branch to prevent the model from collapsing. Given the large domain gap between training data and real test data, a domain adaptation loss is applied to reduce the distance between the character-level features of the synthetic labeled data and the real unlabeled data.
The invention can effectively utilize real unlabeled data to improve the robustness and performance of the STR model. The semi-supervised model of the invention shows a great improvement over the supervised baseline. Without using manually annotated data, the invention outperforms current state-of-the-art methods on all scene text recognition test datasets.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a general block diagram of the system;
FIG. 2 is a TRBA flow diagram;
FIG. 3 is a diagram of the domain adaptation branch structure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in FIG. 1, the system of the invention consists of three branches, called the supervised branch, the unsupervised branch and the domain adaptation branch. The system comprises two scene text recognition models, a student model and a teacher model, which have the same structure but different parameters. The supervised branch receives a labeled text image X_L as input and calculates the cross entropy loss between the predicted distribution P_L of the student model and the labeled text string Y_gt. The unsupervised branch converts an unlabeled image X_U into two enhanced views X_U1 and X_U2 by weak data enhancement and strong data enhancement respectively. For input image X_U1, the teacher model outputs a prediction distribution P_U1; for input image X_U2, the student model outputs a prediction distribution P_U2. The prediction distribution P_U1 of the weakly enhanced view X_U1 is then taken as the target, and a consistency regularization loss forces the output P_U2 of the strongly enhanced view X_U2 to be closer to P_U1. To narrow the domain gap between the synthetic labeled data and the real unlabeled data, the invention calculates a CORAL loss to align the character-level features of the labeled data and the unlabeled data. Finally, a weighted sum of the three losses is calculated and the model is updated with the back-propagation algorithm.
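As an illustration of the three-branch computation just described, the following is a minimal PyTorch-style sketch of one training step. The model interface (return_features, use_projection), the augmentation helpers weak_aug/strong_aug, and the loss helpers consistency_loss, coral_loss and ema_update (sketched in the corresponding sections below) are assumptions for illustration, not the patent's literal implementation.

```python
import torch
import torch.nn.functional as F

def train_step(student, teacher, batch, optimizer, lambda_unl=1.0, lambda_da=0.01):
    """One training step over the three branches (illustrative sketch)."""
    x_l, y_gt = batch["labeled"]       # labeled text images, label indices (B, T)
    x_u = batch["unlabeled"]           # unlabeled text images

    # Supervised branch: cross entropy between student prediction and labels.
    p_l, f_l = student(x_l, return_features=True)   # logits (B, T, K), g_t features
    loss_sup = F.cross_entropy(p_l.flatten(0, 1), y_gt.flatten())

    # Unsupervised branch: weak view feeds the teacher, strong view the student.
    x_u1, x_u2 = weak_aug(x_u), strong_aug(x_u)
    with torch.no_grad():                            # teacher gradient is stopped
        p_u1, _ = teacher(x_u1, return_features=True)
    p_u2, f_u2 = student(x_u2, return_features=True, use_projection=True)
    loss_unl = consistency_loss(p_u1, p_u2)          # see the unsupervised branch

    # Domain adaptation branch: CORAL over character-level features g_t.
    loss_da = coral_loss(f_l.flatten(0, 1), f_u2.flatten(0, 1))

    loss = loss_sup + lambda_unl * loss_unl + lambda_da * loss_da
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)                     # EMA, not back-propagation
    return loss.item()
```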
Scene text recognition model. The scene text recognition model consists of a rectification module (thin-plate spline, TPS), a feature extraction module (convolutional residual network, ResNet), a sequence modeling module (bidirectional LSTM, BiLSTM) and a prediction module (attention mechanism), as shown in FIG. 2, and is called TRBA. The rectification module uses thin-plate splines (TPS) to convert the input image X into a normalized image X', mitigating the difficulty of the subsequent recognition task. For an input image of size 32 × 100, the feature map of the last convolutional layer of the feature extraction module has size C × 1 × T (C is the channel size and T is the maximum sequence length of the decoder). The sequence modeling module uses a bidirectional long short-term memory recurrent neural network (BiLSTM) to learn better sequence features H ∈ R^{T×C}. The prediction module uses attention-based sequence prediction. At the t-th decoding time step, the prediction module uses the encoder output H, the previous hidden state s_{t-1} and the embedding f(y_{t-1}) of the previous output character to predict the output of the current time step.
e_{t,i} = w^{\top} \tanh(W s_{t-1} + V h_i + b)
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{T} \exp(e_{t,k})}
where w, W and V are trainable parameters and b is a bias term. The attention weights \alpha_{t,i} represent the importance of the sequence features H at different time steps. Using the attention weights as coefficients, the decoder computes a weighted sum g_t of the feature vectors in H, which is treated as a character-level feature in the domain adaptation branch. The recurrent unit of the prediction module takes g_t as visual input and produces an output vector s_t:
g_t = \sum_{i=1}^{T} \alpha_{t,i} h_i
s_t = \mathrm{rnn}(s_{t-1}, (g_t, f(y_{t-1})))
After the classification layer, the TRBA model outputs a sequence prediction of size T × K, where K is the number of character classes.
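The decoding step above can be illustrated with a small PyTorch sketch; the GRU cell choice, the hidden sizes and the module names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AttentionDecoderCell(nn.Module):
    """One decoding time step of the attention prediction module (sketch)."""
    def __init__(self, hidden_size, num_classes):
        super().__init__()
        self.W = nn.Linear(hidden_size, hidden_size, bias=False)  # W s_{t-1}
        self.V = nn.Linear(hidden_size, hidden_size, bias=True)   # V h_i + b
        self.w = nn.Linear(hidden_size, 1, bias=False)            # w^T tanh(.)
        self.embed = nn.Embedding(num_classes, hidden_size)       # f(y_{t-1})
        self.rnn = nn.GRUCell(2 * hidden_size, hidden_size)
        self.cls = nn.Linear(hidden_size, num_classes)            # classification layer

    def forward(self, H, s_prev, y_prev):
        # H: (B, T, C) sequence features; s_prev: (B, C); y_prev: (B,)
        e = self.w(torch.tanh(self.W(s_prev).unsqueeze(1) + self.V(H)))  # (B, T, 1)
        alpha = torch.softmax(e, dim=1)              # attention weights alpha_{t,i}
        g = (alpha * H).sum(dim=1)                   # character-level feature g_t
        s = self.rnn(torch.cat([g, self.embed(y_prev)], dim=1), s_prev)
        return self.cls(s), s, g                     # logits, new state, g_t
```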
Supervised branch. The supervised branch takes a labeled image X_L as input and obtains the output P_L through the rectification, feature extraction, sequence modeling and prediction stages of the student model. The cross entropy loss between the student model's prediction and the labeled text string Y_gt is calculated:
L_{sup} = -\sum_{t} \log p\left(y^{gt}_{t} \mid X_L; \theta_{student}\right)
where \theta_{student} denotes the model parameters of the student model.
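A minimal sketch of this supervised loss, assuming logits of shape (B, T, K), targets of shape (B, T), and a hypothetical pad index for unused positions:

```python
import torch.nn.functional as F

def supervised_loss(logits, targets, pad_id=0):
    """Sequence cross entropy L_sup (sketch; pad_id is an assumption).
    logits: (B, T, K) student predictions; targets: (B, T) label indices."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and time dimensions
        targets.reshape(-1),
        ignore_index=pad_id,                  # skip padded positions
    )
```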
Unsupervised branch. The unsupervised branch requires only an unlabeled text image X_U; no corresponding annotation string is needed. According to the strength of the data enhancement strategy, the invention defines two different strategies in advance, namely weak (data) enhancement and strong (data) enhancement. X_U is converted to X_U1 by weak enhancement and to X_U2 by strong enhancement.
\tilde{P}_{U1} = \operatorname{sharpen}(P_{U1}, \tau)
m = \mathbb{I}\left(\max P_{U1} > \beta\right)
L_{unl} = m \cdot \operatorname{CE}\left(\tilde{P}_{U1}, P_{U2}\right)
where \mathbb{I} is the indicator function, \tau is the sharpening temperature and \beta is the confidence threshold. In practice, if the student model and the teacher model share the same structure and parameters, the unsupervised branch collapses severely and performs poorly. Therefore, the invention adds a projection module to the student model before the classification layer to break the symmetry. The invention uses the position-wise feed-forward (PFF) module proposed in the Transformer as the projection module. In addition, the invention updates the teacher model using an exponential moving average method to keep the projection module in an optimal state.
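The following sketch illustrates an asymmetric projection module and a UDA-style consistency loss consistent with the description above; the 4x hidden expansion in the PFF and the exact sharpening and masking form are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PFFProjection(nn.Module):
    """Position-wise feed-forward projection placed before the student's
    classification layer to break symmetry (sketch; the 4x expansion follows
    the Transformer convention and is an assumption here)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(inplace=True), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return self.net(x)

def consistency_loss(p_u1, p_u2, tau=0.4, beta=0.3):
    """UDA-style loss: the sharpened teacher distribution on the weak view is
    the target for the student on the strong view; low-confidence predictions
    are masked out. p_u1, p_u2 are logits of the same shape (..., K)."""
    probs = F.softmax(p_u1, dim=-1)
    mask = (probs.max(dim=-1).values > beta).float()           # I(max P_U1 > beta)
    target = F.softmax(p_u1 / tau, dim=-1)                     # temperature sharpening
    ce = -(target * F.log_softmax(p_u2, dim=-1)).sum(dim=-1)   # cross entropy
    return (mask * ce).sum() / mask.sum().clamp(min=1.0)
```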
Domain adaptation branch. As shown in FIG. 3, the invention uses a domain adaptation loss to narrow the domain gap between the synthetic labeled data and the real unlabeled data. Rather than aligning the global feature space between synthetic and real data, the invention adaptively focuses on aligning their distributions in the character-level feature space. The vector g_t, which automatically attends to local features at each decoding time step, is treated as the character-level feature. CORrelation ALignment (CORAL) is used to measure the distributional distance between the two sets of character-level features F_L^{clf} and F_{U2}^{clf}. The domain adaptation loss is defined as follows:
L_{da} = \frac{1}{4 d^{2}} \left\lVert \operatorname{cov}(U_L) - \operatorname{cov}(U_{U2}) \right\rVert_{F}^{2}
where U_L is the set of features F_L^{clf}, U_{U2} is the set of features F_{U2}^{clf}, \lVert \cdot \rVert_F^2 denotes the squared matrix Frobenius norm, cov(U) is the covariance matrix, and d is the dimension of the character-level features.
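A minimal PyTorch sketch of the CORAL loss over stacked character-level features g_t, following the formula above:

```python
import torch

def coral_loss(f_l, f_u2):
    """CORAL loss between character-level features of labeled and unlabeled
    data (sketch). f_l, f_u2: (N, d) stacked g_t vectors from each domain."""
    d = f_l.size(1)
    def cov(u):
        u = u - u.mean(dim=0, keepdim=True)       # center the features
        return (u.t() @ u) / (u.size(0) - 1)      # (d, d) covariance matrix
    diff = cov(f_l) - cov(f_u2)
    return (diff * diff).sum() / (4.0 * d * d)    # squared Frobenius norm / 4d^2
```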
Training process and inference process. During training, the final loss is defined as follows:
L = L_{sup} + \lambda_{unl} L_{unl} + \lambda_{da} L_{da}
where \lambda_{unl} and \lambda_{da} are hyperparameters. Adam is chosen as the optimizer, and the parameters of the student model are updated using back propagation. The gradient of the teacher model is stopped, and its parameters are updated using an exponential moving average (EMA) mechanism, defined as follows:
\theta_{teacher} = \alpha \theta_{teacher} + (1 - \alpha) \theta_{student}
where \alpha is a hyperparameter. After training, only the student model is saved and used to predict the content of text images in the inference stage. In the inference stage, the prediction module first takes the "BOS" character, which represents the start of decoding, as input and outputs the first character. Then, at each decoding time step, the prediction module iteratively uses the previous output character as input. Once the "EOS" character is output, prediction ends.
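The EMA update and the BOS/EOS inference loop described above can be sketched as follows; the student's encode/init_state/decode_step interface is an assumption for illustration.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    """theta_teacher <- alpha * theta_teacher + (1 - alpha) * theta_student."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

@torch.no_grad()
def greedy_decode(student, image, bos_id, eos_id, max_len=25):
    """Inference: feed BOS, then iteratively feed back the previous output
    character until EOS is produced (sketch; student API is an assumption)."""
    H = student.encode(image)                 # rectify, extract, model sequence
    y = torch.tensor([bos_id])                # start-of-decoding token
    s = student.init_state(H)
    out = []
    for _ in range(max_len):
        logits, s, _ = student.decode_step(H, s, y)
        y = logits.argmax(dim=-1)             # previous output becomes next input
        if y.item() == eos_id:
            break
        out.append(y.item())
    return out
```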
The training and test datasets used by the invention and the experimental setup of the system are described below.
(1) Data set
The invention trains the system on two synthetic labeled datasets, Synth90K and SynthText, for a total of approximately 14.5M samples. Approximately 10.7M real unlabeled images are also used. The method is evaluated on six scene text recognition benchmark test datasets. The synthetic labeled data and the test datasets are public, while the real unlabeled data is private.
The details of the six test datasets are as follows. IIIT5K-Words (IIIT5K) contains 3000 cropped word images for testing. Street View Text (SVT) consists of 647 word images collected from Google Street View; many images have low resolution or are very noisy and blurred. ICDAR 2013 (IC13) contains 1015 cropped word images, while IC13_857 is a subset of IC13 that discards word images shorter than 3 characters. SVT-Perspective (SVT-P) contains 645 cropped images for testing; most images are perspective-distorted because they are selected from side-view snapshots in Google Street View. CUTE80 contains 288 high-resolution images for testing, but some images are curved. ICDAR 2015 (IC15) contains 2077 cropped word images; IC15_1811 is a subset of IC15 that discards word images with non-alphanumeric characters.
The details of the real unlabeled dataset are as follows. The unlabeled data consists of about 10.7M word images cropped from three scene image datasets (Places2, OpenImages, and ImageNet ILSVRC 2012). The invention uses a scene text detector named CRAFT (Character Region Awareness For Text detection) to detect and crop text images. The text confidence threshold is set to 0.7, and text images of low resolution (width multiplied by height less than 1000) are discarded. No other post-processing is applied, and 10.7M text images are finally obtained.
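The filtering rule described above (text confidence threshold 0.7, minimum area of width × height ≥ 1000) can be expressed as a small helper; the function name is hypothetical:

```python
def keep_crop(text_score, width, height, conf_thresh=0.7, min_area=1000):
    """Filtering rule for cropped text regions: keep only confident detections
    whose area (width * height) is at least 1000 pixels (sketch)."""
    return text_score >= conf_thresh and width * height >= min_area
```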
(2) Data enhancement
Data enhancement plays an important role in the consistency regularization training of the invention. As described above, weak (data) enhancement and strong (data) enhancement are used for the unlabeled images in the unsupervised branch. Strong enhancement is also used when training the supervised branch, to enable a fair comparison with fully supervised training. The weak enhancement method changes the brightness, contrast, saturation, and hue of an image. The strong enhancement method is inherited from RandAugment; it contains channel transformations such as auto-contrast, brightness, color, equalization, posterization and solarize-add operations, and spatial transformations such as shearing, translation, rotation, etc.
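A sketch of the two augmentation strategies using torchvision; the exact jitter magnitudes and the RandAugment (num_ops, magnitude) settings are assumptions, since the text does not specify them.

```python
from torchvision import transforms

# Weak enhancement: photometric jitter only (brightness, contrast,
# saturation, hue), as described above; magnitudes are assumptions.
weak_aug = transforms.Compose([
    transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.1),
    transforms.Resize((32, 100)),
    transforms.ToTensor(),
])

# Strong enhancement: RandAugment, covering channel and spatial transforms;
# (num_ops, magnitude) are assumptions.
strong_aug = transforms.Compose([
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.Resize((32, 100)),
    transforms.ToTensor(),
])
```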
(3) Experimental setup
The model and system of the present invention are implemented using PyTorch.
In the preprocessing stage, the original images are converted into enhanced images through the different enhancement strategies and resized to the same 32 × 100 size. To balance the ratio of labeled and unlabeled data, the batch size N_L of labeled data is set to 384 and the batch size N_U of unlabeled data to 128.
The invention trains the system using the Adam optimizer. To speed up model convergence, the OneCycleLR learning rate scheduler is employed, with the maximum learning rate set to 0.001. The OneCycleLR scheduler first increases the learning rate from the base learning rate to the maximum learning rate and then decreases it gradually until the base learning rate is reached at the end of training. The temperature coefficient \tau is set to 0.4 and the confidence threshold \beta to 0.3. The exponential moving average decay rate is set to 0.999 to keep the projection layer in an optimal state. \lambda_{unl} and \lambda_{da} are set to 1.0 and 0.01, respectively. The system is trained on the combined labeled and unlabeled data for a total of 250000 iterations.
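Putting the setup together, a minimal sketch of the optimizer and scheduler configuration; loader and train_step refer to the earlier sketch and are assumptions:

```python
import torch

# Adam optimizer and OneCycleLR scheduler with the settings described above.
optimizer = torch.optim.Adam(student.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.001, total_steps=250000
)

for step in range(250000):
    # train_step combines the three branch losses with lambda_unl and lambda_da.
    loss = train_step(student, teacher, next(loader), optimizer,
                      lambda_unl=1.0, lambda_da=0.01)
    scheduler.step()
```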
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (5)

1. A scene text recognition system based on consistency regularization training, characterized by comprising a supervised branch, an unsupervised branch and a domain adaptation branch; the supervised branch receives a labeled text image X_L as input and calculates the cross entropy loss between the predicted distribution P_L of the student model and the labeled text string Y_gt; the unsupervised branch converts an unlabeled image X_U into two enhanced views X_U1 and X_U2 by weak data enhancement and strong data enhancement respectively; for input image X_U1, the teacher model outputs a prediction distribution P_U1; for input image X_U2, the student model outputs a prediction distribution P_U2; the prediction distribution P_U1 of the weakly enhanced view X_U1 is then taken as the target, and a consistency regularization loss forces the output P_U2 of the strongly enhanced view X_U2 closer to P_U1; the domain adaptation branch calculates a CORAL loss to align the character-level features of the labeled data and the unlabeled data; finally, a weighted sum of the three losses is calculated and the model is updated with the back-propagation algorithm.
2. The system of claim 1, wherein the student model and the teacher model have the same structure but different parameters, and both adopt a scene text recognition model.
3. The system according to claim 2, wherein the scene text recognition model comprises a rectification module, a feature extraction module, a sequence modeling module and a prediction module.
4. The system of claim 1, wherein a projection module is added to the student model before the classification layer in the unsupervised branch.
5. The system of claim 4, wherein the projection module is a position-wise feed-forward (PFF) module.
CN202210061855.XA 2022-01-19 2022-01-19 Scene text recognition system based on consistency regular training Pending CN114529904A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210061855.XA CN114529904A (en) 2022-01-19 2022-01-19 Scene text recognition system based on consistency regular training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210061855.XA CN114529904A (en) 2022-01-19 2022-01-19 Scene text recognition system based on consistency regular training

Publications (1)

Publication Number Publication Date
CN114529904A true CN114529904A (en) 2022-05-24

Family

ID=81621053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210061855.XA Pending CN114529904A (en) 2022-01-19 2022-01-19 Scene text recognition system based on consistency regular training

Country Status (1)

Country Link
CN (1) CN114529904A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082800A (en) * 2022-07-21 2022-09-20 阿里巴巴达摩院(杭州)科技有限公司 Image segmentation method
CN115082800B (en) * 2022-07-21 2022-11-15 阿里巴巴达摩院(杭州)科技有限公司 Image segmentation method

Similar Documents

Publication Publication Date Title
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN109754015B (en) Neural networks for drawing multi-label recognition and related methods, media and devices
CN110334705B (en) Language identification method of scene text image combining global and local information
CN114022432B (en) Insulator defect detection method based on improved yolov5
CN111027562B (en) Optical character recognition method based on multiscale CNN and RNN combined with attention mechanism
CN109949317A (en) Based on the semi-supervised image instance dividing method for gradually fighting study
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN111444367A (en) Image title generation method based on global and local attention mechanism
Wan et al. Generative adversarial multi-task learning for face sketch synthesis and recognition
CN114881092A (en) Signal modulation identification method based on feature fusion
CN115563327A (en) Zero sample cross-modal retrieval method based on Transformer network selective distillation
CN113065549A (en) Deep learning-based document information extraction method and device
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
CN116386104A (en) Self-supervision facial expression recognition method combining contrast learning and mask image modeling
CN114529904A (en) Scene text recognition system based on consistency regular training
CN112434686B (en) End-to-end misplaced text classification identifier for OCR (optical character) pictures
CN114596477A (en) Foggy day train fault detection method based on field self-adaption and attention mechanism
CN117437426A (en) Semi-supervised semantic segmentation method for high-density representative prototype guidance
CN111242114B (en) Character recognition method and device
CN111382871A (en) Domain generalization and domain self-adaptive learning method based on data expansion consistency
CN116939320A (en) Method for generating multimode mutually-friendly enhanced video semantic communication
CN113887504B (en) Strong-generalization remote sensing image target identification method
CN115410131A (en) Method for intelligently classifying short videos
CN112348007B (en) Optical character recognition method based on neural network
CN114220145A (en) Face detection model generation method and device and fake face detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination