CN114529904A - Scene text recognition system based on consistency regular training
- Publication number
- CN114529904A (application CN202210061855.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- model
- branch
- text recognition
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a scene text recognition system based on consistency regularization training and belongs to the field of scene text recognition. The system comprises three branches: a supervision branch, an unsupervised branch and a domain adaptation branch. The invention applies a consistency regularization method to train a more robust and better-performing scene text recognition (STR) model. Specifically, the STR model receives two enhanced views of an unlabeled text image as input and is forced to output the same result for both. Trained in this way, the model can learn transformation-invariant features from large-scale unlabeled data. A projection module is added to one path of the unsupervised branch to prevent the model from collapsing. Because there is a large domain gap between the training data and real test data, a domain adaptation loss is applied to reduce the distance between the character-level features of the synthetic labeled data and those of the real unlabeled data.
Description
Technical Field
The invention belongs to the field of scene text recognition and in particular relates to a scene text recognition system based on consistency regularization training.
Background
Scene Text Recognition (STR), which is the recognition of text in natural scenes, is a special form of Optical Character Recognition (OCR).
In the past, hand-crafted features such as histogram of oriented gradients (HOG) descriptors, connected components and stroke width transforms were used for scene text recognition. With the rapid development of deep learning, scene text recognition has made great progress in innovation, practicality and efficiency. Scene text recognition methods fall into two main categories: segmentation-based methods and segmentation-free methods. Segmentation-free methods can be further divided into Connectionist Temporal Classification (CTC) based methods and attention-based methods.
At present, methods for recognizing regular text achieve good performance thanks to convolutional neural networks and attention mechanisms. Irregular text recognition is more difficult than regular text recognition because of environmental interference, varied shapes and distorted patterns.
Although deep learning has had great success in many computer vision tasks, including scene text recognition, it requires large amounts of data. In addition to real labeled data, synthetic labeled data is widely used to train STR models. Synthetic and real data each have advantages and disadvantages. Real data is manually annotated, which makes it expensive and time-consuming to produce, so real datasets are typically small. Synthetic data is generated automatically and efficiently, but it is difficult to design a good synthetic data engine for different tasks, and there is always a domain gap between synthetic data and real data.
Because unlabeled data is much easier to collect in the real world, many researchers have attempted to use it to improve the performance of deep models. Semi-supervised methods combine additional unlabeled data with labeled data during training, the most common being self-training methods. Consistency regularization methods such as UDA (Unsupervised Data Augmentation) are generally considered more efficient and effective than self-training methods such as pseudo-labeling. However, as far as is known, consistency regularization has not yet been applied successfully to scene text recognition: when it is applied directly to an STR model, the model collapses severely and performs poorly.
Disclosure of Invention
Technical problem to be solved
To solve the above problems, the invention provides an asymmetric unsupervised branch structure combined with a domain adaptation branch, which effectively improves the stability of model training and the final performance.
Technical scheme
A scene text recognition system based on consistency regularization training, characterized by comprising a supervision branch, an unsupervised branch and a domain adaptation branch; the supervision branch receives a text image $X_L$ as input and calculates the cross-entropy loss between the predicted distribution $P_L$ of the student model and the label text string $Y_{gt}$; the unsupervised branch converts an unlabeled image $X_U$ into two enhanced views $X_{U1}$ and $X_{U2}$ by weak data enhancement and strong data enhancement, respectively; for the input image $X_{U1}$, the teacher model outputs a predicted distribution $P_{U1}$; for the input image $X_{U2}$, the student model outputs a predicted distribution $P_{U2}$; the predicted distribution $P_{U1}$ of the weakly enhanced view $X_{U1}$ is then taken as the target, and a consistency regularization loss forces the output $P_{U2}$ of the strongly enhanced view $X_{U2}$ to be closer to $P_{U1}$; the domain adaptation branch calculates a CORAL loss to align the character-level features of the labeled data and the unlabeled data; finally, a weighted sum of the three losses is calculated and the model is updated with the back-propagation algorithm.
The student model and the teacher model have the same structure but different parameters, and both adopt a scene text recognition model.
The scene text recognition model comprises a correction module, a feature extraction module, a sequence modeling module and a prediction module.
In the unsupervised branch a projection module is added to the student model before the classification layer.
The projection module is a position-wise feed-forward (PFF) module.
Advantageous effects
The invention provides a scene text recognition system based on consistency regularization training. The invention applies a consistency regularization method to train a more robust and better-performing STR model. Specifically, the STR model receives two enhanced views of an unlabeled text image as input and is forced to output the same result for both. Trained in this way, the model can learn transformation-invariant features from large-scale unlabeled data. A projection module is added to one path of the unsupervised branch to prevent the model from collapsing. Because there is a large domain gap between the training data and real test data, a domain adaptation loss is applied to reduce the distance between the character-level features of the synthetic labeled data and those of the real unlabeled data.
The invention can effectively use real unlabeled data to improve the robustness and performance of the STR model. The semi-supervised model of the invention shows a large improvement over the supervised baseline. Without using any manually annotated data, the invention can outperform current state-of-the-art methods on all of the scene text recognition test datasets used.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a general block diagram of the system;
FIG. 2 is a flow diagram of the TRBA model;
FIG. 3 is a structural diagram of the domain adaptation branch.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in FIG. 1, the system of the invention consists of three branches: a supervision branch, an unsupervised branch and a domain adaptation branch. The system contains two scene text recognition models, a student model and a teacher model, which have the same structure but different parameters. The supervision branch receives a labeled text image $X_L$ as input and calculates the cross-entropy loss between the predicted distribution $P_L$ of the student model and the label text string $Y_{gt}$. The unsupervised branch converts an unlabeled image $X_U$ into two enhanced views $X_{U1}$ and $X_{U2}$ by weak data enhancement and strong data enhancement, respectively. For the input image $X_{U1}$, the teacher model outputs a predicted distribution $P_{U1}$; for the input image $X_{U2}$, the student model outputs a predicted distribution $P_{U2}$. The predicted distribution $P_{U1}$ of the weakly enhanced view is then taken as the target, and a consistency regularization loss forces the output $P_{U2}$ of the strongly enhanced view to be closer to $P_{U1}$. To narrow the domain gap between the synthetic labeled data and the real unlabeled data, the invention calculates a CORAL loss to align the character-level features of the labeled and unlabeled data. Finally, a weighted sum of the three losses is calculated and the model is updated with the back-propagation algorithm.
Scene text recognition model. The scene text recognition model consists of a rectification module (a thin-plate spline, TPS), a feature extraction module (a convolutional residual network, ResNet), a sequence modeling module (a bidirectional LSTM, BiLSTM) and a prediction module (an attention mechanism), as shown in FIG. 2, and is called TRBA. The rectification module uses the thin-plate spline (TPS) transformation to convert the input image $X$ into a normalized image, which eases the subsequent recognition task. For an input image of size 32 × 100, the feature map of the last convolutional layer of the feature extraction module has size $C \times 1 \times T$, where $C$ is the number of channels and $T$ is the maximum sequence length of the decoder. The sequence modeling module uses a bidirectional long short-term memory network (BiLSTM) to learn better sequence features $H \in \mathbb{R}^{T \times C}$. The prediction module performs attention-based sequence prediction. At the $t$-th decoding time step, the prediction module uses the encoder output $H$, the previous hidden state $s_{t-1}$ and the embedding $f(y_{t-1})$ of the character predicted at the previous time step to predict the output of the current time step.
$$e_{t,i} = w^{\top} \tanh(W s_{t-1} + V h_i + b)$$
where $W$ and $V$ are trainable parameters and $h_i$ is the $i$-th feature vector of $H$. The attention weights $\alpha_{t,i}$ represent the importance of each position of the sequence feature $H$ at time step $t$. Using these weights as coefficients, the decoder computes the weighted sum $g_t$ of the feature vectors in $H$; $g_t$ is treated as the character-level feature in the domain adaptation branch. The recurrent unit of the prediction module takes $g_t$ as visual input and produces an output vector $s_t$:
$$s_t = \mathrm{rnn}(s_{t-1}, (g_t, f(y_{t-1})))$$
After the classifier layer, the TRBA model outputs a sequence prediction of size $T \times K$, where $K$ is the number of character classes.
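The attention step described above can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the softmax normalisation of the scores is an assumption (the text names the weights $\alpha$ but does not give their formula), and the function name `attention_step` and the list-based layout of `H`, `W`, `V`, `w` and `b` are hypothetical.

```python
import math

def attention_step(H, s_prev, W, V, w, b):
    """One attention step: scores e_{t,i}, weights alpha_{t,i}, glimpse g_t.

    H      : T x C list of sequence features h_i
    s_prev : previous hidden state s_{t-1}
    W, V   : projection matrices; w, b : vectors (all trainable in a real model)
    """
    def matvec(M, x):
        return [sum(m * v for m, v in zip(row, x)) for row in M]

    Ws = matvec(W, s_prev)  # W s_{t-1}, shared by every position i
    scores = []
    for h_i in H:
        Vh = matvec(V, h_i)  # V h_i
        hidden = [math.tanh(a + c + d) for a, c, d in zip(Ws, Vh, b)]
        scores.append(sum(wk * zk for wk, zk in zip(w, hidden)))  # w^T tanh(...)

    # Assumed normalisation: softmax over positions (numerically stabilised).
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    alpha = [x / total for x in exps]

    # g_t = sum_i alpha_{t,i} * h_i, used as the character-level feature.
    g = [sum(alpha[i] * H[i][c] for i in range(len(H))) for c in range(len(H[0]))]
    return alpha, g
```

With all-zero projections every position receives the same score, so the glimpse reduces to the mean of the rows of `H`, which is a convenient sanity check.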
Supervision branch. The supervision branch takes a labeled image $X_L$ as input and obtains the output $P_L$ through the rectification, feature extraction, sequence modeling and prediction stages of the student model. The cross-entropy loss $L_{sup}$ between the prediction of the student model and the label text string $Y_{gt}$ is then calculated, where $\theta_{student}$ denotes the model parameters of the student model.
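As a rough illustration of the supervised loss, the sketch below computes a mean per-character cross entropy from already-normalised predictions. The averaging over time steps, the function name `sequence_cross_entropy` and the integer encoding of $Y_{gt}$ are assumptions; the source does not reproduce the exact formula.

```python
import math

def sequence_cross_entropy(pred_probs, target_ids):
    """Mean negative log-likelihood of the ground-truth characters.

    pred_probs : T x K list, one probability distribution per decoding step (P_L)
    target_ids : ground-truth class index per step (Y_gt, already encoded)
    """
    eps = 1e-12  # guards against log(0)
    nll = [-math.log(max(step[y], eps)) for step, y in zip(pred_probs, target_ids)]
    return sum(nll) / len(nll)
```

A perfect prediction gives a loss of zero, and the loss grows as probability mass moves away from the labeled character.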
Unsupervised branch. The unsupervised branch requires only unlabeled text images $X_U$; no corresponding annotation strings are needed. The invention defines two data enhancement strategies in advance, distinguished by their strength: weak (data) enhancement and strong (data) enhancement. $X_U$ is converted to $X_{U1}$ by weak enhancement and to $X_{U2}$ by strong enhancement.
In the consistency regularization loss $L_{unl}$, $I$ denotes the indicator function. In practice, if the student model and the teacher model share the same structure and parameters, the unsupervised branch collapses severely and performs poorly. The invention therefore adds a projection module to the student model before the classification layer to break the symmetry. The position-wise feed-forward (PFF) module proposed in the Transformer is used as the projection module. In addition, the teacher model is updated with an exponential moving average so that the projection module is kept in an optimal state.
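The exact consistency loss is not reproduced in this text, so the following is only one plausible form, sketched under stated assumptions: the teacher distribution is sharpened with a temperature `tau`, a per-character indicator keeps only steps whose teacher confidence exceeds `beta`, and the student is pulled toward the sharpened target with a cross entropy. The function names and the per-character masking are hypothetical; the default values 0.4 and 0.3 are the temperature and confidence threshold quoted in the experimental setup.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, numerically stabilised."""
    top = max(logits)
    exps = [math.exp((z - top) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def consistency_loss(teacher_logits, student_logits, tau=0.4, beta=0.3):
    """Masked cross entropy between sharpened teacher and student predictions.

    teacher_logits : T x K logits for the weakly enhanced view (target, no gradient)
    student_logits : T x K logits for the strongly enhanced view
    """
    total = 0.0
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        confidence = max(softmax(t_logits))
        if confidence <= beta:  # indicator I(confidence > beta) is 0: skip this step
            continue
        target = softmax(t_logits, tau)  # sharpened teacher distribution
        student = softmax(s_logits)
        total += -sum(p * math.log(max(q, 1e-12)) for p, q in zip(target, student))
    return total / len(teacher_logits)
```

Low-confidence teacher predictions contribute nothing to the loss, which is the role the indicator function plays in this kind of objective.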
Domain adaptation branch. As shown in FIG. 3, the invention uses a domain adaptation loss to narrow the domain gap between the synthetic labeled data and the real unlabeled data. Rather than aligning the global feature spaces of the synthetic and real data, the invention aligns their distributions in the character-level feature space. The vector $g_t$, which automatically attends to local features at each decoding time step, is treated as the character-level feature. CORrelation Alignment (CORAL) is used to measure the distance between the distributions of the two sets of character-level features $F_L^{clf}$ and $F_{U2}^{clf}$. The domain adaptation loss $L_{da}$ is defined in terms of the covariance matrices of these two sets, where $U_L$ is the set of $F_L^{clf}$, $U_{U2}$ is the set of $F_{U2}^{clf}$, $\|\cdot\|_F^2$ denotes the squared matrix Frobenius norm and $\mathrm{cov}(U)$ is the covariance matrix of $U$.
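A CORAL-style distance between two feature sets can be sketched as below. The squared Frobenius norm of the covariance difference follows the description above; the $1/(4d^2)$ scaling is taken from the commonly used Deep CORAL formulation and is an assumption here, as are the function names and the list-of-vectors input format.

```python
def covariance(features):
    """Sample covariance (d x d) of n feature vectors given as an n x d list."""
    n, d = len(features), len(features[0])
    mean = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in features]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def coral_loss(feats_labeled, feats_unlabeled):
    """Squared Frobenius distance between covariances, scaled by 1 / (4 d^2)."""
    d = len(feats_labeled[0])
    cov_l = covariance(feats_labeled)    # cov(U_L)
    cov_u = covariance(feats_unlabeled)  # cov(U_U2)
    frob_sq = sum((cov_l[a][b] - cov_u[a][b]) ** 2
                  for a in range(d) for b in range(d))
    return frob_sq / (4 * d * d)  # assumed scaling, not stated in the source
```

Because only second-order statistics are compared, shifting every feature in one set by a constant vector leaves the loss unchanged.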
Training and inference. During training, the final loss is defined as follows:
$$L = L_{sup} + \lambda_{unl} L_{unl} + \lambda_{da} L_{da}$$
where $\lambda_{unl}$ and $\lambda_{da}$ are hyperparameters. Adam is chosen as the optimizer, and the parameters of the student model are updated by back-propagation. Gradients are not propagated through the teacher model; its parameters are instead updated with an Exponential Moving Average (EMA) mechanism, defined as follows:
$$\theta_{teacher} = \alpha \theta_{teacher} + (1 - \alpha) \theta_{student}$$
where $\alpha$ is a hyperparameter.
After training, only the student model is saved and used to predict the content of text images in the inference phase. In the inference phase, the prediction module first takes the "BOS" token, which marks the start of decoding, as input and outputs the first character. At each subsequent decoding time step, it uses the previously output character as input. Decoding ends once the "EOS" token is output.
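The weighted loss and the EMA update of the teacher can be sketched with plain dictionaries of scalar parameters. The function names and the dictionary representation are hypothetical, and the default weights (1.0, 0.01) and decay (0.999) are the values quoted in the experimental setup. How parameters that exist only in the student, such as the projection module, are handled is not specified in the text; this sketch simply leaves them out of the teacher.

```python
def total_loss(l_sup, l_unl, l_da, lambda_unl=1.0, lambda_da=0.01):
    """Weighted sum L = L_sup + lambda_unl * L_unl + lambda_da * L_da."""
    return l_sup + lambda_unl * l_unl + lambda_da * l_da

def ema_update(teacher, student, alpha=0.999):
    """In-place update: theta_teacher <- alpha * theta_teacher + (1 - alpha) * theta_student.

    Only parameter names already present in the teacher are updated (assumption).
    """
    for name in teacher:
        teacher[name] = alpha * teacher[name] + (1.0 - alpha) * student[name]
    return teacher
```

With a decay close to 1 the teacher changes slowly, so its predictions on the weakly enhanced view provide a stable target for the student.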
The training and test datasets and the experimental setup used in the present invention are described below.
(1) Data set
The system is trained on two synthetic labeled datasets, Synth90K and SynthText, with approximately 14.5M samples in total. Approximately 10.7M real unlabeled samples are used. The method is evaluated on six scene text recognition benchmark datasets. The synthetic labeled data and the test datasets are public, while the real unlabeled data is private.
The details of the six test datasets are as follows. IIIT5K-Words (IIIT5K) contains 3000 cropped word images for testing. Street View Text (SVT) consists of 647 word images collected from Google Street View; many of them have low resolution or are very noisy and blurred. ICDAR 2013 (IC13) contains 1015 cropped word images, and IC13_857 is a subset of IC13 that excludes word images shorter than 3 characters. SVT-Perspective (SVT-P) contains 645 cropped images for testing; most are perspective-distorted because they were taken from side-view angles in Google Street View. CUTE80 contains 288 high-resolution images for testing, some of which are curved. ICDAR 2015 (IC15) contains 2077 cropped word images, and IC15_1811 is a subset of IC15 that discards word images with non-alphanumeric characters.
The details of the real unlabeled dataset are as follows. It contains about 10.7 million word images cropped from three scene image datasets: Places2, OpenImages and ImageNet ILSVRC 2012. A scene text detector named CRAFT (Character Region Awareness for Text Detection) is used to detect and crop text images. The text confidence threshold is set to 0.7, and low-resolution text images (width multiplied by height less than 1000) are discarded. No other post-processing is applied, and 10.7M text images are finally obtained.
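The two filtering rules for the detected crops can be expressed as a small predicate. This is a sketch only: the function name `keep_crop` is hypothetical, and treating a confidence exactly equal to the threshold as accepted is an assumption, since the text does not say whether the comparison is strict.

```python
def keep_crop(confidence, width, height, min_confidence=0.7, min_area=1000):
    """Return True if a detected text crop passes both filters.

    confidence    : detector text confidence for the region
    width, height : crop size in pixels; crops with width * height < min_area are dropped
    """
    return confidence >= min_confidence and width * height >= min_area
```

For example, a 50 × 20 crop with confidence 0.9 has an area of exactly 1000 and is kept, while a 33 × 30 crop (area 990) is discarded.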
(2) Data enhancement
Data enhancement plays an important role in the consistency regularization training of the invention. As described above, weak and strong (data) enhancement are applied to the unlabeled images in the unsupervised branch. For a fair comparison with fully supervised training, strong enhancement is also used when training the supervision branch. The weak enhancement changes the brightness, contrast, saturation and hue of an image. The strong enhancement is inherited from RandAugment; it comprises channel transformations such as auto-contrast, brightness, color, equalization, color separation (posterization) and exposure (solarization), as well as spatial transformations such as shearing, translation and rotation.
(3) Experimental setup
The model and system of the present invention are implemented using PyTorch.
In the preprocessing stage, the original image is converted into enhanced images by the different enhancement strategies and resized to the same 32 × 100 size. To balance the ratio of labeled to unlabeled data, the batch size of the labeled data $N_L$ is set to 384 and the batch size of the unlabeled data $N_U$ is set to 128.
The system is trained with the Adam optimizer. To speed up convergence, the OneCycleLR learning rate scheduler is used with the maximum learning rate set to 0.001. The OneCycleLR scheduler first increases the learning rate from the base learning rate to the maximum learning rate and then decreases it gradually until the base learning rate is reached at the end of training. The temperature coefficient $\tau$ is set to 0.4 and the confidence threshold $\beta$ to 0.3. The exponential moving average decay rate is set to 0.999 to keep the projection layer in an optimal state. $\lambda_{unl}$ and $\lambda_{da}$ are set to 1.0 and 0.01, respectively. The system is trained jointly on the labeled and unlabeled data for a total of 250000 iterations.
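The shape of the learning rate schedule described above can be illustrated with a simplified, piecewise-linear stand-in. This is not PyTorch's `OneCycleLR` (which by default anneals with a cosine curve and ends below the starting rate); the function name, the base learning rate of 4e-5 and the 30% warm-up fraction are assumptions chosen for illustration, while the maximum rate of 0.001 and the 250000 iterations come from the text.

```python
def one_cycle_lr(step, total_steps=250_000, base_lr=4e-5, max_lr=1e-3, warmup_frac=0.3):
    """Piecewise-linear one-cycle schedule: base_lr -> max_lr -> base_lr."""
    peak = round(total_steps * warmup_frac)  # iteration at which max_lr is reached
    if step <= peak:
        # Warm-up phase: rise linearly from base_lr to max_lr.
        return base_lr + (max_lr - base_lr) * step / peak
    # Decay phase: fall linearly back to base_lr at the final iteration.
    return max_lr - (max_lr - base_lr) * (step - peak) / (total_steps - peak)
```

In a real PyTorch training loop one would normally use the library scheduler directly and call its `step()` once per iteration; the function here only makes the rise-then-fall profile explicit.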
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.
Claims (5)
1. A scene text recognition system based on consistency regularization training, characterized by comprising a supervision branch, an unsupervised branch and a domain adaptation branch; the supervision branch receives a text image $X_L$ as input and calculates the cross-entropy loss between the predicted distribution $P_L$ of the student model and the label text string $Y_{gt}$; the unsupervised branch converts an unlabeled image $X_U$ into two enhanced views $X_{U1}$ and $X_{U2}$ by weak data enhancement and strong data enhancement, respectively; for the input image $X_{U1}$, the teacher model outputs a predicted distribution $P_{U1}$; for the input image $X_{U2}$, the student model outputs a predicted distribution $P_{U2}$; the predicted distribution $P_{U1}$ of the weakly enhanced view $X_{U1}$ is then taken as the target, and a consistency regularization loss forces the output $P_{U2}$ of the strongly enhanced view $X_{U2}$ to be closer to $P_{U1}$; the domain adaptation branch calculates a CORAL loss to align the character-level features of the labeled data and the unlabeled data; finally, a weighted sum of the three losses is calculated and the model is updated with the back-propagation algorithm.
2. The system of claim 1, wherein the student model and the teacher model have the same structure but different parameters and adopt scene text recognition models.
3. The system according to claim 2, wherein the scene text recognition model comprises a rectification module, a feature extraction module, a sequence modeling module and a prediction module.
4. The system of claim 1, wherein a projection module is added to the student model before the classification layer in the unsupervised branch.
5. The system of claim 4, wherein the projection module is a position-wise feed-forward (PFF) module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210061855.XA CN114529904A (en) | 2022-01-19 | 2022-01-19 | Scene text recognition system based on consistency regular training |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114529904A (en) | 2022-05-24 |
Family
ID=81621053
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210061855.XA Pending CN114529904A (en) | 2022-01-19 | 2022-01-19 | Scene text recognition system based on consistency regular training |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114529904A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115082800A (en) * | 2022-07-21 | 2022-09-20 | 阿里巴巴达摩院(杭州)科技有限公司 | Image segmentation method |
CN115082800B (en) * | 2022-07-21 | 2022-11-15 | 阿里巴巴达摩院(杭州)科技有限公司 | Image segmentation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||