CN114255464A - Natural scene character detection and identification method based on CRAFT and SCRN-SEED framework - Google Patents
Natural scene character detection and identification method based on CRAFT and SCRN-SEED framework
- Publication number
- CN114255464A (application number CN202111530794.9A)
- Authority
- CN
- China
- Prior art keywords
- network
- training
- scrn
- data set
- craft
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
- Character Discrimination (AREA)
Abstract
The invention discloses a natural scene character detection and identification method based on the CRAFT and SCRN-SEED frameworks, which comprises the following steps: (1) establishing an image data set from a real data set and a synthetic data set, and dividing it into a training set and a test set; (2) training a CRAFT network with the image data set; (3) training the irregular-text correction network SCRN with the real data set; (4) combining the SCRN with the SEED network and training the combined SCRN-SEED network; (5) connecting the CRAFT network with the SCRN-SEED network, and constructing and training the complete model. The method fully detects curved, deformed, or long text instances: it locates each character accurately and then links the detected characters into a text line through an affinity mechanism, making it suitable for curved, deformed, or extremely long text. By correcting irregular text pictures and using global semantic information during recognition, low-quality text instances can also be accurately recognized.
Description
Technical Field
The invention relates to the technical field of character detection methods, in particular to a natural scene character detection and identification method based on CRAFT and SCRN-SEED frameworks.
Background
Optical character recognition (OCR) conventionally refers to analyzing an input scanned document image to recognize the character information it contains. The technique assumes that the input image has a clean background, simple fonts, and regular character arrangement, and it can reach a high recognition level when these requirements are met. Scene text recognition (STR) refers to recognizing text information in natural scene pictures and is far more difficult than recognizing text in scanned documents, because characters in natural scenes appear in extremely varied forms: text lines may be horizontal, vertical, curved, rotated, or twisted; character regions in the image may be incomplete or blurred; and backgrounds vary widely, as characters may appear on a flat, curved, or wrinkled surface, complex interfering textures may lie near the character region, or non-character regions may carry character-like textures such as sand, grass, fences, and brick walls.
Scene text detection methods based on neural networks achieve good results and far surpass traditional techniques in detection and recognition, yet they still cannot handle curvature, blur, and interfering textures well in natural scenes. Existing natural scene character detection methods have the following problems. First, character localization is inaccurate: traditional text localization frameworks mostly attend to a whole line of text, require a large receptive field, and mark the text position with a single rectangular box, which is unsuitable for curved, deformed, or extremely long text and makes accurate localization and labeling difficult. Second, character recognition is inaccurate: many recognition methods based on the encoder-decoder framework have been proposed for curved text, but most rely on local visual features and ignore global semantic information, so recognition accuracy drops sharply under image blur, uneven illumination, incomplete characters, and similar conditions.
Disclosure of Invention
The purpose of the invention is as follows: in view of the above problems, the present invention aims to provide a natural scene text detection and recognition method based on a CRAFT and SCRN-SEED framework.
The technical scheme is as follows: the invention discloses a natural scene character detection and identification method based on CRAFT and SCRN-SEED frameworks, which comprises the following steps:
(1) establishing an image data set by utilizing the real data set and the synthetic data set, and dividing the image data set into a training set and a testing set;
(2) training the CRAFT network with an image data set:
(201) improving the CRAFT network, taking a ResNet50 network as a backbone network, inputting the pictures in the synthetic data set into the improved CRAFT network for feature extraction, and outputting a region score and an affinity score;
(202) encoding the two scores through Gaussian heatmap mapping to generate a Gaussian heatmap;
(203) splitting the complete text in an input picture into individual characters using a watershed algorithm, and generating polygons of arbitrarily shaped text from the characters through post-processing;
(204) initializing the improved CRAFT network with a pre-trained model, applying the idea of transfer learning;
(3) training the irregular-text correction network SCRN with the real data set;
(4) combining the SCRN with the SEED network, and training the combined SCRN-SEED network;
(5) and connecting the improved CRAFT network and the SCRN-SEED network, constructing a complete model and training.
Further, the step of initializing the improved CRAFT network by using the pre-training model according to the idea of transfer learning includes:
first, the CRAFT network is trained with the synthetic data set and optimized with an Adam optimizer; the network is then fine-tuned with multiple real data sets, mixing in the SynthText data set at a 1:5 ratio and applying online hard example mining at a 1:3 ratio;
then, the CRAFT network is trained with a real data set containing quadrilateral labels together with the SynthText data set, and part of the data is split off as a test set to tune the network parameters.
Further, the step of combining the SCRN and the SEED network and training the combined SCRN-SEED network comprises:
replacing the image correction module in the SEED network with the trained SCRN to construct the SCRN-SEED network, initializing its parameters with a pretrained language model of the semantic model FastText, preliminarily training the SCRN-SEED network with the test set, and adjusting the network parameters according to the training effect.
Further, the step of connecting the improved CRAFT network with the SCRN-SEED network and constructing and training the complete model comprises: generating a minimum rectangular box containing all the characters from the polygons of arbitrarily shaped text, cropping the rectangular box, adjusting the format of the cropped pictures, and inputting them into the SCRN-SEED network to complete the model; training the model with a validation set and keeping the parameters with the best training effect; and then inputting natural scene pictures into the model to perform automatic character detection and recognition.
Further, the real data sets come from ICDAR2013, ICDAR2015, ICDAR2017, MSRA-TD500, TotalText, CTW-1500, and the like, and the synthetic data set is the SynthText data set;
each picture in the image data set is resized, and the pictures in the data set are converted to the LMDB format.
Beneficial effects: compared with the prior art, the invention has the following notable advantages:
1. the method can fully detect curved, deformed, or long text instances: it locates each character accurately and then links the detected characters into a text line through an affinity mechanism, so it only needs to attend to the spacing between characters rather than the whole line of text, requires no large receptive field, and is suitable for curved, deformed, or extremely long text;
2. the invention can accurately recognize low-quality text instances: the correction module of the SCRN corrects irregular text pictures according to the output of the detector, using the center line of each character instance and several geometric attributes, which gives a better correction effect; the corrected text picture is then recognized by the recognition module of the SEED network, which uses semantic information to predict global information, greatly improving accuracy on text with defects, blur, and similar degradations.
Drawings
FIG. 1 is a diagram of the basic framework of the improved CRAFT model of the present invention;
FIG. 2 is a basic frame diagram of an irregular text correction network SCRN employed in the present invention;
FIG. 3 is a basic framework diagram of the SEED network identification module employed in the present invention;
FIG. 4 is a diagram of an overall model structure of the method for detecting and identifying characters in a natural scene according to the present invention;
fig. 5 is a diagram illustrating an exemplary effect of the method for detecting and identifying characters in a natural scene according to the present invention.
Detailed Description
The method for detecting and identifying the natural scene characters based on the CRAFT and the SCRN-SEED framework comprises the following steps:
(1) An image data set is established using the real data set and the synthetic data set and divided into a training set and a test set; each picture in the image data set is resized, and the pictures in the data set are converted to the LMDB format.
The real data set is from ICDAR2013, ICDAR2015, ICDAR2017, MSRA-TD500, TotalText and CTW-1500, and the synthetic data set is a SynthText data set.
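The resizing in step (1) can be sketched as a nearest-neighbour resize. This is a dependency-free illustration only; a real pipeline would use OpenCV or PIL for resizing and an LMDB writer for the format conversion, and the function name here is ours:

```python
import numpy as np

def resize_nn(img, out_h, out_w):
    """Nearest-neighbour resize of an (H, W[, C]) image array."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return img[rows][:, cols]

img = np.arange(12).reshape(3, 4)
print(resize_nn(img, 6, 8).shape)  # (6, 8)
```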
(2) The CRAFT network is trained using an image data set, and the flowchart is shown in FIG. 1.
(201) Improving the CRAFT network, taking a ResNet50 network as a backbone network, inputting the pictures in the synthetic data set into the improved CRAFT network for feature extraction, and outputting a Region Score and an Affinity Score;
(202) Encoding the two scores through Gaussian heatmap mapping to generate a Gaussian heatmap;
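The Gaussian heatmap encoding of step (202) can be sketched as follows. This is a simplified NumPy illustration: CRAFT warps an isotropic 2-D Gaussian onto each character quadrilateral, whereas here an axis-aligned Gaussian is simply centred on each character box (the function name and sigma ratio are illustrative):

```python
import numpy as np

def gaussian_heatmap(h, w, boxes, sigma_ratio=0.3):
    """Render one 2-D Gaussian per character box onto an h x w score map.

    boxes: list of (x0, y0, x1, y1) axis-aligned character boxes.
    A full CRAFT implementation perspective-warps the Gaussian onto each
    character quadrilateral; here one is simply centred on each box.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w), dtype=np.float32)
    for x0, y0, x1, y1 in boxes:
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        sigma = sigma_ratio * max(x1 - x0, y1 - y0)
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        heat = np.maximum(heat, g)  # overlapping characters keep the max
    return heat

heat = gaussian_heatmap(32, 64, [(4, 8, 20, 24), (36, 8, 52, 24)])
print(heat.shape)  # (32, 64), peaking at the two box centres
```

The affinity score can be encoded the same way, with the Gaussians placed between adjacent character boxes rather than on them.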
(203) Splitting the complete text in an input picture into individual characters using a watershed algorithm, and generating polygons of arbitrarily shaped text from the characters through post-processing;
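Step (203) splits the region-score map into character instances with a watershed transform. As a much simpler stand-in, the sketch below labels connected components of the thresholded score map, which behaves the same way whenever the character Gaussians do not touch; a real implementation would run watershed (e.g. OpenCV's) so that touching characters are still split along the valleys between peaks:

```python
import numpy as np
from collections import deque

def split_characters(score_map, thresh=0.5):
    """Label connected regions of score_map above thresh (4-connectivity).

    Returns (labels, n): an int map where 0 is background and 1..n are
    character instance ids.
    """
    mask = score_map > thresh
    labels = np.zeros(mask.shape, dtype=np.int32)
    h, w = mask.shape
    n = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and labels[i, j] == 0:
                n += 1
                labels[i, j] = n
                q = deque([(i, j)])
                while q:  # flood fill the current character blob
                    y, x = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and labels[ny, nx] == 0:
                            labels[ny, nx] = n
                            q.append((ny, nx))
    return labels, n

# two separated blobs -> two character instances
score = np.zeros((8, 16), dtype=np.float32)
score[2:5, 2:5] = 1.0
score[2:5, 10:13] = 1.0
labels, n = split_characters(score)
print(n)  # 2
```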
(204) Applying the idea of transfer learning, initializing the improved CRAFT network with a pre-trained model:
first, the CRAFT network is trained with the synthetic data set and optimized with an Adam optimizer; the network is then fine-tuned with multiple real data sets, mixing in the SynthText data set at a 1:5 ratio to ensure that character regions are reliably separated, and applying online hard example mining at a 1:3 ratio; data augmentation techniques such as cropping, rotation, or color changes may also be applied.
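The 1:5 SynthText mix and 1:3 hard-example mix used during fine-tuning can be realised with a simple interleaved sampler. The sketch below only illustrates the mixing schedule; the data sets are placeholder lists and the function name is ours:

```python
import random

def mixed_batches(real, synth, steps, synth_ratio=5, rng=None):
    """Yield `steps` samples, drawing 1 synthetic sample for every
    `synth_ratio` real samples (the 1:5 SynthText mix; swap in a pool of
    hard examples with synth_ratio=3 for the 1:3 online hard mining mix)."""
    rng = rng or random.Random(0)
    for step in range(steps):
        if step % (synth_ratio + 1) == synth_ratio:
            yield ("synth", rng.choice(synth))
        else:
            yield ("real", rng.choice(real))

samples = list(mixed_batches(["real_a", "real_b"], ["synthtext_a"], steps=12))
print(sum(1 for src, _ in samples if src == "synth"))  # 2
```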
Then, the CRAFT network is trained with a real data set containing quadrilateral labels together with the SynthText data set, and part of the data is split off as a test set to tune the network parameters.
(3) The irregular text correction network SCRN is trained using the real data set, and the network framework is shown in fig. 2.
(4) The SCRN is combined with the SEED network, and the combined SCRN-SEED network is trained as the recognition network. The framework of the recognition module in the original SEED network is shown in fig. 3.
The image correction module in the original SEED network is replaced with the trained SCRN to construct the SCRN-SEED network, which improves the correction effect and recognition accuracy for irregular text and reduces the difficulty of model training. A pretrained language model of the semantic model FastText is downloaded for the target recognition language and used to initialize the parameters; the SCRN-SEED network is then preliminarily trained with the test set, and the network parameters are adjusted according to the training effect.
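Initialising the semantic branch from FastText amounts to loading pretrained word vectors and using them as supervision targets for the predicted global semantic feature. The sketch below is an assumption-laden miniature: the 3-dimensional vectors stand in for real 300-dimensional FastText embeddings, and the lookup-with-zero-fallback is one plausible handling of out-of-vocabulary words:

```python
import numpy as np

# Stand-in for vectors loaded from a FastText .vec file
# (each line: a word followed by its embedding components).
PRETRAINED = {
    "stop": np.array([0.1, 0.9, 0.2]),
    "exit": np.array([0.2, 0.8, 0.1]),
}

def semantic_target(transcription, table):
    """Look up the embedding that supervises the semantic branch;
    unknown words fall back to a zero vector."""
    vec = table.get(transcription.lower())
    return vec if vec is not None else np.zeros(3)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

t = semantic_target("STOP", PRETRAINED)
print(cosine(t, PRETRAINED["exit"]) > 0.9)  # semantically close words score high
```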
(5) The improved CRAFT network is connected with the SCRN-SEED network, and the complete model is constructed and trained.
A minimum rectangular box containing all the characters is generated from the polygon of arbitrarily shaped text; the rectangular box is cropped, the cropped picture format is adjusted, and the picture is input into the SCRN-SEED network to complete the model. The constructed model framework, shown in fig. 4, comprises three parts: the CRAFT network as the detector, the SCRN network as the correction network, and the SEED network as the recognizer. The model is trained with a validation set and the parameters with the best training effect are kept; natural scene pictures are then input into the trained model to perform automatic character detection and recognition. Fig. 5 shows the effect at each stage when a natural scene picture is localized and recognized by the model.
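The handoff from detector to recogniser, taking the minimum rectangle around a text polygon and cropping it, can be sketched as follows (axis-aligned for simplicity; a production system might instead use a rotated minimum-area rectangle such as OpenCV's minAreaRect):

```python
import numpy as np

def crop_min_rect(image, polygon):
    """Crop the smallest axis-aligned rectangle containing every vertex
    of an arbitrarily shaped text polygon, clamped to the image bounds."""
    pts = np.asarray(polygon)
    h, w = image.shape[:2]
    x0 = max(int(pts[:, 0].min()), 0)
    y0 = max(int(pts[:, 1].min()), 0)
    x1 = min(int(np.ceil(pts[:, 0].max())), w)
    y1 = min(int(np.ceil(pts[:, 1].max())), h)
    return image[y0:y1, x0:x1]

img = np.arange(100 * 200).reshape(100, 200)
poly = [(30, 10), (90, 12), (88, 40), (32, 44)]  # a slightly skewed text polygon
print(crop_min_rect(img, poly).shape)  # (34, 60)
```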
Claims (5)
1. The natural scene character detection and identification method based on the CRAFT and the SCRN-SEED framework is characterized by comprising the following steps:
(1) establishing an image data set by utilizing the real data set and the synthetic data set, and dividing the image data set into a training set and a testing set;
(2) training the CRAFT network with an image data set:
(201) improving the CRAFT network, taking a ResNet50 network as a backbone network, inputting the pictures in the synthetic data set into the improved CRAFT network for feature extraction, and outputting a region score and an affinity score;
(202) encoding the two scores through Gaussian heatmap mapping to generate a Gaussian heatmap;
(203) splitting the complete text in an input picture into individual characters using a watershed algorithm, and generating polygons of arbitrarily shaped text from the characters through post-processing;
(204) initializing the improved CRAFT network with a pre-trained model, applying the idea of transfer learning;
(3) training the irregular-text correction network SCRN with the real data set;
(4) combining the SCRN with the SEED network, and training the combined SCRN-SEED network;
(5) connecting the improved CRAFT network with the SCRN-SEED network, and constructing and training the complete model.
2. The method for detecting and recognizing characters in natural scene according to claim 1, wherein the step of initializing the improved CRAFT network by using the pre-training model based on the idea of transfer learning comprises:
first, the CRAFT network is trained with the synthetic data set and optimized with an Adam optimizer; the network is then fine-tuned with multiple real data sets, mixing in the SynthText data set at a 1:5 ratio and applying online hard example mining at a 1:3 ratio;
then, the CRAFT network is trained with a real data set containing quadrilateral labels together with the SynthText data set, and part of the data is split off as a test set to tune the network parameters.
3. The method for detecting and recognizing natural scene characters as claimed in claim 1, wherein the step of combining the SCRN with the SEED network and training the combined SCRN-SEED network comprises:
replacing the image correction module in the SEED network with the trained SCRN, initializing the parameters with a pretrained language model of the semantic model FastText, preliminarily training the resulting SCRN-SEED network with the test set, and adjusting the network parameters according to the training effect.
4. The natural scene character detection and recognition method of claim 1, wherein the step of connecting the improved CRAFT network with the SCRN-SEED network and constructing and training the complete model comprises: generating a minimum rectangular box containing all the characters from the polygons of arbitrarily shaped text, cropping the rectangular box, adjusting the format of the cropped pictures, and inputting them into the SCRN-SEED network to complete the model; training the model with a validation set and keeping the parameters with the best training effect; and then inputting natural scene pictures into the model to perform automatic character detection and recognition.
5. The natural scene character detection and recognition method of claim 1, wherein the real data set comes from the ICDAR2013, ICDAR2015, ICDAR2017, MSRA-TD500, TotalText, and CTW-1500 databases, and the synthetic data set is the SynthText data set;
each picture in the image data set is resized, and the pictures in the data set are converted to the LMDB format.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111530794.9A CN114255464A (en) | 2021-12-14 | 2021-12-14 | Natural scene character detection and identification method based on CRAFT and SCRN-SEED framework |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111530794.9A CN114255464A (en) | 2021-12-14 | 2021-12-14 | Natural scene character detection and identification method based on CRAFT and SCRN-SEED framework |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114255464A true CN114255464A (en) | 2022-03-29 |
Family
ID=80792318
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111530794.9A Pending CN114255464A (en) | 2021-12-14 | 2021-12-14 | Natural scene character detection and identification method based on CRAFT and SCRN-SEED framework |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114255464A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114972947A (en) * | 2022-07-26 | 2022-08-30 | 之江实验室 | Depth scene text detection method and device based on fuzzy semantic modeling |
CN114972947B (en) * | 2022-07-26 | 2022-12-06 | 之江实验室 | Depth scene text detection method and device based on fuzzy semantic modeling |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111325203B (en) | American license plate recognition method and system based on image correction | |
CN109241894B (en) | Bill content identification system and method based on form positioning and deep learning | |
CN111723585B (en) | Style-controllable image text real-time translation and conversion method | |
CN111626292B (en) | Text recognition method of building indication mark based on deep learning technology | |
CN113537227B (en) | Structured text recognition method and system | |
CN112307919B (en) | Improved YOLOv 3-based digital information area identification method in document image | |
CN111242024A (en) | Method and system for recognizing legends and characters in drawings based on machine learning | |
CN110796131A (en) | Chinese character writing evaluation system | |
CN112069900A (en) | Bill character recognition method and system based on convolutional neural network | |
Tardón et al. | Optical music recognition for scores written in white mensural notation | |
CN116704523B (en) | Text typesetting image recognition system for publishing and printing equipment | |
CN115880566A (en) | Intelligent marking system based on visual analysis | |
CN115240210A (en) | System and method for auxiliary exercise of handwritten Chinese characters | |
CN114255464A (en) | Natural scene character detection and identification method based on CRAFT and SCRN-SEED framework | |
CN117437647B (en) | Oracle character detection method based on deep learning and computer vision | |
CN109508716B (en) | Image character positioning method and device | |
CN109147002B (en) | Image processing method and device | |
CN114581932A (en) | Picture table line extraction model construction method and picture table extraction method | |
CN113989806A (en) | Extensible CRNN bank card number identification method | |
CN111832497B (en) | Text detection post-processing method based on geometric features | |
CN110766001B (en) | Bank card number positioning and end-to-end identification method based on CNN and RNN | |
CN111414889A (en) | Financial statement identification method and device based on character identification | |
CN115311666A (en) | Image-text recognition method and device, computer equipment and storage medium | |
Nath et al. | Improving various offline techniques used for handwritten character recognition: a review | |
CN113657162A (en) | Bill OCR recognition method based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||