CN112733822B - End-to-end text detection and identification method

Info

Publication number
CN112733822B
CN112733822B (application CN202110344324.7A)
Authority
CN
China
Prior art keywords
text
text box
image
feature
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110344324.7A
Other languages
Chinese (zh)
Other versions
CN112733822A (en)
Inventor
姜华
王晴晴
杜沁益
李蔡元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Minpu Technology Co ltd
Original Assignee
Shanghai Minpu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Minpu Technology Co ltd filed Critical Shanghai Minpu Technology Co ltd
Priority to CN202110344324.7A priority Critical patent/CN112733822B/en
Publication of CN112733822A publication Critical patent/CN112733822A/en
Application granted granted Critical
Publication of CN112733822B publication Critical patent/CN112733822B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 30/412: Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06N 3/045: Combinations of networks
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 30/153: Segmentation of character regions using recognition of characters or words

Abstract

The invention belongs to the technical field of visual recognition and discloses an end-to-end text detection and recognition method. The method filters background pixels using the semantic segmentation result of an input text image and generates a set of preset text boxes; it then classifies the preset text boxes and regresses a plurality of reference points on their edges to detect target text boxes. Features of the input text image are extracted with scale and spatial transformations, a recognizer is trained under a feature similarity constraint strategy, and the trained recognizer finally recognizes the character sequences inside the target text boxes. The method detects and recognizes text in natural scene images end to end, and improves the robustness of the model to curved and low-resolution text while preserving model efficiency, so it has both novelty and practical application value.

Description

End-to-end text detection and identification method
Technical Field
The invention relates to the technical field of visual recognition, in particular to an end-to-end text detection and recognition method.
Background
Characters play a very important role in daily life, conveying information and knowledge in the form of traffic signs, poster advertisements, product descriptions on packaging, and so on. With the popularization of devices with cameras, such as mobile phones and vehicle-mounted cameras, more and more text is collected, transmitted and stored as images. Automatically detecting and recognizing text in images has broad application prospects in intelligent transportation, image detection, scene understanding and other fields, so related research has long attracted attention in the field of computer vision.
In recent years, deep-learning network models have become the key solution for tasks in speech recognition, computer vision and natural language processing, and text detection and recognition have likewise entered the deep learning era. Existing deep-learning text detection algorithms fall into three main categories: network models based on semantic segmentation, network models based on object detection, and hybrid models. Semantic-segmentation-based models make pixel-level predictions on the text image and infer the position, shape and angle of the text box to which each pixel belongs from the prediction result. Object-detection-based models treat text as a specific kind of object and directly output target text box information by classifying and regressing a large number of preset text boxes. Although both kinds of model achieve excellent text detection performance, they have drawbacks: semantic-segmentation-based models are not end-to-end detectors and usually require a large amount of complicated post-processing to infer the target text boxes from the prediction result, while object-detection-based models easily miss text regions with large aspect ratios. Hybrid text detection models combine the strengths of the two approaches and compensate for their weaknesses by predicting pixels and preset text boxes simultaneously, which can effectively improve the detection rate.
Existing deep-learning text recognition models can be divided, according to their sequence prediction module, into recognition models based on the attention mechanism and recognition models based on connectionist temporal classification (CTC). Both use a convolutional neural network (CNN) and a long short-term memory (LSTM) network to extract features from the text image and encode feature segments; the difference is that attention-based models decode the feature sequence with an attention-based GRU or LSTM to output a character string sequence, while CTC-based models use CTC with a forward-backward algorithm to map frame-level predictions to the character string sequence. Both kinds of recognition model, however, face the following problems. First, they recognize curved text poorly and require an additional text rectification module; moreover, because the LSTM only accepts one-dimensional feature vectors as input, the two-dimensional feature map must be mapped to a one-dimensional space by flattening or pooling, which destroys the spatial and structural information of the image and degrades recognition performance. Second, they are not robust to low-resolution text images: since the resolution of natural scene text images varies widely, a low-resolution image becomes blurred after being enlarged by the scale normalization operation in the preprocessing stage, which further hurts recognition performance.
Disclosure of Invention
To solve these problems, the invention provides an end-to-end text detection and recognition method. The method first filters out most background pixels based on a semantic segmentation idea, then performs classification and regression prediction on preset text boxes for the remaining text pixels and directly outputs information such as the position and shape of the target text boxes, and finally recognizes the text with a data self-enhancing recognizer designed with a feature similarity constraint.
The invention can be realized by the following technical scheme:
an end-to-end text detection and identification method filters out background pixels using the semantic segmentation result of an input text image and generates a set of preset text boxes; it classifies the preset text boxes and regresses a plurality of reference points on their edges to detect target text boxes; it extracts features of the input text image using scale and spatial transformations and trains a recognizer with a feature similarity constraint strategy; and it finally recognizes the character sequences in the target text boxes with the trained recognizer.
Further, the method for generating the preset text box set is as follows: establish an image library containing character sequences and normalize each text image; use a full convolution network and an up-sampling network to extract multi-scale feature maps with different scaling ratios from the input text image; with these feature maps as input, generate a semantic segmentation map using several convolutional layers followed by a sigmoid function; at the same time, generate region proposals at all pixel positions of the multi-scale feature maps with a region proposal network (RPN); then set a probability threshold according to the semantic segmentation map, filter out the region proposals corresponding to pixel points below the probability threshold, and record the remaining region proposals as the preset text box set.
Further, the method for generating the preset text box set comprises the following steps:
Step 1: collecting and expanding a natural scene text image data set as a training sample set, and annotating a text region R in a training image I as GTR = [(x1, y1), (x2, y2), …, (xN, yN), txt], wherein (xn, yn) are the coordinates of the n-th reference point on the edge of the text region R, N is a predefined total number of reference points, and txt is the character string content of the text region R;
Step 2: multi-scale feature extraction based on a full convolution network and an up-sampling network: after the samples are normalized, the full convolution network is used to extract features from the input text image and generate U groups of feature maps F1, F2, …, FU with scaling ratios 1/2^T, 1/2^(T+1), 1/2^(T+2), …, 1/2^(T+U), and the up-sampling network is then used to fuse the extracted features and generate an additional U groups of feature maps F1, F2, …, FU at the same scales;
Step 3: with the feature maps F1, F2, …, FU as input, the feature maps required for semantic segmentation are computed using several convolutional layers, and a sigmoid function then computes the probability that each pixel is text at every scale, i.e. generates the semantic segmentation maps S1, S2, …, SU;
Step 4: the RPN is used to generate region proposals at all pixel positions of the multi-scale feature maps, a probability threshold is set on the values of the semantic segmentation maps S1, S2, …, SU, the region proposals corresponding to pixel points below the probability threshold are filtered out, and the remaining region proposals are recorded as the preset text box set B.
Further, the method for detecting the target text box is as follows: first perform feature extraction on each preset text box with the RoIAlign method to generate a feature vector of specified length; then use a fully connected layer to make a classification prediction for each preset text box and a regression prediction for the reference points [(x'1, y'1), (x'2, y'2), …, (x'N, y'N)] obtained by sampling the preset text box at equal distances, generating for each preset text box a text score Sc and reference point offsets (Δx1, Δy1, Δx2, Δy2, …, ΔxN, ΔyN); keep the preset text boxes whose text score Sc is greater than a set score threshold, compute the positions of the regressed reference points according to the formulas x_ti = x'_i + Δx_i and y_ti = y'_i + Δy_i, and connect the reference points together to generate the target text region, namely the target text box.
Further, the method for generating the target text box comprises the following steps:
Step (1): for the preset text boxes of different sizes in the set B, a feature vector of specified length is first generated with the RoIAlign method; a fully connected layer is then used to make a classification prediction for each preset text box and a regression prediction for the reference points [(x'1, y'1), (x'2, y'2), …, (x'N, y'N)] obtained by sampling the preset text box at equal distances, generating for each preset text box a text score Sc and reference point offsets (Δx1, Δy1, Δx2, Δy2, …, ΔxN, ΔyN);
Step (2): the text regions whose text score Sc is greater than a set score threshold are retained, the positions of the regressed reference points are computed according to the formulas x_ti = x'_i + Δx_i and y_ti = y'_i + Δy_i, the reference points are connected together to generate the target text region, namely the target text box, and redundant target text boxes are finally eliminated with a non-maximum suppression algorithm.
Further, the method for training the recognizer is as follows: first, for an input text image T with annotation GTR = [(x1, y1), (x2, y2), …, (xN, yN), txt] and annotated height h, apply three scale transformations to obtain transformed images T1, T2, T3, and apply distortion correction with a thin plate spline transformation according to the annotated reference points to obtain a transformed image T4 of height h1; then use a full convolution network to extract two-dimensional features from the transformed images T1, T2, T3, T4, down-sample the feature maps of different scales by different factors according to their sizes so that they are mapped to the same scale space, convert the two-dimensional features into a one-dimensional space through a flattening operation, and extract one-dimensional feature vectors with a group of fully connected layers, the corresponding feature vectors being v1, v2, v3, v4; with these as input, compute the feature similarity constraint loss, perform character string sequence prediction with a fully connected layer with a self-attention mechanism, and compute the character string prediction loss from the prediction result; finally, train the whole network structure end to end with a total loss function, namely a linear combination of the semantic segmentation loss, the preset text box classification and regression losses, the feature similarity constraint loss and the character string prediction losses, to obtain the optimal network model parameters.
Further, apply one size transformation to the target text box to obtain a transformed image T'; then use the full convolution network to extract two-dimensional features from T', down-sample the feature map to a specific scale space according to its size, convert the two-dimensional features into a one-dimensional space through a flattening operation, and extract a one-dimensional feature vector with the trained group of fully connected layers, the corresponding feature vector being v'; with v' as input, recognize the character sequence in the target text box with the trained fully connected layer with attention mechanism.
Further, the transformed images T1, T2, T3 are given by a scale transformation equation (reproduced only as an image in the original publication) built from f, d and u, wherein f(T, hi) denotes normalizing the input text image T to height hi while maintaining the aspect ratio, d(.) denotes 2-fold down-sampling, u(.) denotes 2-fold up-sampling, and h1, h2, h3, thred1, thred2, thred3 are predefined values with h1 = 2*h2 = 3*h3;
when the height hp of the target text box satisfies hp > thred1, T' = f(TP, h1); when thred1 ≥ hp > thred2, T' = f(TP, h2); and when hp ≤ thred2, T' = f(TP, h3);
the attention is computed by a set of equations (reproduced only as images in the original publication) whose outputs are the attention weights and the attention-weighted feature vector, representing respectively the attention magnitude and the feature vector after attention weighting;
the total loss function (also given only as an equation image) is a linear combination, with a weight for each term, of the semantic segmentation loss, the preset text box classification loss, the preset text box regression loss, the feature similarity constraint loss, and the four character string prediction losses.
The beneficial technical effects of the invention are as follows:
(1) Most background pixels are filtered out using the prediction result of the image segmentation module, which greatly reduces the number of preset text boxes to be predicted and improves the efficiency of the model.
(2) Regression prediction is performed on the reference points on the edges of the preset text boxes, which helps detect text regions of arbitrary orientation and shape.
(3) The data are augmented with scale and spatial transformations, and features with strong expressive power are extracted from the text image under a feature similarity constraint strategy, which improves the robustness of the model when recognizing curved text and low-resolution text images.
Drawings
FIG. 1 is a block diagram of an implementation of the detection and identification method of the present invention;
FIG. 2 is a schematic flow diagram of the detection and identification method of the present invention.
Detailed Description
The following detailed description of the preferred embodiments will be made with reference to the accompanying drawings.
As shown in FIG. 1 and FIG. 2, the present invention provides an end-to-end text detection and recognition method, which filters background pixels from the semantic segmentation result of an input text image to generate a preset text box set, classifies the preset text boxes and regresses a plurality of reference points on their edges to detect target text boxes, performs feature extraction on the input text image with scale and spatial transformations, trains a recognizer with a feature similarity constraint strategy, and finally recognizes the character sequences in the target text boxes with the trained recognizer. The method specifically comprises the following steps:
Step 1: collecting and expanding a natural scene text image data set as a training sample set;
Images and their annotations in public databases such as ICDAR2015, ICDAR2017 MLT, SynthText and TotalText are collected as training samples. The sample region annotations are then expanded according to the model training requirements: reference points are sampled on the boundary of each text region as a new annotation format. For a training image I, the annotation of a text region R can be represented as GTR = [(x1, y1), (x2, y2), …, (xN, yN), txt], where (xn, yn) are the coordinates of the n-th reference point, N is the predefined total number of reference points, and txt is the character string content of the text region.
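To make the expanded annotation format concrete, the following minimal Python sketch shows one way such a GTR record could be held in memory; the class and field names are illustrative assumptions and are not part of the patent.

```python
# Hypothetical container for the expanded annotation GTR of one text region:
# N sampled reference points on the region boundary plus the transcription txt.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TextRegionGT:
    points: List[Tuple[float, float]]   # [(x1, y1), ..., (xN, yN)] on the region edge
    txt: str                            # character string content of the region

gt = TextRegionGT(
    points=[(10, 5), (60, 4), (110, 8), (110, 30), (60, 27), (10, 30)],  # N = 6 here
    txt="EXIT",
)
print(len(gt.points), gt.txt)
```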
Step 2: extracting multi-scale features based on a full convolution network and an up-sampling network;
Training stage: the training samples are preprocessed by flipping, scaling, pixel normalization and so on, and 8 rectangular regions of size 512 x 512 are then randomly cropped per batch as the network input for model training.
Testing stage: with the aspect ratio maintained, the longest side of the picture is normalized to 1600 or 2400; the picture is then pixel-normalized and fed to the network one image per batch.
To ensure the robustness of the model to text size, the network first extracts features from the input picture with a full convolution network such as ResNet-50, generating feature maps F1, F2, F3, F4 at scales of 1/2, 1/4, 1/8 and 1/16. Then, to fuse higher-level and lower-level features, the network fuses the features by up-sampling with an up-sampling network such as FPN and generates another four groups of feature maps F1, F2, F3, F4 at the same scales.
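As an illustration of this step only, the sketch below builds a toy multi-scale extractor with a top-down fusion path; the channel widths, number of stages and layer choices are assumptions for readability and are much simpler than a real ResNet-50/FPN backbone.

```python
# A minimal sketch (assumed toy sizes) of bottom-up multi-scale extraction followed
# by FPN-style top-down fusion with 2x up-sampling, as described in step 2.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBackboneFPN(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        # four down-sampling stages -> feature maps at 1/2, 1/4, 1/8, 1/16 scale
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c_in, ch, 3, stride=2, padding=1), nn.ReLU())
            for c_in in (3, ch, ch, ch)
        ])
        self.lateral = nn.ModuleList([nn.Conv2d(ch, ch, 1) for _ in range(4)])

    def forward(self, x):
        feats = []                                   # bottom-up maps F1..F4
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        fused = [self.lateral[3](feats[3])]          # top-down fusion, FPN style
        for i in (2, 1, 0):
            up = F.interpolate(fused[0], scale_factor=2, mode="nearest")
            fused.insert(0, self.lateral[i](feats[i]) + up)
        return feats, fused                          # original and fused multi-scale maps

feats, fused = TinyBackboneFPN()(torch.randn(1, 3, 512, 512))
print([tuple(f.shape) for f in fused])               # 1/2, 1/4, 1/8, 1/16 resolutions
```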
Step 3: calculating the multi-scale semantic segmentation maps;
With F1, F2, F3, F4 as input, the feature maps required for semantic segmentation are computed with several convolutional layers, for example two 3x3 convolutional layers and one 1x1 convolutional layer, and a sigmoid function then computes the probability that each pixel is text at every scale, i.e. generates the semantic segmentation maps S1, S2, S3, S4. In the training stage, the semantic segmentation loss Lseg can be computed from these maps.
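A minimal sketch of such a segmentation head is given below, assuming 64-channel input feature maps; the exact layer sizes in the patent may differ.

```python
# Assumed segmentation head: 3x3 convs plus a 1x1 conv and a sigmoid produce a
# per-pixel text-probability map S_i from one fused feature map.
import torch
import torch.nn as nn

seg_head = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, 1),
    nn.Sigmoid(),                              # probability that each pixel is text
)

fmap = torch.randn(1, 64, 128, 128)            # one fused feature map from step 2
seg_map = seg_head(fmap)                       # (1, 1, 128, 128) text-probability map S_i
print(seg_map.shape, float(seg_map.min()), float(seg_map.max()))
```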
Step 4: generating region proposals with the RPN;
For each pixel position on the feature maps of different scales, the RPN generates a large number of region proposals, i.e. preset text boxes, according to predefined hyperparameters such as base size and aspect ratio. The number of such boxes is on the order of millions. To reduce the number of preset text boxes to be predicted and improve the efficiency of the model, the model in the invention sets a probability threshold, for example 0.3, on the semantic segmentation maps S1, S2, S3, S4, filters out background pixel points (for example, points whose text probability is below 0.3), and lets the RPN generate region proposals only for the pixel points with higher text probability, obtaining the preset text box set B. Alternatively, the RPN may generate region proposals for pixel points of all text probabilities, after which a probability threshold is set, the region proposals corresponding to pixel points below the threshold are filtered out, and the remaining region proposals are recorded as the preset text box set B.
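The filtering idea can be sketched as follows; the 0.3 threshold comes from the example above, while the function name and data layout are assumptions.

```python
# Keep only pixel positions whose text probability exceeds the threshold, so the
# RPN generates proposals for a small fraction of all positions (step 4).
import numpy as np

def filter_positions(seg_map, prob_threshold=0.3):
    """seg_map: (H, W) text probabilities; returns (row, col) positions kept."""
    ys, xs = np.nonzero(seg_map > prob_threshold)
    return list(zip(ys.tolist(), xs.tolist()))

seg_map = np.random.rand(128, 128)            # stand-in for one map S_i from step 3
kept = filter_positions(seg_map)
print(f"{len(kept)} of {seg_map.size} positions kept for proposal generation")
```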
Step 5: preset text box classification and regression prediction;
For the preset text boxes of different sizes in the set B, the model first uses RoIAlign to generate a feature vector of specific length, then uses a fully connected layer for classification and regression prediction, generating for each preset text box a text score Sc and reference point offsets (Δx1, Δy1, Δx2, Δy2, …, ΔxN, ΔyN). Existing algorithms generally regress only the center point, width, height or corner points of a preset text box, so the regressed target text box is still rectangular and is not robust to the shape of the text region, especially curved text. The invention instead regresses the reference points [(x'1, y'1), (x'2, y'2), …, (x'N, y'N)] obtained by sampling the preset text box at equal distances, which can be applied to any text shape. In the training stage, the classification loss Lcls and the regression loss Lreg can be computed from the reference point annotation GTR = [(x1, y1), (x2, y2), …, (xN, yN), txt] of each text box on the training image and the prediction results Sc and (Δx1, Δy1, Δx2, Δy2, …, ΔxN, ΔyN).
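The sketch below illustrates this step with torchvision's roi_align and a small fully connected head; the number of reference points, pooled size and channel counts are assumed values, not the patented configuration.

```python
# Assumed head for step 5: RoIAlign pools each preset box to a fixed-size feature,
# then fully connected layers predict the text score Sc and the reference point offsets.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

N_REF = 14                                     # assumed number of reference points N
fmap = torch.randn(1, 64, 128, 128)            # backbone feature map (1/4 scale assumed)
boxes = torch.tensor([[0, 10.0, 20.0, 90.0, 40.0]])   # (batch_idx, x1, y1, x2, y2)

roi_feat = roi_align(fmap, boxes, output_size=(7, 7), spatial_scale=0.25)
flat = roi_feat.flatten(1)                     # (num_boxes, 64*7*7)

head = nn.Sequential(nn.Linear(64 * 7 * 7, 256), nn.ReLU())
score_fc = nn.Linear(256, 1)                   # text score Sc
offset_fc = nn.Linear(256, 2 * N_REF)          # (dx_i, dy_i) for each reference point

h = head(flat)
sc = torch.sigmoid(score_fc(h))
offsets = offset_fc(h)
print(sc.shape, offsets.shape)                 # (1, 1), (1, 28)
```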
Step 6: generating the target text boxes;
The model keeps the text regions whose text score Sc is greater than 0.5 as target text regions and computes the positions of the regressed reference points according to the formulas x_ti = x'_i + Δx_i and y_ti = y'_i + Δy_i. The reference points of each target text box are then connected in order to obtain the position of a target text region of arbitrary shape and orientation. Finally, a non-maximum suppression algorithm is used to eliminate redundant target text boxes.
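A compact sketch of this decoding step follows; applying non-maximum suppression to the axis-aligned enclosing rectangle of each polygon is an illustrative simplification, since the patent does not specify how overlap is measured for polygon boxes.

```python
# Step 6 sketch: add offsets to the sampled reference points, keep boxes with
# Sc > 0.5, connect the points into a polygon, and suppress redundant boxes.
import torch
from torchvision.ops import nms

def decode_boxes(ref_pts, offsets, scores, score_thr=0.5, iou_thr=0.5):
    """ref_pts, offsets: (K, N, 2); scores: (K,). Returns a list of (N, 2) polygons."""
    keep = scores > score_thr
    polys = ref_pts[keep] + offsets[keep]          # x_ti = x'_i + dx_i, y_ti = y'_i + dy_i
    scores = scores[keep]
    if polys.numel() == 0:
        return []
    # simplification: NMS on the enclosing rectangle of each polygon
    x1y1 = polys.min(dim=1).values
    x2y2 = polys.max(dim=1).values
    rects = torch.cat([x1y1, x2y2], dim=1)
    kept = nms(rects, scores, iou_thr)
    return [polys[i] for i in kept]

ref_pts = torch.rand(3, 14, 2) * 100
offsets = torch.randn(3, 14, 2)
scores = torch.tensor([0.9, 0.3, 0.7])
print(len(decode_boxes(ref_pts, offsets, scores)))   # boxes surviving threshold + NMS
```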
Step 7: constructing the recognizer;
To obtain a text recognizer that is highly robust to text distortion, blur and low resolution, in the training stage the text image T with annotation GTR = [(x1, y1), (x2, y2), …, (xN, yN), txt] and annotated height h is subjected to three scale transformations to obtain the transformed images T1, T2, T3 (the transformation equation is reproduced only as an image in the original publication), where f(T, hi) denotes normalizing the image T to height hi while maintaining the aspect ratio, d(.) denotes 2-fold down-sampling, u(.) denotes 2-fold up-sampling, and h1, h2, h3, thred1, thred2, thred3 are predefined values with h1 = 2*h2 = 3*h3. In addition, based on the reference points on the edges of the training sample, a thin plate spline transformation is used for distortion correction to obtain a transformed image T4 of height h1. In the training stage, T1, T2, T3 and T4 are taken as the input of the recognizer; in the testing stage, the image TP predicted in step 6 undergoes only one scale change to obtain T', which is taken as the network input: when the height hp of TP satisfies hp > thred1, T' = f(TP, h1); when thred1 ≥ hp > thred2, T' = f(TP, h2); and when hp ≤ thred2, T' = f(TP, h3).
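The test-time size normalization can be sketched as below; the concrete heights H1, H2, H3 and thresholds THRED1, THRED2 are assumed values chosen only to satisfy h1 = 2*h2 = 3*h3, and f_resize stands in for f(T, h).

```python
# Step 7 sketch: resize the detected text region TP once, to a height chosen from
# h1 > h2 > h3 according to its original height hp (aspect ratio preserved).
import torch
import torch.nn.functional as F

H1, H2, H3 = 48, 24, 16              # assumed predefined heights (h1 = 2*h2 = 3*h3)
THRED1, THRED2 = 40, 20              # assumed height thresholds

def f_resize(img: torch.Tensor, target_h: int) -> torch.Tensor:
    """f(T, h): scale a (1, C, H, W) image to height h while keeping the aspect ratio."""
    _, _, h, w = img.shape
    target_w = max(1, round(w * target_h / h))
    return F.interpolate(img, size=(target_h, target_w), mode="bilinear", align_corners=False)

def normalise_test_image(tp: torch.Tensor) -> torch.Tensor:
    hp = tp.shape[2]
    if hp > THRED1:
        return f_resize(tp, H1)          # T' = f(TP, h1)
    if hp > THRED2:
        return f_resize(tp, H2)          # T' = f(TP, h2)
    return f_resize(tp, H3)              # T' = f(TP, h3)

tp = torch.randn(1, 3, 35, 180)          # a detected text region TP of height 35
print(normalise_test_image(tp).shape)    # resized to height h2 = 24 in this case
```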
Step 8: extracting two-dimensional features of the text image;
The model uses a full convolution network such as ResNet-32 to extract two-dimensional features from T1, T2, T3, T4 (or from T' at test time), and down-samples the feature maps of different sizes by a factor of 4 or 2 according to their size so that they are mapped to the same scale space.
Step 9: extracting one-dimensional feature vectors of the text images with a group of fully connected layers;
The two-dimensional features are first converted into a one-dimensional space through a flattening operation, and one-dimensional feature vectors of the text images are then extracted with a group of fully connected layers. In the training stage the feature vectors corresponding to T1, T2, T3, T4 are v1, v2, v3, v4; in the testing stage the feature vector corresponding to T' is v'.
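Steps 8 and 9 together can be sketched as follows, with assumed channel and vector sizes: a flattening operation followed by a group of fully connected layers maps the 2D feature map of each transformed image to its 1D feature vector vi.

```python
# Assumed fully connected layer group: flatten the (C, H, W) feature map of a text
# image and project it to the 1D feature vector v_i used in steps 10 and 11.
import torch
import torch.nn as nn

C, H, W = 128, 8, 32                      # assumed common scale after down-sampling
fc_group = nn.Sequential(
    nn.Flatten(),                         # (B, C, H, W) -> (B, C*H*W)
    nn.Linear(C * H * W, 512), nn.ReLU(),
    nn.Linear(512, 256),                  # v_i: the 1D text feature vector
)

fmap = torch.randn(2, C, H, W)            # 2D features of two transformed images
v = fc_group(fmap)
print(v.shape)                            # torch.Size([2, 256])
```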
Step 10: calculating the feature similarity constraint;
Through the feature similarity constraint, the recognizer is intended to extract from the warped, low-resolution and blurred images T1, T2, T3 features similar to those of the high-resolution, distortion-corrected image T4. This constraint makes the one-dimensional features extracted in the present invention more favorable for the sequence prediction in step 11 than those of existing recognizers. After the corresponding feature vectors v1, v2, v3, v4 are obtained, the model computes the feature similarity loss Lsim with a formula that is reproduced only as an equation image in the original publication.
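Since the loss formula survives only as an image, the sketch below is an assumed instantiation of the stated intent (pulling v1, v2, v3 toward v4) using a mean squared distance; the actual patented formula may differ.

```python
# Assumed feature similarity loss: average squared distance between each of
# v1, v2, v3 and the reference feature v4 of the corrected high-resolution image.
import torch

def feature_similarity_loss(v1, v2, v3, v4):
    target = v4.detach()                  # treat v4 as the reference feature
    return sum(torch.mean((v - target) ** 2) for v in (v1, v2, v3)) / 3.0

v1, v2, v3, v4 = (torch.randn(2, 256) for _ in range(4))
print(feature_similarity_loss(v1, v2, v3, v4))
```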
Step 11: character string sequence output and prediction;
The recognizer uses a fully connected layer with an attention mechanism to infer the text content in the picture from the feature vector vi. The attention is used to emphasize features related to the text while suppressing features corresponding to the background, so as to improve the robustness of the recognizer to background noise in the text image. The attention is computed by a set of equations (reproduced only as images in the original publication) whose outputs are the attention weights and the attention-weighted feature vector; the weighted feature vector is taken as the input of the subsequent fully connected layer, whose output is a character string sequence probability of length T, where T is the predefined maximum string length. Current recognizers widely use an LSTM with an embedded attention mechanism, whose computation is complex and which cannot fully exploit global information for prediction. The invention instead predicts the character string sequence directly with a fully connected layer with a self-attention mechanism, which can effectively exploit the global features of the text image while reducing the complexity of the model. In the testing stage this layer takes v' as input and outputs the final character string prediction result. In the training stage this layer takes v1, v2, v3, v4 as input, and the prediction results are compared with the character string ground truth txt in the input image annotation to compute the character string prediction losses Lrecg1, Lrecg2, Lrecg3, Lrecg4. The total loss function consists of the semantic segmentation loss, the preset text box classification and regression losses, the feature similarity constraint loss and the character string prediction losses, combined linearly with a weight for each loss (the equation is reproduced only as an image in the original publication). With this total loss function, the text detection model and the text recognizer can be trained end to end to obtain the optimal model parameters.
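As a worked illustration of this linear combination, the sketch below assembles a total loss from the component losses; the weight names, values and grouping are assumptions, since the patent gives the equation only as an image.

```python
# Assumed total loss: weighted sum of segmentation, classification, regression,
# feature similarity and the four character string prediction losses.
import torch

def total_loss(l_seg, l_cls, l_reg, l_sim, l_recg, weights=(1.0, 1.0, 1.0, 0.5, 1.0)):
    """l_recg is the list of the four string prediction losses L_recg1..L_recg4."""
    w_seg, w_cls, w_reg, w_sim, w_recg = weights
    return (w_seg * l_seg + w_cls * l_cls + w_reg * l_reg
            + w_sim * l_sim + w_recg * sum(l_recg))

print(total_loss(torch.tensor(0.2), torch.tensor(0.3), torch.tensor(0.1),
                 torch.tensor(0.05), [torch.tensor(0.4)] * 4))
```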
It will be appreciated by those skilled in the art that these are merely examples and that many variations or modifications may be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is therefore defined by the appended claims.

Claims (7)

1. An end-to-end text detection and identification method, characterized by: filtering background pixels by utilizing a semantic segmentation result of an input text image to generate a preset text box set, classifying and performing regression prediction on a plurality of reference points on the edge of the preset text box to detect a target text box, performing feature extraction on the input text image by utilizing scale transformation and space transformation, training a recognizer by utilizing a feature similarity constraint strategy, and finally recognizing a character sequence in the target text box by utilizing the trained recognizer;
the method for training the recognizer comprises the following steps: first, for an input text image T with annotation GTR = [(x1, y1), (x2, y2), …, (xN, yN), txt] and annotated height h, three scale transformations are performed to obtain transformed images T1, T2, T3, and distortion correction is performed with a thin plate spline transformation according to the annotated reference points to obtain a transformed image T4 of height h1; a full convolution network is then used to extract two-dimensional features from the transformed images T1, T2, T3, T4, the feature maps of different scales are down-sampled by different factors according to their sizes so as to be mapped to the same scale space, the two-dimensional features are converted into a one-dimensional space through a flattening operation, and one-dimensional feature vectors are extracted with a group of fully connected layers, the corresponding feature vectors being v1, v2, v3, v4; finally, the whole network structure is trained end to end with a total loss function, namely a linear combination of the semantic segmentation loss, the preset text box classification and regression losses, the feature similarity constraint loss and the character string prediction losses, to obtain the optimal network model parameters.
2. The end-to-end text detection and identification method as claimed in claim 1, wherein the method for generating the preset text box set comprises: establishing an image library containing character sequences and normalizing each text image; then using a full convolution network and an up-sampling network to extract multi-scale feature maps with different scaling ratios from the input text image; with these as input, generating a semantic segmentation map using several convolutional layers combined with a sigmoid function; at the same time, using a region proposal network (RPN) to generate region proposals at all pixel positions of the multi-scale feature maps; then setting a probability threshold according to the semantic segmentation map, filtering out the region proposals corresponding to pixel points below the probability threshold, and recording the remaining region proposals as the preset text box set.
3. The method for end-to-end text detection and identification according to claim 2, wherein the method for generating the preset text box set comprises the steps of:
Step 1: collecting and expanding a natural scene text image data set as a training sample set, and annotating a text region R in a training image I as GTR = [(x1, y1), (x2, y2), …, (xN, yN), txt], wherein (xn, yn) are the coordinates of the n-th reference point on the edge of the text region R, N is a predefined total number of reference points, and txt is the character string content of the text region R;
Step 2: multi-scale feature extraction based on a full convolution network and an up-sampling network: after the samples are normalized, using the full convolution network to extract features from the input text image and generate U groups of feature maps F1, F2, …, FU with scaling ratios 1/2^T, 1/2^(T+1), 1/2^(T+2), …, 1/2^(T+U), and then using the up-sampling network to fuse the extracted features and generate an additional U groups of feature maps F1, F2, …, FU at the same scales;
Step 3: with the feature maps F1, F2, …, FU as input, computing the feature maps required for semantic segmentation using several convolutional layers, and then using a sigmoid function to compute the probability that each pixel is text at every scale, i.e. generating the semantic segmentation maps S1, S2, …, SU;
Step 4: using the RPN to generate region proposals at all pixel positions of the multi-scale feature maps, setting a probability threshold on the values of the semantic segmentation maps S1, S2, …, SU, filtering out the region proposals corresponding to pixel points below the probability threshold, and recording the remaining region proposals as the preset text box set B.
4. The method of end-to-end text detection and recognition as claimed in claim 1, wherein the method of detecting the target text box comprises: firstly performing feature extraction on each preset text box with the RoIAlign method to generate a feature vector of specified length; then using a fully connected layer to make a classification prediction for each preset text box and a regression prediction for the reference points [(x'1, y'1), (x'2, y'2), …, (x'N, y'N)] obtained by sampling the preset text box at equal distances, generating for each preset text box a text score Sc and reference point offsets (Δx1, Δy1, Δx2, Δy2, …, ΔxN, ΔyN); keeping the preset text boxes whose text score Sc is greater than a set score threshold, computing the positions of the regressed reference points of the preset text box according to the formulas x_ti = x'_i + Δx_i and y_ti = y'_i + Δy_i, and connecting the reference points together to generate the target text region, namely the target text box.
5. The end-to-end text detection and identification method of claim 4, wherein the method of generating the target text box comprises the steps of:
Step (1): for the preset text boxes of different sizes in the set B, firstly using the RoIAlign method to generate a feature vector of specified length, then using a fully connected layer to make a classification prediction for each preset text box and a regression prediction for the reference points [(x'1, y'1), (x'2, y'2), …, (x'N, y'N)] obtained by sampling the preset text box at equal distances, generating for each preset text box a text score Sc and reference point offsets (Δx1, Δy1, Δx2, Δy2, …, ΔxN, ΔyN);
Step (2): retaining the text regions whose text score Sc is greater than a set score threshold, computing the positions of the regressed reference points according to the formulas x_ti = x'_i + Δx_i and y_ti = y'_i + Δy_i, connecting the reference points together to generate the target text region, namely the target text box, and finally eliminating redundant target text boxes with a non-maximum suppression algorithm.
6. The end-to-end text detection and recognition method of claim 5, wherein predicting with the recognizer comprises: performing one size transformation on a detected target text box to obtain a transformed image T'; then performing two-dimensional feature extraction on T' with a full convolution network, down-sampling the feature map according to its size to map it to a specific scale space, converting the two-dimensional features into a one-dimensional space through a flattening operation, and extracting a one-dimensional feature vector with the trained group of fully connected layers, the corresponding feature vector being v'; and recognizing the character sequence in the target text box with the trained fully connected layer with attention mechanism.
7. The end-to-end text detection and recognition method of claim 6, wherein: the transformed images T1, T2, T3 are given by a scale transformation equation (reproduced only as an image in the original publication) built from f, d and u, wherein f(T, hi) denotes normalizing the input text image T to height hi while maintaining the aspect ratio, d(.) denotes 2-fold down-sampling, u(.) denotes 2-fold up-sampling, and h1, h2, h3, thred1, thred2 are predefined values with h1 = 2*h2 = 3*h3;
when the height hp of the target text box satisfies hp > thred1, T' = f(TP, h1); when thred1 ≥ hp > thred2, T' = f(TP, h2); and when hp ≤ thred2, T' = f(TP, h3);
the attention is computed by a set of equations (reproduced only as images in the original publication) whose outputs are the attention weights and the attention-weighted feature vector, representing respectively the attention magnitude and the feature vector after attention weighting;
the total loss function (also given only as an equation image) is a linear combination, with a weight for each term, of the semantic segmentation loss, the preset text box classification loss, the preset text box regression loss, the feature similarity constraint loss, and the four character string prediction losses.
CN202110344324.7A 2021-03-31 2021-03-31 End-to-end text detection and identification method Active CN112733822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110344324.7A CN112733822B (en) 2021-03-31 2021-03-31 End-to-end text detection and identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110344324.7A CN112733822B (en) 2021-03-31 2021-03-31 End-to-end text detection and identification method

Publications (2)

Publication Number Publication Date
CN112733822A CN112733822A (en) 2021-04-30
CN112733822B true CN112733822B (en) 2021-07-27

Family

ID=75596175

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110344324.7A Active CN112733822B (en) 2021-03-31 2021-03-31 End-to-end text detection and identification method

Country Status (1)

Country Link
CN (1) CN112733822B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801092B (en) * 2021-01-29 2022-07-15 重庆邮电大学 Method for detecting character elements in natural scene image
CN113205049A (en) * 2021-05-07 2021-08-03 开放智能机器(上海)有限公司 Document identification method and identification system
CN113486716B (en) * 2021-06-04 2022-06-14 电子科技大学长三角研究院(衢州) Airport scene target segmentation method and system thereof
CN113282718B (en) * 2021-07-26 2021-12-10 北京快鱼电子股份公司 Language identification method and system based on self-adaptive center anchor
CN113591719A (en) * 2021-08-02 2021-11-02 南京大学 Method and device for detecting text with any shape in natural scene and training method
CN113343958B (en) * 2021-08-06 2021-11-19 北京世纪好未来教育科技有限公司 Text recognition method, device, equipment and medium
CN113780276B (en) * 2021-09-06 2023-12-05 成都人人互娱科技有限公司 Text recognition method and system combined with text classification
CN114359932B (en) * 2022-01-11 2023-05-23 北京百度网讯科技有限公司 Text detection method, text recognition method and device
CN114067321B (en) * 2022-01-14 2022-04-08 腾讯科技(深圳)有限公司 Text detection model training method, device, equipment and storage medium
CN114882485A (en) * 2022-04-25 2022-08-09 华南理工大学 Natural scene character detection method, system and medium for slender text
CN117312928B (en) * 2023-11-28 2024-02-13 南京网眼信息技术有限公司 Method and system for identifying user equipment information based on AIGC

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape
CN110837835A (en) * 2019-10-29 2020-02-25 华中科技大学 End-to-end scene text identification method based on boundary point detection
CN111062854A (en) * 2019-12-26 2020-04-24 Oppo广东移动通信有限公司 Method, device, terminal and storage medium for detecting watermark
CN112364873A (en) * 2020-11-20 2021-02-12 深圳壹账通智能科技有限公司 Character recognition method and device for curved text image and computer equipment
CN112580656A (en) * 2021-02-23 2021-03-30 上海旻浦科技有限公司 End-to-end text detection method, system, terminal and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10354168B2 (en) * 2016-04-11 2019-07-16 A2Ia S.A.S. Systems and methods for recognizing characters in digitized documents
US10223585B2 (en) * 2017-05-08 2019-03-05 Adobe Systems Incorporated Page segmentation of vector graphics documents
CN108734169A (en) * 2018-05-21 2018-11-02 南京邮电大学 One kind being based on the improved scene text extracting method of full convolutional network
CN111553347B (en) * 2020-04-26 2023-04-18 佛山市南海区广工大数控装备协同创新研究院 Scene text detection method oriented to any angle

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape
CN110837835A (en) * 2019-10-29 2020-02-25 华中科技大学 End-to-end scene text identification method based on boundary point detection
CN111062854A (en) * 2019-12-26 2020-04-24 Oppo广东移动通信有限公司 Method, device, terminal and storage medium for detecting watermark
CN112364873A (en) * 2020-11-20 2021-02-12 深圳壹账通智能科技有限公司 Character recognition method and device for curved text image and computer equipment
CN112580656A (en) * 2021-02-23 2021-03-30 上海旻浦科技有限公司 End-to-end text detection method, system, terminal and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Cursive-Text: A Comprehensive Dataset for End-to-End Urdu Text Recognition in Natural Scene Images; Asghar Ali Chandio et al.; Data in Brief; August 2020; vol. 31; pp. 1-14 *
End-to-End Scene Text Recognition; Kai Wang et al.; IEEE International Conference on Computer Vision; 2012; pp. 1-8 *
Segmentation-free recognition of handwritten digit strings based on Mask-RCNN; Tao Zhiyong et al.; Laser & Optoelectronics Progress; July 2020; vol. 57, no. 14; section 2 *
A survey of natural scene text detection and recognition based on deep learning; Wang Jianxin et al.; Journal of Software; April 2020; vol. 31, no. 5; pp. 1465-1496 *
Text recognition in arbitrary directions based on semantic segmentation; Wang Tao et al.; Applied Science and Technology; June 2018; vol. 45, no. 3; pp. 55-60 *

Also Published As

Publication number Publication date
CN112733822A (en) 2021-04-30

Similar Documents

Publication Publication Date Title
CN112733822B (en) End-to-end text detection and identification method
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN108665481B (en) Self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion
CN109543695B (en) Population-density population counting method based on multi-scale deep learning
CN109447008B (en) Crowd analysis method based on attention mechanism and deformable convolutional neural network
CN109726657B (en) Deep learning scene text sequence recognition method
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
CN112507777A (en) Optical remote sensing image ship detection and segmentation method based on deep learning
CN111414906A (en) Data synthesis and text recognition method for paper bill picture
CN110782420A (en) Small target feature representation enhancement method based on deep learning
WO2023083280A1 (en) Scene text recognition method and device
CN106997597A (en) It is a kind of based on have supervision conspicuousness detection method for tracking target
CN107169994A (en) Correlation filtering tracking based on multi-feature fusion
CN112288772B (en) Channel attention target tracking method based on online multi-feature selection
CN112016512A (en) Remote sensing image small target detection method based on feedback type multi-scale training
CN111461039A (en) Landmark identification method based on multi-scale feature fusion
CN114330529A (en) Real-time pedestrian shielding detection method based on improved YOLOv4
CN112488128A (en) Bezier curve-based detection method for any distorted image line segment
Liu et al. Cloud detection using super pixel classification and semantic segmentation
Liu et al. SLPR: A deep learning based chinese ship license plate recognition framework
Ren et al. Research on infrared small target segmentation algorithm based on improved mask R-CNN
CN113688821A (en) OCR character recognition method based on deep learning
CN110555406B (en) Video moving target identification method based on Haar-like characteristics and CNN matching
CN110796145B (en) Multi-certificate segmentation association method and related equipment based on intelligent decision
CN110827319B (en) Improved Staple target tracking method based on local sensitive histogram

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant