CN112733822B - End-to-end text detection and identification method

Info

Publication number
CN112733822B
CN112733822B (application CN202110344324.7A)
Authority
CN
China
Prior art keywords
text
text box
image
feature
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110344324.7A
Other languages
Chinese (zh)
Other versions
CN112733822A (en)
Inventor
姜华
王晴晴
杜沁益
李蔡元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Minpu Technology Co ltd
Original Assignee
Shanghai Minpu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Minpu Technology Co ltd filed Critical Shanghai Minpu Technology Co ltd
Priority to CN202110344324.7A priority Critical patent/CN112733822B/en
Publication of CN112733822A publication Critical patent/CN112733822A/en
Application granted granted Critical
Publication of CN112733822B publication Critical patent/CN112733822B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 30/412: Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06N 3/045: Combinations of networks
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 30/153: Segmentation of character regions using recognition of characters or words

Abstract

The invention belongs to the technical field of visual recognition and discloses an end-to-end text detection and recognition method. The method filters background pixels using the semantic segmentation result of an input text image and generates a set of preset text boxes; it then classifies the preset text boxes and regresses a plurality of reference points on their edges to detect target text boxes. Features of the input text image are extracted with scale and spatial transformations, a recognizer is trained under a feature similarity constraint strategy, and the trained recognizer finally recognizes the character sequences inside the target text boxes. The method detects and recognizes text in natural scene images end to end, and improves the robustness of the model to curved and low-resolution text while preserving model efficiency, so it has both novelty and practical application value.

Description

End-to-end text detection and identification method
Technical Field
The invention relates to the technical field of visual recognition, in particular to an end-to-end text detection and recognition method.
Background
Characters play a very important role in daily life, conveying information and knowledge in the form of traffic signs, poster advertisements, product descriptions on packaging, and so on. With the popularization of devices with cameras, such as mobile phones and vehicle-mounted cameras, more and more text is collected, transmitted and stored as images. Automatically detecting and recognizing text in images has broad application prospects in intelligent transportation, image detection, scene understanding and other fields, so related research has long attracted attention in the field of computer vision.
In recent years, deep-learning network models have become the key solution for tasks in speech recognition, computer vision and natural language processing, and text detection and recognition have likewise entered the deep learning era. Existing deep-learning text detection algorithms fall into three main categories: network models based on semantic segmentation, network models based on object detection, and hybrid models. Semantic-segmentation-based models make pixel-level predictions on the text image and infer the position, shape and angle of the text box to which each pixel belongs from the prediction result. Object-detection-based models treat text as a specific kind of object and directly output target text box information by classifying and regressing a large number of preset text boxes. Although both kinds of model achieve excellent text detection performance, they have drawbacks: semantic-segmentation-based models are not end-to-end detectors and usually require a large amount of complicated post-processing to infer the target text boxes from the prediction result, while object-detection-based models easily miss text regions with large aspect ratios. Hybrid text detection models combine the strengths of the two approaches and compensate for their weaknesses by predicting pixels and preset text boxes simultaneously, which can effectively improve the detection rate.
Existing deep-learning text recognition models can be divided, according to their sequence prediction module, into recognition models based on the attention mechanism and recognition models based on connectionist temporal classification (CTC). Both use a convolutional neural network (CNN) and a long short-term memory (LSTM) network to extract features from the text image and encode feature segments; the difference is that attention-based models decode the feature sequence with an attention-based GRU or LSTM to output a character string sequence, while CTC-based models use CTC with a forward-backward algorithm to map frame-level predictions to the character string sequence. Both kinds of recognition model, however, face the following problems. First, they recognize curved text poorly and require an additional text rectification module; moreover, because the LSTM only accepts one-dimensional feature vectors as input, the two-dimensional feature map must be mapped to a one-dimensional space by flattening or pooling, which destroys the spatial and structural information of the image and degrades recognition performance. Second, they are not robust to low-resolution text images: since the resolution of natural scene text images varies widely, a low-resolution image becomes blurred after being enlarged by the scale normalization operation in the preprocessing stage, which further hurts recognition performance.
Disclosure of Invention
To solve these problems, the invention provides an end-to-end text detection and recognition method. The method first filters out most background pixels based on a semantic segmentation idea, then performs classification and regression prediction on preset text boxes for the remaining text pixels and directly outputs information such as the position and shape of the target text boxes, and finally recognizes the text with a data self-enhancing recognizer designed with a feature similarity constraint.
The invention can be realized by the following technical scheme:
an end-to-end text detection and identification method filters out background pixels using the semantic segmentation result of an input text image and generates a set of preset text boxes; it classifies the preset text boxes and regresses a plurality of reference points on their edges to detect target text boxes; it extracts features of the input text image using scale and spatial transformations and trains a recognizer with a feature similarity constraint strategy; and it finally recognizes the character sequences in the target text boxes with the trained recognizer.
Further, the method for generating the preset text box set is as follows: establish an image library containing character sequences and normalize each text image; use a full convolution network and an up-sampling network to extract multi-scale feature maps with different scaling ratios from the input text image; with these feature maps as input, generate a semantic segmentation map using several convolutional layers followed by a sigmoid function; at the same time, generate region proposals at all pixel positions of the multi-scale feature maps with a region proposal network (RPN); then set a probability threshold according to the semantic segmentation map, filter out the region proposals corresponding to pixel points below the probability threshold, and record the remaining region proposals as the preset text box set.
Further, the method for generating the preset text box set comprises the following steps:
Step 1: collecting and expanding a natural scene text image data set as a training sample set, and annotating a text region R in a training image I as GTR = [(x1, y1), (x2, y2), …, (xN, yN), txt], wherein (xn, yn) are the coordinates of the n-th reference point on the edge of the text region R, N is a predefined total number of reference points, and txt is the character string content of the text region R;
Step 2: multi-scale feature extraction based on a full convolution network and an up-sampling network: after the samples are normalized, the full convolution network is used to extract features from the input text image and generate U groups of feature maps F1, F2, …, FU with scaling ratios 1/2^T, 1/2^(T+1), 1/2^(T+2), …, 1/2^(T+U), and the up-sampling network is then used to fuse the extracted features and generate an additional U groups of feature maps F1, F2, …, FU at the same scales;
Step 3: with the feature maps F1, F2, …, FU as input, the feature maps required for semantic segmentation are computed using several convolutional layers, and a sigmoid function then computes the probability that each pixel is text at every scale, i.e. generates the semantic segmentation maps S1, S2, …, SU;
Step 4: the RPN is used to generate region proposals at all pixel positions of the multi-scale feature maps, a probability threshold is set on the values of the semantic segmentation maps S1, S2, …, SU, the region proposals corresponding to pixel points below the probability threshold are filtered out, and the remaining region proposals are recorded as the preset text box set B.
Further, the method for detecting the target text box is as follows: first perform feature extraction on each preset text box with the RoIAlign method to generate a feature vector of specified length; then use a fully connected layer to make a classification prediction for each preset text box and a regression prediction for the reference points [(x'1, y'1), (x'2, y'2), …, (x'N, y'N)] obtained by sampling the preset text box at equal distances, generating for each preset text box a text score Sc and reference point offsets (Δx1, Δy1, Δx2, Δy2, …, ΔxN, ΔyN); keep the preset text boxes whose text score Sc is greater than a set score threshold, compute the positions of the regressed reference points according to the formulas x_ti = x'_i + Δx_i and y_ti = y'_i + Δy_i, and connect the reference points together to generate the target text region, namely the target text box.
Further, the method for generating the target text box comprises the following steps:
Step (1): for the preset text boxes of different sizes in the set B, a feature vector of specified length is first generated with the RoIAlign method; a fully connected layer is then used to make a classification prediction for each preset text box and a regression prediction for the reference points [(x'1, y'1), (x'2, y'2), …, (x'N, y'N)] obtained by sampling the preset text box at equal distances, generating for each preset text box a text score Sc and reference point offsets (Δx1, Δy1, Δx2, Δy2, …, ΔxN, ΔyN);
Step (2): the text regions whose text score Sc is greater than a set score threshold are retained, the positions of the regressed reference points are computed according to the formulas x_ti = x'_i + Δx_i and y_ti = y'_i + Δy_i, the reference points are connected together to generate the target text region, namely the target text box, and redundant target text boxes are finally eliminated with a non-maximum suppression algorithm.
Further, the method for training the recognizer is as follows: first, for an input text image T with annotation GTR = [(x1, y1), (x2, y2), …, (xN, yN), txt] and annotated height h, apply three scale transformations to obtain transformed images T1, T2, T3, and apply distortion correction with a thin plate spline transformation according to the annotated reference points to obtain a transformed image T4 of height h1; then use a full convolution network to extract two-dimensional features from the transformed images T1, T2, T3, T4, down-sample the feature maps of different scales by different factors according to their sizes so that they are mapped to the same scale space, convert the two-dimensional features into a one-dimensional space through a flattening operation, and extract one-dimensional feature vectors with a group of fully connected layers, the corresponding feature vectors being v1, v2, v3, v4; with these as input, compute the feature similarity constraint loss, perform character string sequence prediction with a fully connected layer with a self-attention mechanism, and compute the character string prediction loss from the prediction result; finally, train the whole network structure end to end with a total loss function, namely a linear combination of the semantic segmentation loss, the preset text box classification and regression losses, the feature similarity constraint loss and the character string prediction losses, to obtain the optimal network model parameters.
Further, apply one size transformation to the target text box to obtain a transformed image T'; then use the full convolution network to extract two-dimensional features from T', down-sample the feature map to a specific scale space according to its size, convert the two-dimensional features into a one-dimensional space through a flattening operation, and extract a one-dimensional feature vector with the trained group of fully connected layers, the corresponding feature vector being v'; with v' as input, recognize the character sequence in the target text box with the trained fully connected layer with attention mechanism.
Further, the transformed images T1, T2, T3 are given by a scale transformation equation (reproduced only as an image in the original publication) built from f, d and u, wherein f(T, hi) denotes normalizing the input text image T to height hi while maintaining the aspect ratio, d(.) denotes 2-fold down-sampling, u(.) denotes 2-fold up-sampling, and h1, h2, h3, thred1, thred2, thred3 are predefined values with h1 = 2*h2 = 3*h3;
when the height hp of the target text box satisfies hp > thred1, T' = f(TP, h1); when thred1 ≥ hp > thred2, T' = f(TP, h2); and when hp ≤ thred2, T' = f(TP, h3);
the attention is computed by a set of equations (reproduced only as images in the original publication) whose outputs are the attention weights and the attention-weighted feature vector, representing respectively the attention magnitude and the feature vector after attention weighting;
the total loss function (also given only as an equation image) is a linear combination, with a weight for each term, of the semantic segmentation loss, the preset text box classification loss, the preset text box regression loss, the feature similarity constraint loss, and the four character string prediction losses.
The beneficial technical effects of the invention are as follows:
(1) Most background pixels are filtered out using the prediction result of the image segmentation module, which greatly reduces the number of preset text boxes to be predicted and improves the efficiency of the model.
(2) Regression prediction is performed on the reference points on the edges of the preset text boxes, which helps detect text regions of arbitrary orientation and shape.
(3) The data are augmented with scale and spatial transformations, and features with strong expressive power are extracted from the text image under a feature similarity constraint strategy, which improves the robustness of the model when recognizing curved text and low-resolution text images.
Drawings
FIG. 1 is a block diagram of an implementation of the detection and identification method of the present invention;
FIG. 2 is a schematic flow diagram of the detection and identification method of the present invention.
Detailed Description
The following detailed description of the preferred embodiments will be made with reference to the accompanying drawings.
As shown in FIG. 1 and FIG. 2, the present invention provides an end-to-end text detection and recognition method, which filters background pixels from the semantic segmentation result of an input text image to generate a preset text box set, classifies the preset text boxes and regresses a plurality of reference points on their edges to detect target text boxes, performs feature extraction on the input text image with scale and spatial transformations, trains a recognizer with a feature similarity constraint strategy, and finally recognizes the character sequences in the target text boxes with the trained recognizer. The method specifically comprises the following steps:
Step 1: collecting and expanding a natural scene text image data set as a training sample set;
Images and their annotations in public databases such as ICDAR2015, ICDAR2017 MLT, SynthText and TotalText are collected as training samples. The sample region annotations are then expanded according to the model training requirements: reference points are sampled on the boundary of each text region as a new annotation format. For a training image I, the annotation of a text region R can be represented as GTR = [(x1, y1), (x2, y2), …, (xN, yN), txt], where (xn, yn) are the coordinates of the n-th reference point, N is the predefined total number of reference points, and txt is the character string content of the text region.
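To make the expanded annotation format concrete, the following minimal Python sketch shows one way such a GTR record could be held in memory; the class and field names are illustrative assumptions and are not part of the patent.

```python
# Hypothetical container for the expanded annotation GTR of one text region:
# N sampled reference points on the region boundary plus the transcription txt.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TextRegionGT:
    points: List[Tuple[float, float]]   # [(x1, y1), ..., (xN, yN)] on the region edge
    txt: str                            # character string content of the region

gt = TextRegionGT(
    points=[(10, 5), (60, 4), (110, 8), (110, 30), (60, 27), (10, 30)],  # N = 6 here
    txt="EXIT",
)
print(len(gt.points), gt.txt)
```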
Step 2: extracting multi-scale features based on a full convolution network and an up-sampling network;
Training stage: the training samples are preprocessed by flipping, scaling, pixel normalization and so on, and 8 rectangular regions of size 512 x 512 are then randomly cropped per batch as the network input for model training.
Testing stage: with the aspect ratio maintained, the longest side of the picture is normalized to 1600 or 2400; the picture is then pixel-normalized and fed to the network one image per batch.
To ensure the robustness of the model to text size, the network first extracts features from the input picture with a full convolution network such as ResNet-50, generating feature maps F1, F2, F3, F4 at scales of 1/2, 1/4, 1/8 and 1/16. Then, to fuse higher-level and lower-level features, the network fuses the features by up-sampling with an up-sampling network such as FPN and generates another four groups of feature maps F1, F2, F3, F4 at the same scales.
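As an illustration of this step only, the sketch below builds a toy multi-scale extractor with a top-down fusion path; the channel widths, number of stages and layer choices are assumptions for readability and are much simpler than a real ResNet-50/FPN backbone.

```python
# A minimal sketch (assumed toy sizes) of bottom-up multi-scale extraction followed
# by FPN-style top-down fusion with 2x up-sampling, as described in step 2.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBackboneFPN(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        # four down-sampling stages -> feature maps at 1/2, 1/4, 1/8, 1/16 scale
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c_in, ch, 3, stride=2, padding=1), nn.ReLU())
            for c_in in (3, ch, ch, ch)
        ])
        self.lateral = nn.ModuleList([nn.Conv2d(ch, ch, 1) for _ in range(4)])

    def forward(self, x):
        feats = []                                   # bottom-up maps F1..F4
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        fused = [self.lateral[3](feats[3])]          # top-down fusion, FPN style
        for i in (2, 1, 0):
            up = F.interpolate(fused[0], scale_factor=2, mode="nearest")
            fused.insert(0, self.lateral[i](feats[i]) + up)
        return feats, fused                          # original and fused multi-scale maps

feats, fused = TinyBackboneFPN()(torch.randn(1, 3, 512, 512))
print([tuple(f.shape) for f in fused])               # 1/2, 1/4, 1/8, 1/16 resolutions
```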
Step 3: calculating the multi-scale semantic segmentation maps;
With F1, F2, F3, F4 as input, the feature maps required for semantic segmentation are computed with several convolutional layers, for example two 3x3 convolutional layers and one 1x1 convolutional layer, and a sigmoid function then computes the probability that each pixel is text at every scale, i.e. generates the semantic segmentation maps S1, S2, S3, S4. In the training stage, the semantic segmentation loss Lseg can be computed from these maps.
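A minimal sketch of such a segmentation head is given below, assuming 64-channel input feature maps; the exact layer sizes in the patent may differ.

```python
# Assumed segmentation head: 3x3 convs plus a 1x1 conv and a sigmoid produce a
# per-pixel text-probability map S_i from one fused feature map.
import torch
import torch.nn as nn

seg_head = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, 1),
    nn.Sigmoid(),                              # probability that each pixel is text
)

fmap = torch.randn(1, 64, 128, 128)            # one fused feature map from step 2
seg_map = seg_head(fmap)                       # (1, 1, 128, 128) text-probability map S_i
print(seg_map.shape, float(seg_map.min()), float(seg_map.max()))
```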
Step 4: generating region proposals with the RPN;
For each pixel position on the feature maps of different scales, the RPN generates a large number of region proposals, i.e. preset text boxes, according to predefined hyperparameters such as base size and aspect ratio. The number of such boxes is on the order of millions. To reduce the number of preset text boxes to be predicted and improve the efficiency of the model, the model in the invention sets a probability threshold, for example 0.3, on the semantic segmentation maps S1, S2, S3, S4, filters out background pixel points (for example, points whose text probability is below 0.3), and lets the RPN generate region proposals only for the pixel points with higher text probability, obtaining the preset text box set B. Alternatively, the RPN may generate region proposals for pixel points of all text probabilities, after which a probability threshold is set, the region proposals corresponding to pixel points below the threshold are filtered out, and the remaining region proposals are recorded as the preset text box set B.
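The filtering idea can be sketched as follows; the 0.3 threshold comes from the example above, while the function name and data layout are assumptions.

```python
# Keep only pixel positions whose text probability exceeds the threshold, so the
# RPN generates proposals for a small fraction of all positions (step 4).
import numpy as np

def filter_positions(seg_map, prob_threshold=0.3):
    """seg_map: (H, W) text probabilities; returns (row, col) positions kept."""
    ys, xs = np.nonzero(seg_map > prob_threshold)
    return list(zip(ys.tolist(), xs.tolist()))

seg_map = np.random.rand(128, 128)            # stand-in for one map S_i from step 3
kept = filter_positions(seg_map)
print(f"{len(kept)} of {seg_map.size} positions kept for proposal generation")
```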
Step 5: preset text box classification and regression prediction;
For the preset text boxes of different sizes in the set B, the model first uses RoIAlign to generate a feature vector of specific length, then uses a fully connected layer for classification and regression prediction, generating for each preset text box a text score Sc and reference point offsets (Δx1, Δy1, Δx2, Δy2, …, ΔxN, ΔyN). Existing algorithms generally regress only the center point, width, height or corner points of a preset text box, so the regressed target text box is still rectangular and is not robust to the shape of the text region, especially curved text. The invention instead regresses the reference points [(x'1, y'1), (x'2, y'2), …, (x'N, y'N)] obtained by sampling the preset text box at equal distances, which can be applied to any text shape. In the training stage, the classification loss Lcls and the regression loss Lreg can be computed from the reference point annotation GTR = [(x1, y1), (x2, y2), …, (xN, yN), txt] of each text box on the training image and the prediction results Sc and (Δx1, Δy1, Δx2, Δy2, …, ΔxN, ΔyN).
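The sketch below illustrates this step with torchvision's roi_align and a small fully connected head; the number of reference points, pooled size and channel counts are assumed values, not the patented configuration.

```python
# Assumed head for step 5: RoIAlign pools each preset box to a fixed-size feature,
# then fully connected layers predict the text score Sc and the reference point offsets.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

N_REF = 14                                     # assumed number of reference points N
fmap = torch.randn(1, 64, 128, 128)            # backbone feature map (1/4 scale assumed)
boxes = torch.tensor([[0, 10.0, 20.0, 90.0, 40.0]])   # (batch_idx, x1, y1, x2, y2)

roi_feat = roi_align(fmap, boxes, output_size=(7, 7), spatial_scale=0.25)
flat = roi_feat.flatten(1)                     # (num_boxes, 64*7*7)

head = nn.Sequential(nn.Linear(64 * 7 * 7, 256), nn.ReLU())
score_fc = nn.Linear(256, 1)                   # text score Sc
offset_fc = nn.Linear(256, 2 * N_REF)          # (dx_i, dy_i) for each reference point

h = head(flat)
sc = torch.sigmoid(score_fc(h))
offsets = offset_fc(h)
print(sc.shape, offsets.shape)                 # (1, 1), (1, 28)
```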
Step 6: generating the target text boxes;
The model keeps the text regions whose text score Sc is greater than 0.5 as target text regions and computes the positions of the regressed reference points according to the formulas x_ti = x'_i + Δx_i and y_ti = y'_i + Δy_i. The reference points of each target text box are then connected in order to obtain the position of a target text region of arbitrary shape and orientation. Finally, a non-maximum suppression algorithm is used to eliminate redundant target text boxes.
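A compact sketch of this decoding step follows; applying non-maximum suppression to the axis-aligned enclosing rectangle of each polygon is an illustrative simplification, since the patent does not specify how overlap is measured for polygon boxes.

```python
# Step 6 sketch: add offsets to the sampled reference points, keep boxes with
# Sc > 0.5, connect the points into a polygon, and suppress redundant boxes.
import torch
from torchvision.ops import nms

def decode_boxes(ref_pts, offsets, scores, score_thr=0.5, iou_thr=0.5):
    """ref_pts, offsets: (K, N, 2); scores: (K,). Returns a list of (N, 2) polygons."""
    keep = scores > score_thr
    polys = ref_pts[keep] + offsets[keep]          # x_ti = x'_i + dx_i, y_ti = y'_i + dy_i
    scores = scores[keep]
    if polys.numel() == 0:
        return []
    # simplification: NMS on the enclosing rectangle of each polygon
    x1y1 = polys.min(dim=1).values
    x2y2 = polys.max(dim=1).values
    rects = torch.cat([x1y1, x2y2], dim=1)
    kept = nms(rects, scores, iou_thr)
    return [polys[i] for i in kept]

ref_pts = torch.rand(3, 14, 2) * 100
offsets = torch.randn(3, 14, 2)
scores = torch.tensor([0.9, 0.3, 0.7])
print(len(decode_boxes(ref_pts, offsets, scores)))   # boxes surviving threshold + NMS
```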
Step 7: constructing the recognizer;
To obtain a text recognizer that is highly robust to text distortion, blur and low resolution, in the training stage the text image T with annotation GTR = [(x1, y1), (x2, y2), …, (xN, yN), txt] and annotated height h is subjected to three scale transformations to obtain the transformed images T1, T2, T3 (the transformation equation is reproduced only as an image in the original publication), where f(T, hi) denotes normalizing the image T to height hi while maintaining the aspect ratio, d(.) denotes 2-fold down-sampling, u(.) denotes 2-fold up-sampling, and h1, h2, h3, thred1, thred2, thred3 are predefined values with h1 = 2*h2 = 3*h3. In addition, based on the reference points on the edges of the training sample, a thin plate spline transformation is used for distortion correction to obtain a transformed image T4 of height h1. In the training stage, T1, T2, T3 and T4 are taken as the input of the recognizer; in the testing stage, the image TP predicted in step 6 undergoes only one scale change to obtain T', which is taken as the network input: when the height hp of TP satisfies hp > thred1, T' = f(TP, h1); when thred1 ≥ hp > thred2, T' = f(TP, h2); and when hp ≤ thred2, T' = f(TP, h3).
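The test-time size normalization can be sketched as below; the concrete heights H1, H2, H3 and thresholds THRED1, THRED2 are assumed values chosen only to satisfy h1 = 2*h2 = 3*h3, and f_resize stands in for f(T, h).

```python
# Step 7 sketch: resize the detected text region TP once, to a height chosen from
# h1 > h2 > h3 according to its original height hp (aspect ratio preserved).
import torch
import torch.nn.functional as F

H1, H2, H3 = 48, 24, 16              # assumed predefined heights (h1 = 2*h2 = 3*h3)
THRED1, THRED2 = 40, 20              # assumed height thresholds

def f_resize(img: torch.Tensor, target_h: int) -> torch.Tensor:
    """f(T, h): scale a (1, C, H, W) image to height h while keeping the aspect ratio."""
    _, _, h, w = img.shape
    target_w = max(1, round(w * target_h / h))
    return F.interpolate(img, size=(target_h, target_w), mode="bilinear", align_corners=False)

def normalise_test_image(tp: torch.Tensor) -> torch.Tensor:
    hp = tp.shape[2]
    if hp > THRED1:
        return f_resize(tp, H1)          # T' = f(TP, h1)
    if hp > THRED2:
        return f_resize(tp, H2)          # T' = f(TP, h2)
    return f_resize(tp, H3)              # T' = f(TP, h3)

tp = torch.randn(1, 3, 35, 180)          # a detected text region TP of height 35
print(normalise_test_image(tp).shape)    # resized to height h2 = 24 in this case
```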
Step 8: extracting two-dimensional features of the text image;
The model uses a full convolution network such as ResNet-32 to extract two-dimensional features from T1, T2, T3, T4 (or from T' at test time), and down-samples the feature maps of different sizes by a factor of 4 or 2 according to their size so that they are mapped to the same scale space.
Step 9: extracting one-dimensional feature vectors of the text images with a group of fully connected layers;
The two-dimensional features are first converted into a one-dimensional space through a flattening operation, and one-dimensional feature vectors of the text images are then extracted with a group of fully connected layers. In the training stage the feature vectors corresponding to T1, T2, T3, T4 are v1, v2, v3, v4; in the testing stage the feature vector corresponding to T' is v'.
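Steps 8 and 9 together can be sketched as follows, with assumed channel and vector sizes: a flattening operation followed by a group of fully connected layers maps the 2D feature map of each transformed image to its 1D feature vector vi.

```python
# Assumed fully connected layer group: flatten the (C, H, W) feature map of a text
# image and project it to the 1D feature vector v_i used in steps 10 and 11.
import torch
import torch.nn as nn

C, H, W = 128, 8, 32                      # assumed common scale after down-sampling
fc_group = nn.Sequential(
    nn.Flatten(),                         # (B, C, H, W) -> (B, C*H*W)
    nn.Linear(C * H * W, 512), nn.ReLU(),
    nn.Linear(512, 256),                  # v_i: the 1D text feature vector
)

fmap = torch.randn(2, C, H, W)            # 2D features of two transformed images
v = fc_group(fmap)
print(v.shape)                            # torch.Size([2, 256])
```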
Step 10: calculating the feature similarity constraint;
Through the feature similarity constraint, the recognizer is intended to extract from the warped, low-resolution and blurred images T1, T2, T3 features similar to those of the high-resolution, distortion-corrected image T4. This constraint makes the one-dimensional features extracted in the present invention more favorable for the sequence prediction in step 11 than those of existing recognizers. After the corresponding feature vectors v1, v2, v3, v4 are obtained, the model computes the feature similarity loss Lsim with a formula that is reproduced only as an equation image in the original publication.
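Since the loss formula survives only as an image, the sketch below is an assumed instantiation of the stated intent (pulling v1, v2, v3 toward v4) using a mean squared distance; the actual patented formula may differ.

```python
# Assumed feature similarity loss: average squared distance between each of
# v1, v2, v3 and the reference feature v4 of the corrected high-resolution image.
import torch

def feature_similarity_loss(v1, v2, v3, v4):
    target = v4.detach()                  # treat v4 as the reference feature
    return sum(torch.mean((v - target) ** 2) for v in (v1, v2, v3)) / 3.0

v1, v2, v3, v4 = (torch.randn(2, 256) for _ in range(4))
print(feature_similarity_loss(v1, v2, v3, v4))
```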
Step 11: character string sequence output and prediction;
The recognizer uses a fully connected layer with an attention mechanism to infer the text content in the picture from the feature vector vi. The attention is used to emphasize features related to the text while suppressing features corresponding to the background, so as to improve the robustness of the recognizer to background noise in the text image. The attention is computed by a set of equations (reproduced only as images in the original publication) whose outputs are the attention weights and the attention-weighted feature vector; the weighted feature vector is taken as the input of the subsequent fully connected layer, whose output is a character string sequence probability of length T, where T is the predefined maximum string length. Current recognizers widely use an LSTM with an embedded attention mechanism, whose computation is complex and which cannot fully exploit global information for prediction. The invention instead predicts the character string sequence directly with a fully connected layer with a self-attention mechanism, which can effectively exploit the global features of the text image while reducing the complexity of the model. In the testing stage this layer takes v' as input and outputs the final character string prediction result. In the training stage this layer takes v1, v2, v3, v4 as input, and the prediction results are compared with the character string ground truth txt in the input image annotation to compute the character string prediction losses Lrecg1, Lrecg2, Lrecg3, Lrecg4. The total loss function consists of the semantic segmentation loss, the preset text box classification and regression losses, the feature similarity constraint loss and the character string prediction losses, combined linearly with a weight for each loss (the equation is reproduced only as an image in the original publication). With this total loss function, the text detection model and the text recognizer can be trained end to end to obtain the optimal model parameters.
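As a worked illustration of this linear combination, the sketch below assembles a total loss from the component losses; the weight names, values and grouping are assumptions, since the patent gives the equation only as an image.

```python
# Assumed total loss: weighted sum of segmentation, classification, regression,
# feature similarity and the four character string prediction losses.
import torch

def total_loss(l_seg, l_cls, l_reg, l_sim, l_recg, weights=(1.0, 1.0, 1.0, 0.5, 1.0)):
    """l_recg is the list of the four string prediction losses L_recg1..L_recg4."""
    w_seg, w_cls, w_reg, w_sim, w_recg = weights
    return (w_seg * l_seg + w_cls * l_cls + w_reg * l_reg
            + w_sim * l_sim + w_recg * sum(l_recg))

print(total_loss(torch.tensor(0.2), torch.tensor(0.3), torch.tensor(0.1),
                 torch.tensor(0.05), [torch.tensor(0.4)] * 4))
```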
It will be appreciated by those skilled in the art that these are merely examples and that many variations or modifications may be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is therefore defined by the appended claims.

Claims (7)

1. An end-to-end text detection and identification method, characterized by: filtering background pixels by utilizing a semantic segmentation result of an input text image to generate a preset text box set, classifying and performing regression prediction on a plurality of reference points on the edge of the preset text box to detect a target text box, performing feature extraction on the input text image by utilizing scale transformation and space transformation, training a recognizer by utilizing a feature similarity constraint strategy, and finally recognizing a character sequence in the target text box by utilizing the trained recognizer;
the method for training the recognizer comprises the following steps: first, for an input text image T with annotation GTR = [(x1, y1), (x2, y2), …, (xN, yN), txt] and annotated height h, three scale transformations are performed to obtain transformed images T1, T2, T3, and distortion correction is performed with a thin plate spline transformation according to the annotated reference points to obtain a transformed image T4 of height h1; a full convolution network is then used to extract two-dimensional features from the transformed images T1, T2, T3, T4, the feature maps of different scales are down-sampled by different factors according to their sizes so as to be mapped to the same scale space, the two-dimensional features are converted into a one-dimensional space through a flattening operation, and one-dimensional feature vectors are extracted with a group of fully connected layers, the corresponding feature vectors being v1, v2, v3, v4; finally, the whole network structure is trained end to end with a total loss function, namely a linear combination of the semantic segmentation loss, the preset text box classification and regression losses, the feature similarity constraint loss and the character string prediction losses, to obtain the optimal network model parameters.
2. The end-to-end text detection and identification method as claimed in claim 1, wherein the method for generating the preset text box set comprises: establishing an image library containing character sequences and normalizing each text image; then using a full convolution network and an up-sampling network to extract multi-scale feature maps with different scaling ratios from the input text image; with these as input, generating a semantic segmentation map using several convolutional layers combined with a sigmoid function; at the same time, using a region proposal network (RPN) to generate region proposals at all pixel positions of the multi-scale feature maps; then setting a probability threshold according to the semantic segmentation map, filtering out the region proposals corresponding to pixel points below the probability threshold, and recording the remaining region proposals as the preset text box set.
3. The method for end-to-end text detection and identification according to claim 2, wherein the method for generating the preset text box set comprises the steps of:
Step 1: collecting and expanding a natural scene text image data set as a training sample set, and annotating a text region R in a training image I as GTR = [(x1, y1), (x2, y2), …, (xN, yN), txt], wherein (xn, yn) are the coordinates of the n-th reference point on the edge of the text region R, N is a predefined total number of reference points, and txt is the character string content of the text region R;
Step 2: multi-scale feature extraction based on a full convolution network and an up-sampling network: after the samples are normalized, using the full convolution network to extract features from the input text image and generate U groups of feature maps F1, F2, …, FU with scaling ratios 1/2^T, 1/2^(T+1), 1/2^(T+2), …, 1/2^(T+U), and then using the up-sampling network to fuse the extracted features and generate an additional U groups of feature maps F1, F2, …, FU at the same scales;
Step 3: with the feature maps F1, F2, …, FU as input, computing the feature maps required for semantic segmentation using several convolutional layers, and then using a sigmoid function to compute the probability that each pixel is text at every scale, i.e. generating the semantic segmentation maps S1, S2, …, SU;
Step 4: using the RPN to generate region proposals at all pixel positions of the multi-scale feature maps, setting a probability threshold on the values of the semantic segmentation maps S1, S2, …, SU, filtering out the region proposals corresponding to pixel points below the probability threshold, and recording the remaining region proposals as the preset text box set B.
4. The method of end-to-end text detection and recognition as claimed in claim 1, wherein the method of detecting the target text box comprises: firstly performing feature extraction on each preset text box with the RoIAlign method to generate a feature vector of specified length; then using a fully connected layer to make a classification prediction for each preset text box and a regression prediction for the reference points [(x'1, y'1), (x'2, y'2), …, (x'N, y'N)] obtained by sampling the preset text box at equal distances, generating for each preset text box a text score Sc and reference point offsets (Δx1, Δy1, Δx2, Δy2, …, ΔxN, ΔyN); keeping the preset text boxes whose text score Sc is greater than a set score threshold, computing the positions of the regressed reference points of the preset text box according to the formulas x_ti = x'_i + Δx_i and y_ti = y'_i + Δy_i, and connecting the reference points together to generate the target text region, namely the target text box.
5. The end-to-end text detection and identification method of claim 4, wherein the method of generating the target text box comprises the steps of:
Step (1): for the preset text boxes of different sizes in the set B, firstly using the RoIAlign method to generate a feature vector of specified length, then using a fully connected layer to make a classification prediction for each preset text box and a regression prediction for the reference points [(x'1, y'1), (x'2, y'2), …, (x'N, y'N)] obtained by sampling the preset text box at equal distances, generating for each preset text box a text score Sc and reference point offsets (Δx1, Δy1, Δx2, Δy2, …, ΔxN, ΔyN);
Step (2): retaining the text regions whose text score Sc is greater than a set score threshold, computing the positions of the regressed reference points according to the formulas x_ti = x'_i + Δx_i and y_ti = y'_i + Δy_i, connecting the reference points together to generate the target text region, namely the target text box, and finally eliminating redundant target text boxes with a non-maximum suppression algorithm.
6. The end-to-end text detection and recognition method of claim 5, wherein predicting with the recognizer comprises: performing one size transformation on a detected target text box to obtain a transformed image T'; then performing two-dimensional feature extraction on T' with a full convolution network, down-sampling the feature map according to its size to map it to a specific scale space, converting the two-dimensional features into a one-dimensional space through a flattening operation, and extracting a one-dimensional feature vector with the trained group of fully connected layers, the corresponding feature vector being v'; and recognizing the character sequence in the target text box with the trained fully connected layer with attention mechanism.
7. The end-to-end text detection and recognition method of claim 6, wherein: the transformed images T1, T2, T3 are given by a scale transformation equation (reproduced only as an image in the original publication) built from f, d and u, wherein f(T, hi) denotes normalizing the input text image T to height hi while maintaining the aspect ratio, d(.) denotes 2-fold down-sampling, u(.) denotes 2-fold up-sampling, and h1, h2, h3, thred1, thred2 are predefined values with h1 = 2*h2 = 3*h3;
when the height hp of the target text box satisfies hp > thred1, T' = f(TP, h1); when thred1 ≥ hp > thred2, T' = f(TP, h2); and when hp ≤ thred2, T' = f(TP, h3);
the attention is computed by a set of equations (reproduced only as images in the original publication) whose outputs are the attention weights and the attention-weighted feature vector, representing respectively the attention magnitude and the feature vector after attention weighting;
the total loss function (also given only as an equation image) is a linear combination, with a weight for each term, of the semantic segmentation loss, the preset text box classification loss, the preset text box regression loss, the feature similarity constraint loss, and the four character string prediction losses.
CN202110344324.7A 2021-03-31 2021-03-31 End-to-end text detection and identification method Active CN112733822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110344324.7A CN112733822B (en) 2021-03-31 2021-03-31 End-to-end text detection and identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110344324.7A CN112733822B (en) 2021-03-31 2021-03-31 End-to-end text detection and identification method

Publications (2)

Publication Number Publication Date
CN112733822A CN112733822A (en) 2021-04-30
CN112733822B true CN112733822B (en) 2021-07-27

Family

ID=75596175

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110344324.7A Active CN112733822B (en) 2021-03-31 2021-03-31 End-to-end text detection and identification method

Country Status (1)

Country Link
CN (1) CN112733822B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801092B (en) * 2021-01-29 2022-07-15 重庆邮电大学 Method for detecting character elements in natural scene image
CN113205049A (en) * 2021-05-07 2021-08-03 开放智能机器(上海)有限公司 Document identification method and identification system
CN113486716B (en) * 2021-06-04 2022-06-14 电子科技大学长三角研究院(衢州) Airport scene target segmentation method and system thereof
CN113282718B (en) * 2021-07-26 2021-12-10 北京快鱼电子股份公司 Language identification method and system based on self-adaptive center anchor
CN113591719A (en) * 2021-08-02 2021-11-02 南京大学 Method and device for detecting text with any shape in natural scene and training method
CN113343958B (en) * 2021-08-06 2021-11-19 北京世纪好未来教育科技有限公司 Text recognition method, device, equipment and medium
CN113780276B (en) * 2021-09-06 2023-12-05 成都人人互娱科技有限公司 Text recognition method and system combined with text classification
CN114359932B (en) * 2022-01-11 2023-05-23 北京百度网讯科技有限公司 Text detection method, text recognition method and device
CN114067321B (en) * 2022-01-14 2022-04-08 腾讯科技(深圳)有限公司 Text detection model training method, device, equipment and storage medium
CN114882485A (en) * 2022-04-25 2022-08-09 华南理工大学 Natural scene character detection method, system and medium for slender text
CN117312928B (en) * 2023-11-28 2024-02-13 南京网眼信息技术有限公司 Method and system for identifying user equipment information based on AIGC

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape
CN110837835A (en) * 2019-10-29 2020-02-25 华中科技大学 End-to-end scene text identification method based on boundary point detection
CN111062854A (en) * 2019-12-26 2020-04-24 Oppo广东移动通信有限公司 Method, device, terminal and storage medium for detecting watermark
CN112364873A (en) * 2020-11-20 2021-02-12 深圳壹账通智能科技有限公司 Character recognition method and device for curved text image and computer equipment
CN112580656A (en) * 2021-02-23 2021-03-30 上海旻浦科技有限公司 End-to-end text detection method, system, terminal and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10354168B2 (en) * 2016-04-11 2019-07-16 A2Ia S.A.S. Systems and methods for recognizing characters in digitized documents
US10223585B2 (en) * 2017-05-08 2019-03-05 Adobe Systems Incorporated Page segmentation of vector graphics documents
CN108734169A (en) * 2018-05-21 2018-11-02 南京邮电大学 One kind being based on the improved scene text extracting method of full convolutional network
CN111553347B (en) * 2020-04-26 2023-04-18 佛山市南海区广工大数控装备协同创新研究院 Scene text detection method oriented to any angle

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape
CN110837835A (en) * 2019-10-29 2020-02-25 华中科技大学 End-to-end scene text identification method based on boundary point detection
CN111062854A (en) * 2019-12-26 2020-04-24 Oppo广东移动通信有限公司 Method, device, terminal and storage medium for detecting watermark
CN112364873A (en) * 2020-11-20 2021-02-12 深圳壹账通智能科技有限公司 Character recognition method and device for curved text image and computer equipment
CN112580656A (en) * 2021-02-23 2021-03-30 上海旻浦科技有限公司 End-to-end text detection method, system, terminal and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Cursive-Text: A Comprehensive Dataset for End-to-End Urdu Text Recognition in Natural Scene Images; Asghar Ali Chandio et al.; Data in Brief; August 2020; vol. 31; pp. 1-14 *
End-to-End Scene Text Recognition; Kai Wang et al.; IEEE International Conference on Computer Vision; 2012; pp. 1-8 *
Segmentation-free recognition of handwritten digit strings based on Mask-RCNN; Tao Zhiyong et al.; Laser & Optoelectronics Progress; July 2020; vol. 57, no. 14; section 2 *
A survey of natural scene text detection and recognition based on deep learning; Wang Jianxin et al.; Journal of Software; April 2020; vol. 31, no. 5; pp. 1465-1496 *
Text recognition in arbitrary directions based on semantic segmentation; Wang Tao et al.; Applied Science and Technology; June 2018; vol. 45, no. 3; pp. 55-60 *

Also Published As

Publication number Publication date
CN112733822A (en) 2021-04-30

Similar Documents

Publication Publication Date Title
CN112733822B (en) End-to-end text detection and identification method
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN108665481B (en) Self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion
CN109543695B (en) Population-density population counting method based on multi-scale deep learning
CN109447008B (en) Crowd analysis method based on attention mechanism and deformable convolutional neural network
CN109726657B (en) Deep learning scene text sequence recognition method
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
CN112507777A (en) Optical remote sensing image ship detection and segmentation method based on deep learning
CN111414906A (en) Data synthesis and text recognition method for paper bill picture
CN110782420A (en) Small target feature representation enhancement method based on deep learning
WO2023083280A1 (en) Scene text recognition method and device
CN106997597A (en) It is a kind of based on have supervision conspicuousness detection method for tracking target
CN107169994A (en) Correlation filtering tracking based on multi-feature fusion
CN112288772B (en) Channel attention target tracking method based on online multi-feature selection
CN112016512A (en) Remote sensing image small target detection method based on feedback type multi-scale training
CN111461039A (en) Landmark identification method based on multi-scale feature fusion
CN114330529A (en) Real-time pedestrian shielding detection method based on improved YOLOv4
CN112488128A (en) Bezier curve-based detection method for any distorted image line segment
Liu et al. Cloud detection using super pixel classification and semantic segmentation
Liu et al. SLPR: A deep learning based chinese ship license plate recognition framework
Ren et al. Research on infrared small target segmentation algorithm based on improved mask R-CNN
CN113688821A (en) OCR character recognition method based on deep learning
CN110555406B (en) Video moving target identification method based on Haar-like characteristics and CNN matching
CN110796145B (en) Multi-certificate segmentation association method and related equipment based on intelligent decision
CN110827319B (en) Improved Staple target tracking method based on local sensitive histogram

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant