CN110837835A - End-to-end scene text identification method based on boundary point detection

Publication number: CN110837835A
Application number: CN201911038568.1A
Authority: CN (China)
Prior art keywords: text, network, rpn, multidirectional, boundary point
Legal status: Granted
Other languages: Chinese (zh)
Other versions: CN110837835B (en)
Inventor
刘文予
白翔
许永超
王豪
卢普
张辉
杨明锟
何梦超
王永攀
Current Assignee: Huazhong University of Science and Technology
Original Assignee: Huazhong University of Science and Technology
Application filed by Huazhong University of Science and Technology
Priority to CN201911038568.1A
Publication of CN110837835A; application granted; publication of CN110837835B
Legal status: Active

Classifications

    • G06V30/153: Segmentation of character regions using recognition of characters or words
    • G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. edges, contours, loops, corners, strokes or intersections
    • G06V30/10: Character recognition


Abstract

The invention discloses an end-to-end scene text recognition method based on boundary point detection. Text features are extracted by a feature pyramid network and used by a region extraction network to generate candidate text boxes; a multidirectional rectangle detection network then detects a more accurate multidirectional bounding box for each text instance; next, the sequences of upper and lower boundary points of the text inside the multidirectional bounding box are detected; finally, the detected boundary point sequences are used to transform text of arbitrary shape into horizontal text for a subsequent attention-based sequence recognition network, and a beam search algorithm finds the best matching word for the predicted sequence in a given dictionary to obtain the final text recognition result. The method can simultaneously detect and recognize scene text of arbitrary shape in natural images, including horizontal, multidirectional and curved text, without character-level annotation, and can be trained entirely end to end.

Description

End-to-end scene text identification method based on boundary point detection
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a scene text end-to-end identification method based on boundary point detection.
Background
Scene text detection and recognition is a very active and challenging research direction in the field of computer vision, and many practical applications depend on it, such as network information security monitoring systems, intelligent transportation systems, and assistance for the visually impaired.
In most prior work, scene text detection and recognition are treated as two separate processes: first, a trained detector locates character regions in a natural scene picture; second, the detected character regions are fed into a recognition module to obtain the character content. The detection and recognition tasks are highly correlated and complementary: on the one hand, the quality of the detection step determines the accuracy of recognition; on the other hand, the recognition result can also provide feedback for detection. Processing them separately may therefore lead to sub-optimal performance for both detection and recognition.
Recently, various methods have provided end-to-end recognition solutions, and they can be roughly divided into two types. The first type follows a similar processing flow: a text instance is first represented as a horizontal or multidirectional bounding box, the text bounding box is detected with a detection network, and the text image or feature is then cropped from the image or feature map according to the detected bounding box and recognized by a subsequent text recognition network. Because text instances are described as horizontal or multidirectional bounding boxes, such schemes have difficulty handling arbitrarily shaped text. The second type consists of a text detector based on instance segmentation and a text recognizer based on character segmentation: text of arbitrary shape is detected by segmenting the instance text region, and the text is recognized by semantic segmentation in two-dimensional space, so that irregular text instances can be recognized. However, such methods require character-level annotation, and the recognition network cannot model character sequence information. An economical and efficient end-to-end recognition method is therefore needed to handle scene text of arbitrary shape.
Disclosure of Invention
The invention aims to provide an end-to-end scene text recognition method based on boundary point detection, which consists of a text detector based on boundary point detection and a text recognizer based on attention-driven sequence recognition. Text of arbitrary shape is detected by detecting the boundary points of text instances; according to the detected boundary points of a text instance, the arbitrarily shaped text is rectified into horizontal text with a thin plate spline interpolation algorithm; irregular text instances are then recognized by feeding the rectified text to the attention-based sequence recognition text recognizer. The method can detect and recognize text instances of arbitrary shape and can be trained entirely end to end.
In order to achieve the above object, the present invention provides an end-to-end recognition method for scene texts with arbitrary shapes, comprising the following steps:
(1) training a scene text end-to-end recognition network model based on boundary point detection, comprising the following sub-steps:
(1.1) carrying out word-level labeling on texts in any shapes of all pictures in an original data set, wherein labels are the clockwise vertex coordinates of polygons of text bounding boxes in word level and word character sequences of the texts, and obtaining a standard training data set with labels;
and (1.2) defining a scene text end-to-end identification network model based on boundary point detection, wherein the scene text end-to-end identification network model based on boundary point detection is composed of a characteristic pyramid structure network, a region extraction network, a multi-direction rectangular detection network, a boundary point detection network and a sequence identification network based on an attention mechanism. Calculating a training label according to the standard training data set with the label in the step (1.1), designing a loss function, and training the scene text end-to-end recognition network based on the boundary point detection by using a reverse conduction method to obtain a scene text end-to-end recognition network model based on the boundary point detection; the method specifically comprises the following substeps:
(1.2.1) constructing a scene text end-to-end identification network model based on boundary point detection, wherein the identification network model consists of a characteristic pyramid structure network, a region extraction network, a multidirectional rectangular detection network, a boundary point detection network and a sequence identification network based on an attention mechanism; the characteristic pyramid structure network is formed by adding a bottom-up connection, a top-down connection and a transverse connection by taking a ResNet-50 deep convolution neural network as a basic network, and is used for extracting and fusing characteristics with different resolutions from an input standard data set picture; inputting the extracted features of different scales into a region extraction network to obtain a candidate text region, and after the alignment operation of the region of interest, obtaining the candidate text region of a fixed scale; inputting a candidate text region with the resolution of 7 multiplied by 7 extracted by a region extraction network into a rapid region classification regression network, predicting the probability that the input candidate text region is a positive sample through classification branches, providing a more accurate candidate text region, calculating the offset of the candidate text region relative to a real text region through regression branches, and adjusting the position of the candidate text region; the multidirectional rectangle detection network is composed of 3 full-connection layers FC1, FC2 and FC3, and outputs a prediction vector with dimension 5, which respectively represents the offset of the center of a candidate text region from the center of a minimum circumscribed rectangle, the width and height of the minimum circumscribed rectangle and the rotation angle of the minimum circumscribed rectangle. The boundary point detection network is composed of 4 convolutional layers Conv1, Conv2, Conv3 and Conv4 and a full-connection layer, and outputs a vector with dimension of 28, wherein the vector respectively represents the offset of 7 boundary points of the upper boundary and the lower boundary of the text example; the attention-based sequence recognition network is composed of three convolutional layers and an attention-based model, and the attention model outputs probability distribution of predicted characters at each step.
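As an illustration of the two prediction heads just described, the following is a minimal PyTorch-style sketch: the multidirectional rectangle head uses three fully connected layers to output 5 values, and the boundary point head uses four convolutional layers plus one fully connected layer to output 28 values (2 offsets for each of 14 boundary points). The hidden widths and channel counts are assumptions made only for illustration; the patent text does not specify them.

```python
import torch
import torch.nn as nn

class OrientedRectHead(nn.Module):
    """3 FC layers -> 5 values: (dx, dy, dh, dw, theta) of the minimum rotated rect."""
    def __init__(self, in_dim=256 * 7 * 7, hidden=1024):  # sizes are assumptions
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, 5)

    def forward(self, roi_feat):                 # roi_feat: (N, 256, 7, 7)
        x = roi_feat.flatten(1)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)                       # (N, 5)

class BoundaryPointHead(nn.Module):
    """4 conv layers + 1 FC layer -> 28 values: (dx_i, dy_i) for 2*K = 14 points."""
    def __init__(self, in_ch=256, num_points=14):
        super().__init__()
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU())
            for _ in range(4)
        ])
        self.fc = nn.Linear(in_ch * 7 * 7, 2 * num_points)

    def forward(self, roi_feat):                 # roi_feat: (N, 256, 7, 7)
        x = self.convs(roi_feat)
        return self.fc(x.flatten(1))             # (N, 28)

head = OrientedRectHead()
bp_head = BoundaryPointHead()
roi = torch.randn(2, 256, 7, 7)
print(head(roi).shape, bp_head(roi).shape)   # torch.Size([2, 5]) torch.Size([2, 28])
```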
(1.2.2) generating a horizontal initial bounding box, a multidirectional rectangular bounding box and the upper and lower boundary points of each text instance on the original image according to the labeled standard training set and the feature maps, and providing training labels for the region extraction network, the multidirectional rectangle detection network and the boundary point detection network respectively: for the labeled standard training set I_tr, the ground-truth label of an input picture I_tri contains polygons P = {P_1, P_2 … P_m} representing the text regions and character strings S = {s_1, s_2 … s_m} representing the text content, where P_i is the polygonal bounding box of a text region in picture I_tri, p_ij = (x_ij, y_ij) is the coordinate of the j-th vertex of polygon P_i, m is the number of polygonal text label boxes, and s_i is the text content of polygon P_i.
For a given standard dataset I_tr, the polygon P = {p_1, p_2 … p_m} in the dataset label is first converted into the smallest horizontal rectangular bounding box of the polygonal text label box, denoted G_d(x, y, h, w) and represented by the rectangle's center point (x, y), height h and width w. For the region extraction network, according to the labeled bounding boxes G_d(x, y, h, w), each pixel of each feature map output by the feature pyramid is mapped back onto the original image, and a number of initial bounding boxes are generated from the candidate text regions predicted by the region extraction network. The Jaccard coefficient of each initial bounding box Q_0 with respect to the labeled bounding boxes G_d is then computed. If the Jaccard coefficients between Q_0 and all labeled bounding boxes G_d are less than 0.5, the initial bounding box Q_0 is labeled as negative (non-text) and the class label P_rpn takes the value 0; otherwise, i.e. there is at least one labeled bounding box G_d whose Jaccard coefficient with Q_0 is not less than 0.5, Q_0 is labeled as positive (text), the class label P_rpn takes the value 1, and the position offsets are computed relative to the labeled box with the maximum Jaccard coefficient according to the following formulas:

x = x_0 + w_0·Δx
y = y_0 + h_0·Δy
w = w_0·exp(Δw)
h = h_0·exp(Δh)

where x_0, y_0 are the abscissa and ordinate of the center point of the initial bounding box Q_0, w_0, h_0 are the width and height of Q_0, Δx, Δy are the horizontal and vertical offsets of the center point of Q_0 relative to the center point of G_d, and exp is the exponential function. The training label of the region extraction network is then:

gt_rpn = (Δx_rpn, Δy_rpn, Δh_rpn, Δw_rpn, P_rpn)
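To make the label computation concrete, the sketch below inverts the decode formulas above to obtain the regression offsets for one matched pair of boxes; the inversion is an assumption that is consistent with the stated formulas.

```python
import math

def encode_rpn_target(q0, gd):
    """q0, gd: horizontal boxes as (cx, cy, w, h).
    Returns (dx, dy, dw, dh) such that the decode formulas
    x = x0 + w0*dx, y = y0 + h0*dy, w = w0*exp(dw), h = h0*exp(dh)
    recover gd from q0."""
    x0, y0, w0, h0 = q0
    x, y, w, h = gd
    dx = (x - x0) / w0
    dy = (y - y0) / h0
    dw = math.log(w / w0)
    dh = math.log(h / h0)
    return dx, dy, dw, dh

# Example: an initial box centered at (100, 50) with size 64x32 matched to a
# labeled box centered at (108, 54) with size 80x40.
print(encode_rpn_target((100, 50, 64, 32), (108, 54, 80, 40)))
```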
For the multidirectional rectangle detection network, the polygon P = {p_1, p_2 … p_m} in the dataset label is first converted into the minimum multidirectional rectangular bounding box of the polygonal text label box, denoted G_rotate(x, y, h, w, θ) and represented by the rectangle's center point (x, y), height h, width w and rotation angle θ. Let the candidate text region refined by the region extraction network be G_rpn(x_rpn, y_rpn, w_rpn, h_rpn). The predicted position offsets are computed from the following formulas:

x = x_rpn + w_rpn·Δx_or
y = y_rpn + h_rpn·Δy_or
w = w_rpn·exp(Δw_or)
h = h_rpn·exp(Δh_or)

From these formulas, the training label of the multidirectional rectangle detection network is obtained as:

gt_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ)
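One practical way to obtain the minimum multidirectional rectangle G_rotate from a labeled polygon is OpenCV's minAreaRect, sketched below purely for illustration (OpenCV's angle convention differs between versions and may need remapping to the θ convention used here).

```python
import numpy as np
import cv2

def polygon_to_rotated_rect(polygon_pts):
    """polygon_pts: list of (x, y) vertices of the labeled text polygon.
    Returns (cx, cy, w, h, theta) of the minimum-area rotated rectangle."""
    pts = np.asarray(polygon_pts, dtype=np.float32)
    (cx, cy), (w, h), theta = cv2.minAreaRect(pts)
    return cx, cy, w, h, theta

# Example with a slightly slanted quadrilateral label box.
poly = [(10, 20), (110, 30), (108, 60), (8, 50)]
print(polygon_to_rotated_rect(poly))
```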
For the boundary point detection network, the training labels are computed as follows:

a. Setting default boundary points: based on the detected multidirectional rectangular bounding box G_rotate(x, y, h, w, θ), the rectangle is rotated clockwise by θ degrees to obtain a horizontal bounding box G_horizon(x, y, h, w), and K boundary points are sampled at equal intervals on each long side of the horizontal bounding box, giving an upper and a lower default boundary point sequence P_du = {p_1, p_2 … p_K} and P_dd = {p_1, p_2 … p_K}, with P_d = P_du ∪ P_dd.

b. Generating target boundary points:
a) The polygon P is first split along its long sides into two point sets P_1 = {p_1, p_2 … p_l} and P_2 = {p_l+1, …, p_m}, where p denotes a vertex of the polygon.
b) From P_1 and P_2, the boundary points of the upper and lower boundaries are generated: P_tu = {p_1, p_2 … p_K} and P_td = {p_1, p_2 … p_K}, with P_t = P_tu ∪ P_td.

c. The training label gt_bp = {(Δx_i, Δy_i) | i ∈ [0, 2K)} is computed as

Δx_i = x_i^t - x_i^d,  Δy_i = y_i^t - y_i^d

where (x_i^t, y_i^t) and (x_i^d, y_i^d) denote the coordinates of the i-th target boundary point and the i-th default boundary point respectively.

For the attention-based sequence recognition network, each text instance in the input image is annotated with a corresponding character string of length n, s_i = (c_0, c_1, …, c_(n-1)) with c_i ∈ {0, 1, …, 9, a, b, …, z, A, B, …, Z}, describing the text content. The training label of the recognition network is gt_recog = (onehot(c_0), onehot(c_1), …, onehot(c_(n-1))), where onehot(c_i) denotes converting character c_i into one-hot form. Combining the above, the final training label is gt = {gt_rpn, gt_or, gt_bp, gt_recog}.
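The sketch below illustrates steps a to c for the boundary point labels: K default points sampled at equal intervals on each long side of the horizontal box, and offsets computed between target and default points. The plain-difference form of the offsets and the NumPy layout are assumptions made only to be consistent with the notation above.

```python
import numpy as np

def default_boundary_points(x, y, w, h, K=7):
    """Equally spaced default points on the top and bottom edges of the
    horizontal box G_horizon given by center (x, y), width w, height h."""
    xs = np.linspace(x - w / 2, x + w / 2, K)
    top = np.stack([xs, np.full(K, y - h / 2)], axis=1)     # P_du
    bottom = np.stack([xs, np.full(K, y + h / 2)], axis=1)  # P_dd
    return np.concatenate([top, bottom], axis=0)            # P_d, shape (2K, 2)

def boundary_point_targets(target_pts, default_pts):
    """gt_bp = {(dx_i, dy_i)}: offsets of target points w.r.t. default points."""
    return (np.asarray(target_pts) - np.asarray(default_pts)).reshape(-1)

P_d = default_boundary_points(x=64, y=16, w=128, h=32, K=7)
P_t = P_d + np.random.uniform(-3, 3, P_d.shape)    # stand-in for real target points
print(boundary_point_targets(P_t, P_d).shape)       # (28,)
```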
(1.2.3) Taking the standard training dataset I_tr as the input of the recognition network model, features are extracted with the feature pyramid network module: the images of the standard training dataset I_tr are fed into the bottom-up ResNet-50 structure of the feature pyramid network, the convolutional layer units that do not change the feature map size are defined as one level (levels {P2, P3, P4, P5, P6}), and the final output convolutional feature F of each level is extracted. The top-down connections of the feature pyramid module upsample the output convolutional features of ResNet-50 to generate multi-scale upsampled features, and the lateral connections fuse the feature of each level upsampled in the top-down pass with the feature generated in the bottom-up pass to produce the final features {F2, F3, F4, F5, F6}, as shown in fig. 3.
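A compact sketch of the lateral and top-down fusion just described is given below; the channel counts and the use of nearest-neighbour upsampling are assumptions, and the backbone features C2 to C5 would come from the ResNet-50 stages.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniFPN(nn.Module):
    """Top-down + lateral fusion over backbone features, as in a feature pyramid."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):                 # feats: [C2, C3, C4, C5], high to low res
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):   # top-down pathway
            up = F.interpolate(laterals[i + 1], size=laterals[i].shape[-2:],
                               mode="nearest")
            laterals[i] = laterals[i] + up
        outs = [s(l) for s, l in zip(self.smooth, laterals)]          # F2..F5
        outs.append(F.max_pool2d(outs[-1], kernel_size=1, stride=2))  # extra F6 level
        return outs

fpn = MiniFPN()
feats = [torch.randn(1, c, s, s) for c, s in zip((256, 512, 1024, 2048),
                                                 (64, 32, 16, 8))]
print([o.shape for o in fpn(feats)])
```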
(1.2.4) The features extracted by the feature pyramid network are fed into the region extraction network, anchors are assigned, the feature maps are adjusted with the region-of-interest alignment method, and candidate text boxes are generated: for an input picture I_trk, the 5 stage features {F2, F3, F4, F5, F6} are extracted through the feature pyramid network, and anchors are defined per stage {P2, P3, P4, P5, P6} with feature scales of 32², 64², 128², 256², 512² for the different stages and 5 aspect ratios {1:5, 1:2, 1:1, 2:1, 5:1} at each scale; in this way 25 candidate text boxes of different scales and aspect ratios {F_tr1, F_tr2, …, F_tr25}, denoted F_trp with subscript p = 1, …, 25, can be extracted. In the region extraction network, the probability P_rpn that each candidate text box is a correct text region bounding box is predicted by the classification branch, and the candidate text box offsets are predicted by the regression branch:

Y_rpn = (Δx_rpn, Δy_rpn, Δh_rpn, Δw_rpn)

The candidate text boxes predicted as correct text region bounding boxes are selected and fed to the subsequent multidirectional rectangle detection network, boundary point detection network and attention-based sequence recognition network. The correct text regions selected by the region extraction network are converted into candidate text regions of fixed 7 × 7 scale by the region-of-interest alignment operation, and the multidirectional rectangle prediction network predicts the multidirectional bounding box of the text instance within each fixed-scale candidate text region. Specifically, the multidirectional rectangle prediction network predicts Y_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ), comprising 4 predicted offsets and one predicted angle; by computing the loss function and back-propagating, the network finally learns to predict the multidirectional bounding box of the text instance.
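To make the anchor configuration concrete, the sketch below enumerates the 25 scale and aspect-ratio combinations (one anchor area per pyramid level, five ratios per level) and converts them to anchor widths and heights; whether a ratio is read as width:height or height:width is an assumption.

```python
import math

AREAS = [32**2, 64**2, 128**2, 256**2, 512**2]   # one anchor area per level P2..P6
RATIOS = [1/5, 1/2, 1, 2, 5]                     # read here as width:height (assumed)

def anchor_shapes():
    """Yield (level, width, height) for the 25 anchor shapes."""
    for level, area in zip(("P2", "P3", "P4", "P5", "P6"), AREAS):
        for ratio in RATIOS:
            w = math.sqrt(area * ratio)
            h = math.sqrt(area / ratio)
            yield level, w, h

for level, w, h in anchor_shapes():
    print(f"{level}: {w:.1f} x {h:.1f}")
```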
(1.2.5) After the multidirectional bounding box of each text instance has been predicted by the multidirectional rectangle prediction network, a candidate text region of fixed 7 × 7 scale is generated by a rotated region-of-interest alignment operation. The boundary point prediction network outputs 28 predicted regression offsets Y_bp = {(Δx_i, Δy_i) | i ∈ [0, 14)}; by computing the loss function and back-propagating, the network finally learns to predict the boundary points of the text instance.
(1.2.6) After the boundary points of each text instance have been predicted by the boundary point prediction network, a sampling grid is generated by the thin plate spline interpolation algorithm, and the text features of arbitrary shape are rectified into a horizontal feature map of fixed 16 × 64 scale. This feature map is fed into the attention-based sequence recognition network to predict the text content. The recognition network consists of 3 convolutional layers and an RNN whose basic unit is the GRU. After the 3 convolutional layers the text feature resolution is 2 × 32; at each step the RNN model outputs a probability distribution of dimension 63 (62 characters plus a stop character), each dimension taking a value in [0, 1] and the values summing to 1. The predicted probability distributions P_recog of all steps are combined with the beam search algorithm to predict the character sequence S_q.
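The sketch below illustrates the rectification step in a simplified form: a sampling grid is interpolated between corresponding upper and lower boundary points and the feature is resampled into a horizontal 16 × 64 map with grid_sample. A real thin plate spline fits a smooth warp through the control points; the linear interpolation used here is only a stand-in for it.

```python
import torch
import torch.nn.functional as F

def rectify_by_boundary_points(feat, top_pts, bottom_pts, out_h=16, out_w=64):
    """feat: (1, C, H, W) feature map; top_pts/bottom_pts: (K, 2) boundary points
    in normalized coordinates [-1, 1]. Returns a horizontal (1, C, out_h, out_w) map."""
    K = top_pts.shape[0]
    # Interpolate the K control points to out_w columns along the text direction.
    t = torch.linspace(0, K - 1, out_w)
    i0 = t.floor().long().clamp(max=K - 2)
    frac = (t - i0.float()).unsqueeze(1)
    top = top_pts[i0] * (1 - frac) + top_pts[i0 + 1] * frac        # (out_w, 2)
    bot = bottom_pts[i0] * (1 - frac) + bottom_pts[i0 + 1] * frac  # (out_w, 2)
    # Blend vertically from the upper boundary to the lower boundary.
    v = torch.linspace(0, 1, out_h).view(out_h, 1, 1)
    grid = top.unsqueeze(0) * (1 - v) + bot.unsqueeze(0) * v       # (out_h, out_w, 2)
    return F.grid_sample(feat, grid.unsqueeze(0), align_corners=True)

feat = torch.randn(1, 256, 32, 32)
top = torch.stack([torch.linspace(-0.9, 0.9, 7), torch.full((7,), -0.5)], dim=1)
bot = torch.stack([torch.linspace(-0.9, 0.9, 7), torch.full((7,), 0.5)], dim=1)
print(rectify_by_boundary_points(feat, top, bot).shape)   # (1, 256, 16, 64)
```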
(1.2.7) Taking the training label gt computed in step (1.2.2) as the expected output of the network and the prediction labels from steps (1.2.4), (1.2.5) and (1.2.6) as the network prediction output, an objective loss function between the expected output and the prediction output is designed for the network model constructed in (1.2.1). The overall objective loss function is composed of the losses of the region extraction network, the multidirectional rectangle prediction network, the boundary point prediction network and the sequence recognition network, and is expressed as:

L(P_rpn, Y_rpn, Y_or, Y_bp, P_recog) = L_rpn(P_rpn, Y_rpn) + α1·L_or(Y_or) + α2·L_bp(Y_bp) + α3·L_recog(P_recog)

where L_rpn(P_rpn, Y_rpn) is the loss function of the region extraction network, L_or(Y_or) is the loss function of the multidirectional rectangle detection network, L_bp(Y_bp) is the loss function of the boundary point detection network, and L_recog(P_recog) is the loss function of the sequence recognition network; α1, α2 and α3 are the weight coefficients of L_or, L_bp and L_recog respectively, and are all simply set to 1.

According to the designed overall objective loss function, the model is trained iteratively with the back-propagation algorithm to minimize the overall objective loss and obtain the optimal network model. For the scene text detection and recognition task, training first runs on the synthetic text dataset (SynthText) to obtain initial network parameters, and then continues on real datasets to fine-tune the network parameters.
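In a training step the overall objective is simply the weighted sum given above; a minimal sketch with placeholder per-branch loss values:

```python
def total_loss(l_rpn, l_or, l_bp, l_recog, a1=1.0, a2=1.0, a3=1.0):
    """Overall objective L = L_rpn + a1*L_or + a2*L_bp + a3*L_recog, all weights set to 1."""
    return l_rpn + a1 * l_or + a2 * l_bp + a3 * l_recog

# Placeholder per-branch loss values from one batch.
print(total_loss(0.42, 0.17, 0.25, 0.88))
```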
(2) The character recognition is carried out on the text picture to be recognized by utilizing the trained model, and the character recognition method comprises the following substeps:
(2.1) The features extracted from the scene text picture to be detected and recognized are fed in turn into the region extraction network and the multidirectional rectangle detection network to generate multidirectional candidate text regions, and non-maximum suppression is applied to filter them and obtain more accurate multidirectional candidate text regions: a picture I_tstk of the dataset I_tst to be detected is fed into the model trained in step (1.2); after the feature pyramid network and the region extraction network, the region extraction network extracts positive candidate text regions. Because the positive text quadrilaterals regressed on the different feature maps of the same test picture I_tstk usually overlap each other, non-maximum suppression is then applied to the positions of all positive text quadrilaterals, with the following steps: 1) a predicted text bounding box is kept if and only if its text classification score P_rcnn ≥ 0.5; 2) non-maximum suppression (NMS) with a Jaccard coefficient threshold of 0.2 is applied to the boxes kept in the previous step, yielding the finally retained positive text quadrilateral bounding boxes. Fixed-scale features are then extracted from the filtered positive text quadrilateral bounding boxes and fed into the multidirectional rectangle prediction network to predict Y_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ). The predicted multidirectional text bounding box is computed from the predicted center coordinates, width, height and rotation angle of the multidirectional rectangle; the multidirectional text features are rotated into horizontal features according to the predicted multidirectional text bounding boxes and fed into the boundary point detection network, which predicts the regression amounts Y_bp = {(Δx_i, Δy_i) | i ∈ [0, 14)} of the 7 boundary points on each of the upper and lower boundaries. Combining the 14 preset default boundary points, the boundary point coordinates within the horizontal box are computed with the formula in (1.2.2), and the predicted boundary point coordinates are then rotated counterclockwise by θ using the predicted rotation angle of the multidirectional rectangle to obtain the boundary point positions in the original image.
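The score filtering and NMS described in the two numbered steps above can be sketched as follows; axis-aligned IoU is used for simplicity, whereas the actual boxes are quadrilaterals, so this is an approximation for illustration only.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_and_nms(boxes, scores, score_thr=0.5, iou_thr=0.2):
    """Keep boxes with score >= 0.5, then greedy NMS at IoU threshold 0.2."""
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= score_thr]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep

boxes = np.array([[0, 0, 100, 40], [5, 2, 102, 42], [200, 50, 280, 90]], float)
scores = np.array([0.92, 0.81, 0.40])
print(filter_and_nms(boxes, scores))   # second box suppressed, third below threshold
```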
(2.2) According to the boundary points of the text instance predicted in step (2.1), a sampling grid is generated with the thin plate spline interpolation algorithm and the text features of arbitrary shape are rectified into horizontal ones. The rectified text feature resolution is 16 × 64, and the feature map is fed into the sequence recognition network to obtain a probability distribution sequence {p_0, p_1, …, p_(N-1)}, where p_i denotes the probability distribution predicted by the RNN at each step, with dimension 63, and N denotes the maximum step size of the RNN, taken as 35. During testing, prediction stops as soon as the prediction at step k is the stop character, so the finally predicted probability distribution sequence is {p_0, p_1, …, p_(k-1)}. According to these probability distributions, the class of maximum probability at each step is the character predicted for that step, and the predicted character sequence S_q is finally obtained.
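The per-step decoding just described (take the class of maximum probability at each step and stop at the stop character) can be sketched as below; the ordering of the 62 characters in the alphabet is an assumption.

```python
import numpy as np

# 62 characters followed by a stop symbol; the ordering is an assumption.
ALPHABET = list("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
STOP = 62

def greedy_decode(prob_seq):
    """prob_seq: (N, 63) per-step probability distributions output by the RNN.
    Returns the predicted character sequence, stopping at the stop symbol."""
    chars = []
    for p in prob_seq:
        k = int(np.argmax(p))
        if k == STOP:
            break
        chars.append(ALPHABET[k])
    return "".join(chars)

# Toy example: three confident steps ('c', 'a', 't') followed by the stop symbol.
steps = np.full((4, 63), 1e-4)
for t, k in enumerate([12, 10, 29, STOP]):
    steps[t, k] = 1.0
print(greedy_decode(steps))   # -> "cat"
```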
Through the technical scheme, compared with the prior art, the invention has the following technical effects:
(1) the accuracy is high: aiming at the problem of recognizing texts in any shapes in scene texts, the method converts the texts in any shapes into horizontal texts by predicting boundary points of the texts, and more accurately detects the text positions and recognizes the texts.
(2) The speed is high: the detection and recognition model provided by the invention has the advantages that the detection and recognition accuracy is ensured, the training speed is high, iterative training is not needed, and the whole network can be trained end to end.
(3) The universality is strong: the invention discloses an end-to-end trainable text detection and recognition model, which can not only simultaneously detect and recognize texts, but also process texts in various shapes without marking at a character level, including horizontal, directional and curved texts;
(4) The robustness is strong: the invention can cope with changes in text scale and shape, and can simultaneously detect and recognize horizontal, oriented and curved text.
Drawings
FIG. 1 is a flowchart of a method for recognizing a scene text end-to-end based on boundary point detection according to the present invention, in which a solid arrow represents training and a dotted arrow represents testing;
FIG. 2 is a diagram of an end-to-end recognition network model for scene text based on boundary point detection according to the present invention;
FIG. 3 is a schematic diagram of a network structure of a feature pyramid structure module in an end-to-end scene text recognition model based on boundary point detection according to the present invention;
FIG. 4 is a diagram of a sequence recognition network structure based on attention mechanism in a scene text end-to-end recognition model based on boundary point detection according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The technical terms of the present invention are explained and explained first:
ResNet-50: a neural network for classification, consisting mainly of 50 convolutional layers together with pooling layers and shortcut connections. The convolutional layers extract picture features; the pooling layers reduce the dimensionality of the feature vectors output by the convolutional layers and reduce overfitting; the shortcut connections propagate gradients and alleviate the vanishing and exploding gradient problems. The network parameters can be updated with the back-propagation algorithm;
Region extraction network: a network for generating candidate text regions. A sliding window is used on the extracted feature map to produce fully connected features of a specific dimension, from which two fully connected branches classify and regress candidate text regions; finally, candidate text regions of different scales and aspect ratios are generated for the subsequent networks according to the different anchors and ratios.
Jaccard coefficient: the Jaccard coefficient measures the similarity and difference between finite sample sets. In the field of text detection the Jaccard coefficient is by default equal to the IoU (Intersection over Union), i.e. the intersection area divided by the union area of two boxes, and describes the overlap between a predicted text box produced by the model and the original labeled text box; the larger the IoU, the higher the overlap and the more accurate the detection.
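For the quadrilateral text boxes used here, the Jaccard coefficient can be computed from polygon areas; the sketch below uses the shapely library (an assumed dependency, purely for illustration).

```python
from shapely.geometry import Polygon

def jaccard(quad_a, quad_b):
    """Jaccard coefficient (IoU) of two quadrilaterals given as 4 (x, y) vertices."""
    pa, pb = Polygon(quad_a), Polygon(quad_b)
    inter = pa.intersection(pb).area
    union = pa.union(pb).area
    return inter / union if union > 0 else 0.0

a = [(0, 0), (100, 0), (100, 40), (0, 40)]
b = [(10, 5), (110, 5), (110, 45), (10, 45)]
print(jaccard(a, b))   # overlap ratio of the two label boxes
```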
Non-maximum suppression (NMS): a post-processing algorithm widely used in computer vision detection. According to a set threshold, overlapping detection boxes are filtered by iteratively sorting, traversing and discarding, removing redundant detection boxes to obtain the final detection result.
Thin plate spline interpolation algorithm (TPS): an interpolation method that finds a smooth surface of minimal bending energy passing through all control points. With this algorithm, text of arbitrary shape can be transformed into a horizontal shape with minimal distortion of the characters as a whole.
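As a small illustration of thin plate spline warping from control points, the sketch below fits a TPS mapping with SciPy's radial basis interpolator, treating equally spaced points on a horizontal rectangle and the detected boundary points as corresponding control points. This is a simplified illustration under assumed toy data, not the patented implementation.

```python
import numpy as np
from scipy.interpolate import Rbf

# Source: 2K detected boundary points (here a slightly curved toy example).
xs = np.linspace(0, 100, 7)
src = np.concatenate([np.stack([xs, 10 + 5 * np.sin(xs / 30)], 1),   # upper boundary
                      np.stack([xs, 40 + 5 * np.sin(xs / 30)], 1)])  # lower boundary
# Target: the same points laid out on a horizontal 16 x 64 rectangle.
txs = np.linspace(0, 64, 7)
dst = np.concatenate([np.stack([txs, np.zeros(7)], 1),
                      np.stack([txs, np.full(7, 16.0)], 1)])

# Thin-plate-spline mapping from rectified coordinates back to image coordinates.
fx = Rbf(dst[:, 0], dst[:, 1], src[:, 0], function='thin_plate')
fy = Rbf(dst[:, 0], dst[:, 1], src[:, 1], function='thin_plate')
print(fx(32, 8), fy(32, 8))   # image location sampled for the rectified center pixel
```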
As shown in fig. 1, the method for recognizing a scene text end-to-end based on boundary point detection of the present invention includes the following steps:
(1) training a scene text end-to-end recognition network model based on boundary point detection, comprising the following sub-steps:
(1.1) carrying out word-level labeling on texts in any shapes of all pictures in an original data set, wherein labels are the clockwise vertex coordinates of polygons of text bounding boxes in word level and word character sequences of the texts, and obtaining a standard training data set with labels;
and (1.2) defining a scene text end-to-end identification network model based on boundary point detection, wherein the model consists of a characteristic pyramid structure network, a region extraction network, a multidirectional rectangular detection network, a boundary point detection network and a sequence identification network based on an attention mechanism. Calculating a training label according to the standard training data set with the label in the step (1.1), designing a loss function, and training the scene text end-to-end recognition network based on the boundary point detection by using a reverse conduction method to obtain a scene text end-to-end recognition network model based on the boundary point detection; the method specifically comprises the following substeps:
(1.2.1) constructing a scene text end-to-end identification network model based on boundary point detection, wherein the identification network model consists of a characteristic pyramid structure network, a region extraction network, a multidirectional rectangular detection network, a boundary point detection network and a sequence identification network based on an attention mechanism, as shown in fig. 2; the feature pyramid structure network is shown in fig. 3, and is formed by adding a bottom-up connection, a top-down connection and a transverse connection to a base network of a ResNet-50 deep convolutional neural network, and is used for extracting features fused with different resolutions from an input standard data set picture; inputting the extracted features of different scales into a region extraction network to obtain a candidate text region, and after the alignment operation of the region of interest, obtaining the candidate text region of a fixed scale; inputting a candidate text region with the resolution of 7 multiplied by 7 extracted by a region extraction network into a rapid region classification regression network, predicting the probability that the input candidate text region is a positive sample through classification branches, providing a more accurate candidate text region, calculating the offset of the candidate text region relative to a real text region through regression branches, and adjusting the position of the candidate text region; the multidirectional rectangle detection network is composed of 3 full-connection layers FC1, FC2 and FC3, and outputs a prediction vector with dimension 5, which respectively represents the offset of the center of a candidate text region from the center of a minimum circumscribed rectangle, the width and height of the minimum circumscribed rectangle and the rotation angle of the minimum circumscribed rectangle. The boundary point detection network is composed of 4 convolutional layers Conv1, Conv2, Conv3 and Conv4 and a full-connection layer, and outputs a vector with dimension of 28, wherein the vector respectively represents the offset of 7 boundary points of the upper boundary and the lower boundary of the text example; the attention-based sequence recognition network is shown in fig. 4 and is composed of three convolutional layers and an attention-based model, and the attention-based model outputs a probability distribution of a predicted character at each step.
(1.2.2) generating a horizontal initial bounding box, a multidirectional rectangular bounding box and the upper and lower boundary points of each text instance on the original image according to the labeled standard training set and the feature maps, and providing training labels for the region extraction network, the multidirectional rectangle detection network and the boundary point detection network respectively: for the labeled standard training set I_tr, the ground-truth label of an input picture I_tri contains polygons P = {P_1, P_2 … P_m} representing the text regions and character strings S = {s_1, s_2 … s_m} representing the text content, where P_i is the polygonal bounding box of a text region in picture I_tri, p_ij = (x_ij, y_ij) is the coordinate of the j-th vertex of polygon P_i, m is the number of polygonal text label boxes, and s_i is the text content of polygon P_i.
For a given standard dataset I_tr, the polygon P = {p_1, p_2 … p_m} in the dataset label is first converted into the smallest horizontal rectangular bounding box of the polygonal text label box, denoted G_d(x, y, h, w) and represented by the rectangle's center point (x, y), height h and width w. For the region extraction network, according to the labeled bounding boxes G_d(x, y, h, w), each pixel of each feature map output by the feature pyramid is mapped back onto the original image, and a number of initial bounding boxes are generated from the candidate text regions predicted by the region extraction network. The Jaccard coefficient of each initial bounding box Q_0 with respect to the labeled bounding boxes G_d is then computed. If the Jaccard coefficients between Q_0 and all labeled bounding boxes G_d are less than 0.5, the initial bounding box Q_0 is labeled as negative (non-text) and the class label P_rpn takes the value 0; otherwise, i.e. there is at least one labeled bounding box G_d whose Jaccard coefficient with Q_0 is not less than 0.5, Q_0 is labeled as positive (text), the class label P_rpn takes the value 1, and the position offsets are computed relative to the labeled box with the maximum Jaccard coefficient according to the following formulas:

x = x_0 + w_0·Δx
y = y_0 + h_0·Δy
w = w_0·exp(Δw)
h = h_0·exp(Δh)

where x_0, y_0 are the abscissa and ordinate of the center point of the initial bounding box Q_0, w_0, h_0 are the width and height of Q_0, Δx, Δy are the horizontal and vertical offsets of the center point of Q_0 relative to the center point of G_d, and exp is the exponential function. The training label of the region extraction network is then:

gt_rpn = (Δx_rpn, Δy_rpn, Δh_rpn, Δw_rpn, P_rpn)
For the multidirectional rectangle detection network, the polygon P = {p_1, p_2 … p_m} in the dataset label is first converted into the minimum multidirectional rectangular bounding box of the polygonal text label box, denoted G_rotate(x, y, h, w, θ) and represented by the rectangle's center point (x, y), height h, width w and rotation angle θ. Let the candidate text region refined by the region extraction network be G_rpn(x_rpn, y_rpn, w_rpn, h_rpn). The predicted position offsets are computed from the following formulas:

x = x_rpn + w_rpn·Δx_or
y = y_rpn + h_rpn·Δy_or
w = w_rpn·exp(Δw_or)
h = h_rpn·exp(Δh_or)

From these formulas, the training label of the multidirectional rectangle detection network is obtained as:

gt_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ)
For the boundary point detection network, the training labels are computed as follows:

a. Setting default boundary points: according to the detected multidirectional rectangular bounding box G_rotate(x, y, h, w, θ), the rectangle is rotated clockwise by θ degrees to obtain a horizontal bounding box G_horizon(x, y, h, w), and K boundary points are sampled at equal intervals on each long side of the horizontal bounding box, giving an upper and a lower default boundary point sequence P_du = {p_1, p_2 … p_K} and P_dd = {p_1, p_2 … p_K}, with P_d = P_du ∪ P_dd.

b. Generating target boundary points:
a) The polygon P is first split along its long sides into two point sets P_1 = {p_1, p_2 … p_l} and P_2 = {p_l+1, …, p_m}, where p denotes a vertex of the polygon.
b) P_1 and P_2 are fed into Algorithm 1, which generates the boundary points of the upper and lower boundaries: P_tu = {p_1, p_2 … p_K} and P_td = {p_1, p_2 … p_K}, with P_t = P_tu ∪ P_td.

c. The training label gt_bp = {(Δx_i, Δy_i) | i ∈ [0, 2K)} is computed as

Δx_i = x_i^t - x_i^d,  Δy_i = y_i^t - y_i^d

where (x_i^t, y_i^t) and (x_i^d, y_i^d) denote the coordinates of the i-th target boundary point and the i-th default boundary point respectively.

For the attention-based sequence recognition network, each text instance in the input image is annotated with a corresponding character string of length n, s_i = (c_0, c_1, …, c_(n-1)) with c_i ∈ {0, 1, …, 9, a, b, …, z, A, B, …, Z}, describing the text content. The training label of the recognition network is gt_recog = (onehot(c_0), onehot(c_1), …, onehot(c_(n-1))), where onehot(c_i) denotes converting character c_i into one-hot form. Combining the above, the final training label is gt = {gt_rpn, gt_or, gt_bp, gt_recog}.
(1.2.3) Taking the standard training dataset I_tr as the input of the recognition network model, features are extracted with the feature pyramid network module: the images of the standard training dataset I_tr are fed into the bottom-up ResNet-50 structure of the feature pyramid network, the convolutional layer units that do not change the feature map size are defined as one level (levels {P2, P3, P4, P5, P6}), and the final output convolutional feature F of each level is extracted. The top-down connections of the feature pyramid module upsample the output convolutional features of ResNet-50 to generate multi-scale upsampled features, and the lateral connections fuse the feature of each level upsampled in the top-down pass with the feature generated in the bottom-up pass to produce the final features {F2, F3, F4, F5, F6}, as shown in fig. 3.
(1.2.4) The features extracted by the feature pyramid network are fed into the region extraction network, anchors are assigned, the feature maps are adjusted with the region-of-interest alignment method, and candidate text boxes are generated: for an input picture I_trk, the 5 stage features {F2, F3, F4, F5, F6} are extracted through the feature pyramid network, and anchors are defined per stage {P2, P3, P4, P5, P6} with feature scales of 32², 64², 128², 256², 512² for the different stages and 5 aspect ratios {1:5, 1:2, 1:1, 2:1, 5:1} at each scale; in this way 25 candidate text boxes of different scales and aspect ratios {F_tr1, F_tr2, …, F_tr25}, denoted F_trp with subscript p = 1, …, 25, can be extracted. In the region extraction network, the probability P_rpn that each candidate text box is a correct text region bounding box is predicted by the classification branch, and the candidate text box offsets are predicted by the regression branch:

Y_rpn = (Δx_rpn, Δy_rpn, Δh_rpn, Δw_rpn)

The candidate text boxes predicted as correct text region bounding boxes are selected and fed to the subsequent multidirectional rectangle detection network, boundary point detection network and attention-based sequence recognition network. The correct text regions selected by the region extraction network are converted into candidate text regions of fixed 7 × 7 scale by the region-of-interest alignment operation, and the multidirectional rectangle prediction network predicts the multidirectional bounding box of the text instance within each fixed-scale candidate text region. Specifically, the multidirectional rectangle prediction network predicts Y_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ), comprising 4 predicted offsets and one predicted angle; by computing the loss function and back-propagating, the network finally learns to predict the multidirectional bounding box of the text instance.
(1.2.5) After the multidirectional bounding box of each text instance has been predicted by the multidirectional rectangle prediction network, a candidate text region of fixed 7 × 7 scale is generated by a rotated region-of-interest alignment operation. The boundary point prediction network outputs 28 predicted regression offsets Y_bp = {(Δx_i, Δy_i) | i ∈ [0, 14)}; by computing the loss function and back-propagating, the network finally learns to predict the boundary points of the text instance.
(1.2.6) After the boundary points of each text instance have been predicted by the boundary point prediction network, a sampling grid is generated by the thin plate spline interpolation algorithm, and the text features of arbitrary shape are rectified into a horizontal feature map of fixed 16 × 64 scale. This feature map is fed into the attention-based sequence recognition network to predict the text content. As shown in fig. 4, the recognition network consists of 3 convolutional layers and an RNN whose basic unit is the GRU. After the 3 convolutional layers the text feature resolution is 2 × 32; at each step the RNN model outputs a probability distribution of dimension 63 (62 characters plus a stop character), each dimension taking a value in [0, 1] and the values summing to 1. The predicted probability distributions P_recog of all steps are combined with the beam search algorithm to predict the character sequence S_q.
(1.2.7) Taking the training label gt computed in step (1.2.2) as the expected output of the network and the prediction labels from steps (1.2.4), (1.2.5) and (1.2.6) as the network prediction output, an objective loss function between the expected output and the prediction output is designed for the network model constructed in (1.2.1). The overall objective loss function is composed of the losses of the region extraction network, the multidirectional rectangle prediction network, the boundary point prediction network and the sequence recognition network, and is expressed as:

L(P_rpn, Y_rpn, Y_or, Y_bp, P_recog) = L_rpn(P_rpn, Y_rpn) + α1·L_or(Y_or) + α2·L_bp(Y_bp) + α3·L_recog(P_recog)

where L_rpn(P_rpn, Y_rpn) is the loss function of the region extraction network, L_or(Y_or) is the loss function of the multidirectional rectangle detection network, L_bp(Y_bp) is the loss function of the boundary point detection network, and L_recog(P_recog) is the loss function of the sequence recognition network; α1, α2 and α3 are the weight coefficients of L_or, L_bp and L_recog respectively, and are all simply set to 1.

According to the designed overall objective loss function, the model is trained iteratively with the back-propagation algorithm to minimize the overall objective loss and obtain the optimal network model. For the scene text detection and recognition task, training first runs on the synthetic text dataset (SynthText) to obtain initial network parameters, and then continues on real datasets to fine-tune the network parameters.
The character recognition is carried out on the text picture to be recognized by utilizing the trained model, and the character recognition method comprises the following substeps:
(2.1) The features extracted from the scene text picture to be detected and recognized are fed in turn into the region extraction network and the multidirectional rectangle detection network to generate multidirectional candidate text regions, and non-maximum suppression is applied to filter them and obtain more accurate multidirectional candidate text regions: a picture I_tstk of the dataset I_tst to be detected is fed into the model trained in step (1.2); after the feature pyramid network and the region extraction network, the region extraction network extracts positive candidate text regions. Because the positive text quadrilaterals regressed on the different feature maps of the same test picture I_tstk usually overlap each other, non-maximum suppression is then applied to the positions of all positive text quadrilaterals, with the following steps: 1) a predicted text bounding box is kept if and only if its text classification score P_rcnn ≥ 0.5; 2) non-maximum suppression (NMS) with a Jaccard coefficient threshold of 0.2 is applied to the boxes kept in the previous step, yielding the finally retained positive text quadrilateral bounding boxes. Fixed-scale features are then extracted from the filtered positive text quadrilateral bounding boxes and fed into the multidirectional rectangle prediction network to predict Y_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ). The predicted multidirectional text bounding box is computed from the predicted center coordinates, width, height and rotation angle of the multidirectional rectangle; the multidirectional text features are rotated into horizontal features according to the predicted multidirectional text bounding boxes and fed into the boundary point detection network, which predicts the regression amounts Y_bp = {(Δx_i, Δy_i) | i ∈ [0, 14)} of the 7 boundary points on each of the upper and lower boundaries. Combining the 14 preset default boundary points, the boundary point coordinates within the horizontal box are computed with the formula in (1.2.2), and the predicted boundary point coordinates are then rotated counterclockwise by θ using the predicted rotation angle of the multidirectional rectangle to obtain the boundary point positions in the original image.
(2.2) According to the boundary points of the text instance predicted in step (2.1), a sampling grid is generated with the thin plate spline interpolation algorithm and the text features of arbitrary shape are rectified into horizontal ones. The rectified text feature resolution is 16 × 64, and the feature map is fed into the sequence recognition network to obtain a probability distribution sequence {p_0, p_1, …, p_(N-1)}, where p_i denotes the probability distribution predicted by the RNN at each step, with dimension 63, and N denotes the maximum step size of the RNN, taken as 35. During testing, prediction stops as soon as the prediction at step k is the stop character, so the finally predicted probability distribution sequence is {p_0, p_1, …, p_(k-1)}. According to these probability distributions, the class of maximum probability at each step is the character predicted for that step, and the predicted character sequence S_q is finally obtained.
[Algorithm 1, referenced in step b) above, is given as an image in the original document.]

Claims (10)

1. A scene text end-to-end identification method based on boundary point detection is characterized by comprising the following steps:
(1) training a scene text end-to-end recognition network model based on boundary point detection, comprising the following sub-steps:
(1.1) carrying out word-level labeling on texts in any shapes of all pictures in an original data set, wherein labels are the clockwise vertex coordinates of polygons of text bounding boxes in word level and word character sequences of the texts, and obtaining a standard training data set with labels;
(1.2) defining a scene text end-to-end recognition network model based on boundary point detection, calculating a training label according to (1.1) a standard training data set with labels, designing a loss function, and training the scene text end-to-end recognition network based on boundary point detection by using a reverse conduction method to obtain the scene text end-to-end recognition network model based on boundary point detection; the method comprises the following steps:
(1.2.1) constructing a scene text end-to-end identification network model based on boundary point detection, wherein the identification network model consists of a characteristic pyramid structure network, a region extraction network, a multidirectional rectangular detection network, a boundary point detection network and a sequence identification network based on an attention mechanism;
(1.2.2) generating a horizontal initial bounding box, a multidirectional rectangular bounding box and upper and lower boundary points of a text example on an original image according to a standard training set with labels and a characteristic diagram, and respectively providing training labels for the area extraction network, the multidirectional rectangular detection network, the boundary point detection network and a sequence identification network based on an attention mechanism;
(1.2.3) taking the standard training dataset I_tr as the input of the recognition network model and extracting features with the feature pyramid network module;
(1.2.4) inputting the features extracted by the feature pyramid network into a region extraction network, and generating a candidate text box by using a region-of-interest alignment method to adjust a feature map through anchor point distribution; generating a candidate text region with a fixed scale of 7 multiplied by 7 by a correct text region selected by the region extraction network through region-of-interest alignment operation, and predicting a multidirectional bounding box of a text example in the candidate text region with the fixed scale by a multidirectional rectangular prediction network;
(1.2.5) after a multidirectional bounding box of each text example is predicted by the multidirectional rectangular prediction network, generating a candidate text region with a fixed scale of 7 x 7 through a rotary region-of-interest alignment operation, and finally learning and predicting boundary points of the text examples by the network;
(1.2.6) after the boundary points of each text instance have been predicted by the boundary point prediction network, generating a sampling grid with a thin plate spline interpolation algorithm, rectifying the text features of arbitrary shape into a horizontal feature map of fixed 16 × 64 scale, feeding the feature map into the attention-based sequence recognition network to predict the text content, and predicting the character sequence S_q from the prediction probability distributions P_recog of all steps;
(1.2.7) taking the training label gt as the expected output of the network and the prediction labels as the network prediction output, and designing an objective loss function between the expected output and the prediction output for the constructed network model;
(2) the character recognition is carried out on the text picture to be recognized by utilizing the trained model, and the character recognition method comprises the following substeps:
(2.1) sequentially inputting the extracted features of the text picture of the scene to be detected and identified into a region extraction network and a multidirectional rectangular detection network to generate a multidirectional candidate text region, and filtering the multidirectional candidate text region by carrying out non-maximum suppression operation to obtain a more accurate multidirectional candidate text region; rotating the multi-directional text characteristics into horizontal characteristics according to the predicted multi-directional text bounding boxes, and inputting the horizontal characteristics into a boundary point detection network; calculating coordinates of the boundary points in the horizontal frame by using a formula in (1.2.2) in combination with 14 preset default boundary points, and then rotating the predicted coordinates of the boundary points counterclockwise by theta by using the rotation angle of the multidirectional rectangle predicted in (2.1) to obtain the positions of the boundary points in the original image;
(2.2) generating a sampling grid with the thin-plate spline interpolation algorithm according to the boundary points of the text instance predicted in step (2.1), rectifying the text features of arbitrary shape into a horizontal shape, inputting the feature map into the sequence recognition network to obtain a probability distribution sequence, taking the class with the maximum probability at each step as the current predicted character according to the probability distributions, and finally obtaining the predicted character sequence S_q.
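As an illustration only (not the patented implementation), the counterclockwise rotation of boundary points by θ mentioned in step (2.1) can be sketched as a plain coordinate rotation about the rectangle centre; the function name, the use of radians and the toy numbers below are assumptions.

```python
import numpy as np

def rotate_points_ccw(points, center, theta):
    """Rotate an (N, 2) array of (x, y) points counterclockwise by theta
    (radians) around `center`, e.g. to map boundary points predicted in the
    horizontal frame back onto the original, rotated text instance."""
    points = np.asarray(points, dtype=np.float64)
    cx, cy = center
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    # Standard 2-D rotation about an arbitrary centre.
    shifted = points - np.array([cx, cy])
    rotated = np.stack([
        cos_t * shifted[:, 0] - sin_t * shifted[:, 1],
        sin_t * shifted[:, 0] + cos_t * shifted[:, 1],
    ], axis=1)
    return rotated + np.array([cx, cy])

# Example: 14 boundary points in the horizontal frame, box centre (100, 50),
# predicted rotation angle of 30 degrees.
pts_horizontal = np.random.rand(14, 2) * [128, 32] + [36, 34]
pts_original = rotate_points_ccw(pts_horizontal, center=(100, 50),
                                 theta=np.deg2rad(30))
```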
2. The method for end-to-end recognition of scene text based on boundary point detection according to claim 1, wherein the scene text end-to-end recognition network model based on boundary point detection in step (1.2.1) is specifically:
the scene text end-to-end recognition network model based on boundary point detection is composed of a feature pyramid structure network, a region extraction network, a multidirectional rectangular detection network, a boundary point detection network and an attention-based sequence recognition network; the feature pyramid structure network takes a ResNet-50 deep convolutional neural network as its base network and adds bottom-up connections, top-down connections and lateral connections, and is used for extracting and fusing features of different resolutions from the input standard data set pictures; the extracted features of different scales are input into the region extraction network to obtain candidate text regions, and candidate text regions of a fixed scale are obtained after the region-of-interest alignment operation; the 7 × 7 candidate text regions extracted by the region extraction network are input into a fast region classification and regression network, whose classification branch predicts the probability that an input candidate text region is a positive sample, providing more accurate candidate text regions, and whose regression branch computes the offset of the candidate text region relative to the real text region and adjusts the position of the candidate text region; the multidirectional rectangular detection network comprises 3 fully connected layers FC1, FC2 and FC3 and outputs a prediction vector of dimension 5, representing respectively the offset of the center of the candidate text region from the center of the minimum circumscribed rectangle, the width and height of the minimum circumscribed rectangle, and the rotation angle of the minimum circumscribed rectangle; the boundary point detection network comprises 4 convolutional layers Conv1, Conv2, Conv3 and Conv4 and one fully connected layer, and outputs a vector of dimension 28 representing the offsets of the 7 boundary points on each of the upper and lower boundaries of the text instance; the attention-based sequence recognition network is composed of three convolutional layers and an attention model, and the attention model outputs the probability distribution of the predicted character at each step.
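A minimal PyTorch sketch of the two prediction heads described above follows; only the output dimensions (5 and 28), the 3-FC and 4-conv-plus-FC layouts and the 7 × 7 RoI size come from the claim, while the channel widths, hidden sizes and activation choices are assumptions.

```python
import torch
import torch.nn as nn

class OrientedRectHead(nn.Module):
    """3 fully connected layers ending in a 5-dim vector:
    (dx, dy, dw, dh, theta) of the minimum circumscribed rectangle."""
    def __init__(self, in_features=7 * 7 * 256, hidden=1024):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, 5)

    def forward(self, x):              # x: (N, in_features) pooled RoI feature
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)             # (N, 5)

class BoundaryPointHead(nn.Module):
    """4 convolutional layers followed by one FC layer producing a 28-dim
    vector: x/y offsets of 7 upper and 7 lower boundary points."""
    def __init__(self, in_channels=256, roi_size=7):
        super().__init__()
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(in_channels, in_channels, 3, padding=1),
                          nn.ReLU(inplace=True))
            for _ in range(4)
        ])
        self.fc = nn.Linear(in_channels * roi_size * roi_size, 28)

    def forward(self, x):              # x: (N, C, 7, 7) rotated RoI feature
        x = self.convs(x)
        return self.fc(x.flatten(1))   # (N, 28)
```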
3. The method for recognizing the scene text end-to-end based on the boundary point detection as claimed in claim 1 or 2, wherein the step (1.2.2) is specifically as follows:
for the labeled standard training set I_tr, the ground-truth label of an input picture contains a polygon set P = {p_1, p_2, ..., p_m} representing the text regions and a character string set S = {s_1, s_2, ..., s_m} representing the text content; for an input picture I_tri, P_i is the polygonal bounding box of a text region in picture I_tri, p_ij = (x_ij, y_ij) is the coordinate of the j-th vertex of polygon P_i, m denotes the number of polygonal text label boxes, and s_i is the text content inside polygon P_i;
for a given standard data set I_tr, the polygon P = {p_1, p_2, ..., p_m} in the data set label is first converted into the minimum horizontal rectangular bounding box of the polygonal text label box, represented by the center point (x, y), height h and width w of the rectangle as G_d(x, y, h, w); for the region extraction network, according to the labeled bounding boxes G_d(x, y, h, w) of the labeled data set, each pixel on each of the feature maps output by the feature pyramid is mapped back to the original image, a number of initial bounding boxes are generated according to the candidate text regions predicted by the region extraction network, and the Jaccard coefficient of each initial bounding box Q_0 with respect to the labeled bounding boxes G_d is computed; when the Jaccard coefficients between all labeled bounding boxes G_d and the initial bounding box Q_0 are less than 0.5, the initial bounding box Q_0 is labeled as negative (non-text) and the class label P_rpn takes the value 0; otherwise, i.e., when at least one labeled bounding box G_d has a Jaccard coefficient with Q_0 of not less than 0.5, Q_0 is labeled as positive (text), the class label P_rpn takes the value 1, and the position offsets are computed relative to the labeled box with the maximum Jaccard coefficient according to the following formulas (an encode/decode sketch of this parameterisation follows this claim):
x = x_0 + w_0·Δx
y = y_0 + h_0·Δy
w = w_0·exp(Δw)
h = h_0·exp(Δh)
where x_0 and y_0 are respectively the abscissa and ordinate of the center point of the initial bounding box Q_0, w_0 and h_0 are respectively the width and height of the initial bounding box Q_0, Δx and Δy are respectively the horizontal and vertical offsets of the center point of Q_0 relative to the center point of G_d, and exp denotes the exponential operation; the training label of the region extraction network is thus obtained as:
gt_rpn = (Δx_rpn, Δy_rpn, Δh_rpn, Δw_rpn, P_rpn)
for the multidirectional rectangular detection network, the polygon P = {p_1, p_2, ..., p_m} in the data set label is first converted into the minimum multidirectional rectangular bounding box of the polygonal text label box, represented by the center point (x, y), height h, width w and rotation angle θ of the rectangle as G_rotate(x, y, h, w, θ); the candidate text region corrected by the region extraction network is G_rpn(x_rpn, y_rpn, w_rpn, h_rpn), and the predicted position offsets are computed as follows:
x = x_rpn + w_rpn·Δx_or
y = y_rpn + h_rpn·Δy_or
w = w_rpn·exp(Δw_or)
h = h_rpn·exp(Δh_or)
the training label of the multidirectional rectangular detection network obtained by the formula is as follows:
gt_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ)
for the boundary point detection network, the training labels are calculated as follows:
a. setting default boundary points:
according to the detected multidirectional rectangular bounding box G_rotate(x, y, h, w, θ), the rectangle is rotated clockwise by θ degrees to obtain a horizontal bounding box G_horizon(x, y, h, w); K boundary points are sampled at equal intervals on each long side of the horizontal bounding box, giving an upper and a lower default boundary point sequence P_du = {p_1, p_2, ..., p_K} and P_dd = {p_1, p_2, ..., p_K}, with P_d = P_du ∪ P_dd;
b. generating target boundary points:
first, the polygon P is divided along the long side into two parts, P_1 = {p_1, p_2, ..., p_l} and P_2 = {p_(l+1), ..., p_m}, where p denotes a vertex of the polygon;
according to P_1 and P_2, boundary points of the upper and lower boundaries are generated: P_tu = {p_1, p_2, ..., p_K} and P_td = {p_1, p_2, ..., p_K}, with P_t = P_tu ∪ P_td;
c. calculating the training label gt_bp = {(Δx_i, Δy_i) | i ∈ [0, 2K)} according to the following formulas:
Δx_i and Δy_i are given by the formula images in the original claim, in which the two coordinate symbols denote respectively the coordinates of the i-th target boundary point and the coordinates of the i-th default boundary point;
for the attention-based sequence recognition network, each text instance in the input image is labeled with a character string of length n, s_i = {(c_0, c_1, ..., c_(n-1)) | c_i ∈ {0, 1, ..., 9, a, b, ..., z, A, B, ..., Z}}, describing the text content; the training label of the recognition network is gt_recog = (onehot(c_0), onehot(c_1), ..., onehot(c_(n-1))), where onehot(c_i) denotes converting the character c_i into one-hot encoded form;
the final training label is generated as: gt = {gt_rpn, gt_or, gt_bp, gt_recog}.
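The horizontal-box offset parameterisation used above (x = x_0 + w_0·Δx, y = y_0 + h_0·Δy, w = w_0·exp(Δw), h = h_0·exp(Δh)) can be illustrated with a small encode/decode pair; this is a sketch of the stated formulas only, and the function names are ours.

```python
import numpy as np

def encode_offsets(anchor, target):
    """Anchor/target boxes given as (cx, cy, w, h); returns (dx, dy, dw, dh)
    such that decode_offsets(anchor, offsets) recovers the target box,
    matching x = x0 + w0*dx, y = y0 + h0*dy, w = w0*exp(dw), h = h0*exp(dh)."""
    x0, y0, w0, h0 = anchor
    x, y, w, h = target
    return np.array([(x - x0) / w0, (y - y0) / h0,
                     np.log(w / w0), np.log(h / h0)])

def decode_offsets(anchor, offsets):
    """Inverse of encode_offsets: apply predicted offsets to an anchor box."""
    x0, y0, w0, h0 = anchor
    dx, dy, dw, dh = offsets
    return np.array([x0 + w0 * dx, y0 + h0 * dy,
                     w0 * np.exp(dw), h0 * np.exp(dh)])

anchor = (50.0, 40.0, 64.0, 32.0)
gt_box = (58.0, 44.0, 80.0, 30.0)
assert np.allclose(decode_offsets(anchor, encode_offsets(anchor, gt_box)), gt_box)
```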
4. The method for recognizing the scene text end-to-end based on the boundary point detection as claimed in claim 1 or 2, wherein the step (1.2.3) is specifically as follows:
the pictures of the standard training data set I_tr are input into the bottom-up ResNet-50 structure of the feature pyramid network; the groups of convolutional layers that do not change the size of the feature map are defined as stages {P2, P3, P4, P5, P6}, and the final output convolutional features F of each stage are extracted; the top-down connections in the feature pyramid network module upsample the output convolutional features of ResNet-50 to generate multi-scale upsampled features, and the lateral connection structure in the feature pyramid network module fuses the features of each stage upsampled in the top-down pass with the features generated in the bottom-up pass to produce the final features {F2, F3, F4, F5, F6}.
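A hedged PyTorch sketch of the top-down and lateral fusion described in this claim follows; the 1 × 1 lateral and 3 × 3 output convolutions, the 256 output channels and nearest-neighbour upsampling follow the common feature-pyramid recipe and are assumptions rather than details taken from the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Fuses bottom-up backbone features {C2..C5} into {F2..F5} via
    1x1 lateral convolutions and top-down nearest-neighbour upsampling."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1)
                                     for c in in_channels)
        self.output = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3,
                                              padding=1) for _ in in_channels)

    def forward(self, feats):                      # feats: [C2, C3, C4, C5]
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 1, 0, -1):  # top-down pathway
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [conv(x) for conv, x in zip(self.output, laterals)]

# Toy usage with feature maps at strides 4, 8, 16, 32 of a 256x256 input.
feats = [torch.randn(1, c, s, s)
         for c, s in zip((256, 512, 1024, 2048), (64, 32, 16, 8))]
f2, f3, f4, f5 = SimpleFPN()(feats)
```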
5. The method for recognizing the scene text end-to-end based on the boundary point detection as claimed in claim 1 or 2, wherein the step (1.2.4) is specifically as follows:
for input picture ItrkExtracting 5 stage features { F2, F3, F4, F5 and F6} through a feature pyramid network, and defining the feature scale of the anchor at different stages as {32 } according to stages { P2, P3, P4, P5 and P6}2,642,1282,2562,5122Each scale layer has 5 length-width ratios {1:5, 1: }2, 1:1, 2:1, 5:1 }; thus, 25 candidate text boxes { Ftr with different scales and proportions can be extracted1,Ftr2,…,Ftr25Is denoted as FtrpSubscript p ═ 1, …, 25; in the region extraction network, the probability that each candidate text box is a correct text region bounding box is predicted to be P through classificationrpnPredicting candidate textbox offsets by regression: y isrpn=(Δxrpn,Δyrpn,Δhrpn,Δwrpn);
the candidate text boxes predicted as correct text region bounding boxes are selected and input to the subsequent multidirectional rectangular detection network, boundary point detection network and attention-based sequence recognition network; the multidirectional rectangular prediction network predicts the quantity Y_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ), which comprises 4 predicted offsets and one predicted angle, and by computing the loss function and back-propagating it the network finally learns to predict the multidirectional bounding box of the text instance.
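The 25 anchor shapes of this claim (5 scales × 5 aspect ratios) can be sketched as follows; interpreting the scale s as an anchor area of s² and the ratio as width:height is our assumption, since the claim only lists the scales and ratios themselves.

```python
import numpy as np

scales = [32, 64, 128, 256, 512]                    # anchor area = scale**2
ratios = [(1, 5), (1, 2), (1, 1), (2, 1), (5, 1)]   # read as width:height

def anchor_shapes(scales, ratios):
    """Return the 25 (w, h) anchor shapes, preserving area = scale**2 for
    every aspect ratio (a common convention, not stated in the claim)."""
    shapes = []
    for s in scales:
        area = float(s * s)
        for rw, rh in ratios:
            w = np.sqrt(area * rw / rh)   # from w/h = rw/rh and w*h = area
            h = area / w
            shapes.append((w, h))
    return np.array(shapes)

print(anchor_shapes(scales, ratios).shape)   # (25, 2)
```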
6. The method for recognizing the scene text end-to-end based on the boundary point detection as claimed in claim 1 or 2, wherein the step (1.2.5) is specifically as follows:
after the multidirectional rectangular prediction network predicts the multidirectional bounding box of each text instance, a candidate text region with a fixed scale of 7 × 7 is generated through the rotated region-of-interest alignment operation; the boundary point prediction network outputs 28 predicted regression offsets Y_bp = {(Δx_i, Δy_i) | i ∈ [0, 14)}, and by computing the loss function and back-propagating it the network finally learns to predict the boundary points of the text instance.
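A sketch of turning the 28 regression outputs of this claim into 14 boundary-point coordinates; since the offset formulas appear as formula images in the original filing, the sketch simply adds the offsets to the default points and assumes pixel units, which is an assumption rather than the patented definition.

```python
import numpy as np

def default_boundary_points(w, h, k=7):
    """K equally spaced default points on the top edge and K on the bottom
    edge of a horizontal w x h box, matching the sampling in claim 3 (a)."""
    xs = np.linspace(0, w, k)
    top = np.stack([xs, np.zeros(k)], axis=1)
    bottom = np.stack([xs, np.full(k, float(h))], axis=1)
    return np.concatenate([top, bottom], axis=0)      # (2K, 2)

def apply_boundary_offsets(defaults, offsets):
    """offsets: flat vector of length 2*2K = 28, ordered (dx_0, dy_0, dx_1, ...).
    Assumed to be in pixels here; the patent defines the exact normalisation."""
    return defaults + np.asarray(offsets, dtype=np.float64).reshape(-1, 2)

defaults = default_boundary_points(w=64, h=16)        # (14, 2)
points = apply_boundary_offsets(defaults, np.zeros(28))
```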
7. The method for recognizing the scene text end-to-end based on the boundary point detection as claimed in claim 1 or 2, wherein the step (1.2.6) is specifically as follows:
a sampling grid is generated by the thin-plate spline interpolation algorithm, and the text features of arbitrary shape are rectified into horizontal features with a fixed scale of 16 × 64; the recognition network is composed of 3 convolutional layers and an RNN whose basic unit is the GRU; after the 3 convolutional layers the resolution of the text features is 2 × 32; at each step the RNN model outputs a probability distribution of dimension 63 (62 characters plus a stop symbol), in which each dimension takes a value in [0, 1] and the values sum to 1; the character sequence S_q is predicted by combining the predicted probability distributions P_recog of all steps with the beam search algorithm.
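This claim combines the per-step distributions P_recog with a beam search; below is a generic beam-search sketch over a fixed sequence of 63-way distributions, where the beam width and the summed log-probability scoring are our own choices rather than details from the patent.

```python
import numpy as np

def beam_search(prob_seq, stop_id, beam_width=5):
    """prob_seq: (T, V) per-step probability distributions.
    Returns the highest-scoring character-id sequence, truncated at stop_id.
    Scores are summed log-probabilities; beams that emit stop_id are frozen."""
    beams = [([], 0.0, False)]            # (ids, log_prob, finished)
    for probs in prob_seq:
        log_p = np.log(probs + 1e-12)
        candidates = []
        for ids, score, done in beams:
            if done:
                candidates.append((ids, score, True))
                continue
            for v in np.argsort(log_p)[-beam_width:]:
                candidates.append((ids + [int(v)], score + log_p[v],
                                   int(v) == stop_id))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    ids, _, _ = beams[0]
    return [i for i in ids if i != stop_id]

T, V = 10, 63
probs = np.random.dirichlet(np.ones(V), size=T)       # dummy distributions
print(beam_search(probs, stop_id=V - 1))
```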
8. The method for recognizing the scene text end-to-end based on the boundary point detection as claimed in claim 1 or 2, wherein the step (1.2.7) is specifically as follows:
taking the training label gt calculated in step (1.2.2) as the expected output of the network and the prediction labels from steps (1.2.4), (1.2.5) and (1.2.6) as the network prediction output, an objective loss function between the expected output and the prediction output is designed for the network model constructed in (1.2.1); the overall objective loss function is composed of the region extraction network, multidirectional rectangular prediction network, boundary point prediction network and sequence recognition network terms, and its expression is: L(P_rpn, Y_rpn, Y_or, Y_bp, P_recog) = L_rpn(P_rpn, Y_rpn) + α_1·L_or(Y_or) + α_2·L_bp(Y_bp) + α_3·L_recog(P_recog), where L_rpn(P_rpn, Y_rpn) is the loss function of the region extraction network, L_or(Y_or) is the loss function of the multidirectional rectangular detection network, L_bp(Y_bp) is the loss function of the boundary point detection network, L_recog(P_recog) is the loss function of the sequence recognition network, and α_1, α_2, α_3 are respectively the weight coefficients of the loss functions L_or, L_bp and L_recog;
according to the designed overall objective loss function, the model is iteratively trained with the back-propagation algorithm so as to minimize the overall objective loss function and obtain the optimal network model; for the scene text detection and recognition task, the training process first iterates on a synthetic text data set to obtain initial network parameters, and then trains on a real data set to fine-tune the network parameters.
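The weighted sum in this claim reduces to a one-line combination; the default weight values below are placeholders, not values disclosed by the patent.

```python
def total_loss(l_rpn, l_or, l_bp, l_recog, alpha1=1.0, alpha2=1.0, alpha3=1.0):
    """L = L_rpn + a1*L_or + a2*L_bp + a3*L_recog, as in the claim above.
    The default weights of 1.0 are placeholders, not values from the patent."""
    return l_rpn + alpha1 * l_or + alpha2 * l_bp + alpha3 * l_recog
```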
9. The method for recognizing the scene text end to end based on the boundary point detection as claimed in claim 1 or 2, wherein the step (2.1) is specifically as follows:
the k-th picture I_tstk of the data set I_tst to be detected is input into the model trained in step (1.2); after the feature pyramid network and the region extraction network, the region extraction network extracts positive candidate text regions; since the positive text quadrilaterals regressed on the different feature maps of the same test picture I_tstk usually overlap one another, a non-maximum suppression operation is then performed on the positions of all positive text quadrilaterals, with the following specific steps: 1) a predicted text bounding box is retained if and only if its text classification score P_rcnn ≥ 0.5; 2) non-maximum suppression with a Jaccard coefficient threshold of 0.2 is performed on the text boxes retained in the previous step to obtain the final retained positive text quadrilateral bounding boxes; fixed-scale features are then extracted from the filtered positive text quadrilateral bounding boxes and input to the multidirectional rectangular prediction network to predict Y_or = (Δx_or, Δy_or, Δh_or, Δw_or, θ), and the predicted multidirectional text bounding box is computed from the predicted center point coordinates, width, height and rotation angle of the multidirectional rectangle; according to the predicted multidirectional text bounding box, the multidirectional text features are rotated into horizontal features and input into the boundary point detection network, which predicts the regression quantities Y_bp = {(Δx_i, Δy_i) | i ∈ [0, 14)} of the 7 boundary points on each of the upper and lower boundaries; the coordinates of the boundary points in the horizontal box are calculated using the formula in (1.2.2) together with the 14 preset default boundary points, and the predicted boundary point coordinates are then rotated counterclockwise by θ using the predicted rotation angle of the multidirectional rectangle to obtain the positions of the boundary points in the original image.
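Steps 1) and 2) of this claim are a score threshold followed by non-maximum suppression at a Jaccard (IoU) threshold of 0.2; the sketch below uses axis-aligned boxes for simplicity, whereas the claim applies the operation to text quadrilaterals, so the overlap computation would differ in practice.

```python
import numpy as np

def jaccard(box, boxes):
    """IoU of one (x1, y1, x2, y2) box against an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def filter_text_boxes(boxes, scores, score_thr=0.5, iou_thr=0.2):
    """Keep boxes with score >= 0.5, then greedy NMS at Jaccard 0.2."""
    keep_mask = scores >= score_thr
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = np.argsort(scores)[::-1]
    kept = []
    while order.size > 0:
        i = order[0]
        kept.append(i)
        if order.size == 1:
            break
        ious = jaccard(boxes[i], boxes[order[1:]])
        order = order[1:][ious < iou_thr]
    return boxes[kept], scores[kept]
```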
10. The method for recognizing the scene text end-to-end based on the boundary point detection as claimed in claim 1 or 2, wherein the step (2.2) is specifically as follows:
the rectified text features have a resolution of 16 × 64; the feature map is input into the sequence recognition network to obtain a probability distribution sequence {p_0, p_1, ..., p_(N-1)}, where p_i denotes the probability distribution predicted by the RNN at each step and N denotes the maximum number of RNN steps; during testing, prediction stops when the predicted value at the k-th step is the stop symbol, so the probability distribution of the predicted sequence is {p_0, p_1, ..., p_(k-1)}; according to the probability distributions, the class with the maximum probability at each step is taken as the current predicted character, and the predicted character sequence S_q is finally obtained.
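The decoding in this claim (arg-max class per step, stopping at the stop symbol) can be sketched as follows; the ordering of the 62-character alphabet is an assumption, since the claim only lists the character set.

```python
import numpy as np

# Assumed character table: digits, lower-case, upper-case, then a stop symbol.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
STOP_ID = len(ALPHABET)                      # index 62

def greedy_decode(prob_seq):
    """prob_seq: (N, 63) per-step probability distributions.
    Take the arg-max class per step and stop at the stop symbol."""
    chars = []
    for probs in prob_seq:
        k = int(np.argmax(probs))
        if k == STOP_ID:
            break
        chars.append(ALPHABET[k])
    return "".join(chars)

dummy = np.random.dirichlet(np.ones(63), size=8)
print(greedy_decode(dummy))
```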
CN201911038568.1A 2019-10-29 2019-10-29 End-to-end scene text identification method based on boundary point detection Active CN110837835B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911038568.1A CN110837835B (en) 2019-10-29 2019-10-29 End-to-end scene text identification method based on boundary point detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911038568.1A CN110837835B (en) 2019-10-29 2019-10-29 End-to-end scene text identification method based on boundary point detection

Publications (2)

Publication Number Publication Date
CN110837835A true CN110837835A (en) 2020-02-25
CN110837835B CN110837835B (en) 2022-11-08

Family

ID=69575725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911038568.1A Active CN110837835B (en) 2019-10-29 2019-10-29 End-to-end scene text identification method based on boundary point detection

Country Status (1)

Country Link
CN (1) CN110837835B (en)

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111476235A (en) * 2020-03-31 2020-07-31 成都数之联科技有限公司 Method for synthesizing 3D curved surface text picture
CN111507333A (en) * 2020-04-21 2020-08-07 腾讯科技(深圳)有限公司 Image correction method and device, electronic equipment and storage medium
CN111553349A (en) * 2020-04-26 2020-08-18 佛山市南海区广工大数控装备协同创新研究院 Scene text positioning and identifying method based on full convolution network
CN111553361A (en) * 2020-03-19 2020-08-18 四川大学华西医院 Pathological section label identification method
CN111753714A (en) * 2020-06-23 2020-10-09 中南大学 Multidirectional natural scene text detection method based on character segmentation
CN111767921A (en) * 2020-06-30 2020-10-13 上海媒智科技有限公司 Express bill positioning and correcting method and device
CN111898570A (en) * 2020-08-05 2020-11-06 盐城工学院 Method for recognizing text in image based on bidirectional feature pyramid network
CN112036405A (en) * 2020-08-31 2020-12-04 浪潮云信息技术股份公司 Detection and identification method for handwritten document text
CN112070082A (en) * 2020-08-24 2020-12-11 西安理工大学 Curve character positioning method based on instance perception component merging network
CN112101359A (en) * 2020-11-11 2020-12-18 广州华多网络科技有限公司 Text formula positioning method, model training method and related device
CN112101355A (en) * 2020-09-25 2020-12-18 北京百度网讯科技有限公司 Method and device for detecting text in image, electronic equipment and computer medium
CN112183322A (en) * 2020-09-27 2021-01-05 成都数之联科技有限公司 Text detection and correction method for any shape
CN112200202A (en) * 2020-10-29 2021-01-08 上海商汤智能科技有限公司 Text detection method and device, electronic equipment and storage medium
CN112308051A (en) * 2020-12-29 2021-02-02 北京易真学思教育科技有限公司 Text box detection method and device, electronic equipment and computer storage medium
CN112446372A (en) * 2020-12-08 2021-03-05 电子科技大学 Text detection method based on channel grouping attention mechanism
CN112733822A (en) * 2021-03-31 2021-04-30 上海旻浦科技有限公司 End-to-end text detection and identification method
CN112765955A (en) * 2021-01-22 2021-05-07 中国人民公安大学 Cross-modal instance segmentation method under Chinese reference expression
CN112800801A (en) * 2021-02-03 2021-05-14 珠海格力电器股份有限公司 Method and device for recognizing pattern in image, computer equipment and storage medium
WO2021098861A1 (en) * 2019-11-21 2021-05-27 上海高德威智能交通系统有限公司 Text recognition method, apparatus, recognition device, and storage medium
CN113298167A (en) * 2021-06-01 2021-08-24 北京思特奇信息技术股份有限公司 Character detection method and system based on lightweight neural network model
CN113298054A (en) * 2021-07-27 2021-08-24 国际关系学院 Text region detection method based on embedded spatial pixel clustering
CN113343980A (en) * 2021-06-10 2021-09-03 西安邮电大学 Natural scene text detection method and system
CN113591864A (en) * 2021-07-28 2021-11-02 北京百度网讯科技有限公司 Training method, device and system for text recognition model framework
WO2021232464A1 (en) * 2020-05-20 2021-11-25 南京理工大学 Character offset detection method and system
CN113807336A (en) * 2021-08-09 2021-12-17 华南理工大学 Semi-automatic labeling method, system, computer equipment and medium for image text detection
CN114155540A (en) * 2021-11-16 2022-03-08 深圳市联洲国际技术有限公司 Character recognition method, device and equipment based on deep learning and storage medium
CN114266800A (en) * 2021-12-24 2022-04-01 中设数字技术股份有限公司 Multi-rectangular bounding box algorithm and generation system for graphs
CN115482538A (en) * 2022-11-15 2022-12-16 上海安维尔信息科技股份有限公司 Material label extraction method and system based on Mask R-CNN
CN116884013A (en) * 2023-07-21 2023-10-13 江苏方天电力技术有限公司 Text vectorization method of engineering drawing
CN116958981A (en) * 2023-05-31 2023-10-27 广东南方网络信息科技有限公司 Character recognition method and device
CN117975467A (en) * 2024-04-02 2024-05-03 华南理工大学 Bridge type end-to-end character recognition method
WO2024092484A1 (en) * 2022-11-01 2024-05-10 Boe Technology Group Co., Ltd. Computer-implemented object detection method, object detection apparatus, and computer-readable medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107977620A (en) * 2017-11-29 2018-05-01 华中科技大学 A kind of multi-direction scene text single detection method based on full convolutional network
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107977620A (en) * 2017-11-29 2018-05-01 华中科技大学 A kind of multi-direction scene text single detection method based on full convolutional network
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LUO et al.: "MORAN: A Multi-Object Rectified Attention Network for scene text recognition", PATTERN RECOGNITION *
ZHANG et al.: "Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes", PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION *

Cited By (50)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021098861A1 (en) * 2019-11-21 2021-05-27 上海高德威智能交通系统有限公司 Text recognition method, apparatus, recognition device, and storage medium
US11928872B2 (en) 2019-11-21 2024-03-12 Shanghai Goldway Intelligent Transportation System Co., Ltd. Methods and apparatuses for recognizing text, recognition devices and storage media
CN111553361A (en) * 2020-03-19 2020-08-18 四川大学华西医院 Pathological section label identification method
CN111476235A (en) * 2020-03-31 2020-07-31 成都数之联科技有限公司 Method for synthesizing 3D curved surface text picture
CN111476235B (en) * 2020-03-31 2023-04-25 成都数之联科技股份有限公司 Method for synthesizing 3D curved text picture
CN111507333A (en) * 2020-04-21 2020-08-07 腾讯科技(深圳)有限公司 Image correction method and device, electronic equipment and storage medium
CN111507333B (en) * 2020-04-21 2023-09-15 腾讯科技(深圳)有限公司 Image correction method and device, electronic equipment and storage medium
CN111553349B (en) * 2020-04-26 2023-04-18 佛山市南海区广工大数控装备协同创新研究院 Scene text positioning and identifying method based on full convolution network
CN111553349A (en) * 2020-04-26 2020-08-18 佛山市南海区广工大数控装备协同创新研究院 Scene text positioning and identifying method based on full convolution network
WO2021232464A1 (en) * 2020-05-20 2021-11-25 南京理工大学 Character offset detection method and system
CN111753714A (en) * 2020-06-23 2020-10-09 中南大学 Multidirectional natural scene text detection method based on character segmentation
CN111753714B (en) * 2020-06-23 2023-09-01 中南大学 Multidirectional natural scene text detection method based on character segmentation
CN111767921A (en) * 2020-06-30 2020-10-13 上海媒智科技有限公司 Express bill positioning and correcting method and device
CN111898570A (en) * 2020-08-05 2020-11-06 盐城工学院 Method for recognizing text in image based on bidirectional feature pyramid network
CN112070082A (en) * 2020-08-24 2020-12-11 西安理工大学 Curve character positioning method based on instance perception component merging network
CN112070082B (en) * 2020-08-24 2023-04-07 西安理工大学 Curve character positioning method based on instance perception component merging network
CN112036405A (en) * 2020-08-31 2020-12-04 浪潮云信息技术股份公司 Detection and identification method for handwritten document text
CN112101355B (en) * 2020-09-25 2024-04-02 北京百度网讯科技有限公司 Method and device for detecting text in image, electronic equipment and computer medium
CN112101355A (en) * 2020-09-25 2020-12-18 北京百度网讯科技有限公司 Method and device for detecting text in image, electronic equipment and computer medium
CN112183322A (en) * 2020-09-27 2021-01-05 成都数之联科技有限公司 Text detection and correction method for any shape
CN112183322B (en) * 2020-09-27 2022-07-19 成都数之联科技股份有限公司 Text detection and correction method for any shape
CN112200202A (en) * 2020-10-29 2021-01-08 上海商汤智能科技有限公司 Text detection method and device, electronic equipment and storage medium
CN112101359A (en) * 2020-11-11 2020-12-18 广州华多网络科技有限公司 Text formula positioning method, model training method and related device
CN112446372A (en) * 2020-12-08 2021-03-05 电子科技大学 Text detection method based on channel grouping attention mechanism
CN112308051A (en) * 2020-12-29 2021-02-02 北京易真学思教育科技有限公司 Text box detection method and device, electronic equipment and computer storage medium
CN112308051B (en) * 2020-12-29 2021-10-29 北京易真学思教育科技有限公司 Text box detection method and device, electronic equipment and computer storage medium
CN112765955B (en) * 2021-01-22 2023-05-26 中国人民公安大学 Cross-modal instance segmentation method under Chinese finger representation
CN112765955A (en) * 2021-01-22 2021-05-07 中国人民公安大学 Cross-modal instance segmentation method under Chinese reference expression
CN112800801A (en) * 2021-02-03 2021-05-14 珠海格力电器股份有限公司 Method and device for recognizing pattern in image, computer equipment and storage medium
CN112800801B (en) * 2021-02-03 2022-11-11 珠海格力电器股份有限公司 Method and device for recognizing pattern in image, computer equipment and storage medium
CN112733822A (en) * 2021-03-31 2021-04-30 上海旻浦科技有限公司 End-to-end text detection and identification method
CN112733822B (en) * 2021-03-31 2021-07-27 上海旻浦科技有限公司 End-to-end text detection and identification method
CN113298167A (en) * 2021-06-01 2021-08-24 北京思特奇信息技术股份有限公司 Character detection method and system based on lightweight neural network model
CN113343980B (en) * 2021-06-10 2023-06-09 西安邮电大学 Natural scene text detection method and system
CN113343980A (en) * 2021-06-10 2021-09-03 西安邮电大学 Natural scene text detection method and system
CN113298054A (en) * 2021-07-27 2021-08-24 国际关系学院 Text region detection method based on embedded spatial pixel clustering
CN113298054B (en) * 2021-07-27 2021-10-08 国际关系学院 Text region detection method based on embedded spatial pixel clustering
CN113591864A (en) * 2021-07-28 2021-11-02 北京百度网讯科技有限公司 Training method, device and system for text recognition model framework
CN113807336A (en) * 2021-08-09 2021-12-17 华南理工大学 Semi-automatic labeling method, system, computer equipment and medium for image text detection
CN113807336B (en) * 2021-08-09 2023-06-30 华南理工大学 Semi-automatic labeling method, system, computer equipment and medium for image text detection
CN114155540B (en) * 2021-11-16 2024-05-03 深圳市联洲国际技术有限公司 Character recognition method, device, equipment and storage medium based on deep learning
CN114155540A (en) * 2021-11-16 2022-03-08 深圳市联洲国际技术有限公司 Character recognition method, device and equipment based on deep learning and storage medium
CN114266800B (en) * 2021-12-24 2023-05-05 中设数字技术股份有限公司 Method and system for generating multiple rectangular bounding boxes of plane graph
CN114266800A (en) * 2021-12-24 2022-04-01 中设数字技术股份有限公司 Multi-rectangular bounding box algorithm and generation system for graphs
WO2024092484A1 (en) * 2022-11-01 2024-05-10 Boe Technology Group Co., Ltd. Computer-implemented object detection method, object detection apparatus, and computer-readable medium
CN115482538A (en) * 2022-11-15 2022-12-16 上海安维尔信息科技股份有限公司 Material label extraction method and system based on Mask R-CNN
CN116958981B (en) * 2023-05-31 2024-04-30 广东南方网络信息科技有限公司 Character recognition method and device
CN116958981A (en) * 2023-05-31 2023-10-27 广东南方网络信息科技有限公司 Character recognition method and device
CN116884013A (en) * 2023-07-21 2023-10-13 江苏方天电力技术有限公司 Text vectorization method of engineering drawing
CN117975467A (en) * 2024-04-02 2024-05-03 华南理工大学 Bridge type end-to-end character recognition method

Also Published As

Publication number Publication date
CN110837835B (en) 2022-11-08

Similar Documents

Publication Publication Date Title
CN110837835B (en) End-to-end scene text identification method based on boundary point detection
CN108549893B (en) End-to-end identification method for scene text with any shape
US10762376B2 (en) Method and apparatus for detecting text
WO2020108311A1 (en) 3d detection method and apparatus for target object, and medium and device
Ma et al. ReLaText: Exploiting visual relationships for arbitrary-shaped scene text detection with graph convolutional networks
Zang et al. Vehicle license plate recognition using visual attention model and deep learning
Rekha et al. Hand gesture recognition for sign language: A new hybrid approach
CN111488826A (en) Text recognition method and device, electronic equipment and storage medium
Chiang et al. Recognizing text in raster maps
CN112541491B (en) End-to-end text detection and recognition method based on image character region perception
CN112446370B (en) Method for identifying text information of nameplate of power equipment
CN110598690A (en) End-to-end optical character detection and identification method and system
Cao et al. Robust vehicle detection by combining deep features with exemplar classification
CN112766184A (en) Remote sensing target detection method based on multi-level feature selection convolutional neural network
CN111476210A (en) Image-based text recognition method, system, device and storage medium
Wang et al. Spatially prioritized and persistent text detection and decoding
Katper et al. Deep neural networks combined with STN for multi-oriented text detection and recognition
Zhang et al. A vertical text spotting model for trailer and container codes
CN113420648B (en) Target detection method and system with rotation adaptability
Mohammad et al. Contour-based character segmentation for printed Arabic text with diacritics
Ghadhban et al. Segments interpolation extractor for finding the best fit line in Arabic offline handwriting recognition words
Turk et al. Computer vision for mobile augmented reality
CN114330247A (en) Automatic insurance clause analysis method based on image recognition
CN111476226B (en) Text positioning method and device and model training method
Shi et al. Fuzzy support tensor product adaptive image classification for the internet of things

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant