CN111476226A - Text positioning method and device and model training method - Google Patents


Info

Publication number
CN111476226A
CN111476226A (application CN202010132023.3A)
Authority
CN
China
Prior art keywords
output
sampling
convolution
unit
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010132023.3A
Other languages
Chinese (zh)
Other versions
CN111476226B (en)
Inventor
尹世豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Big Data Technologies Co Ltd
Original Assignee
New H3C Big Data Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New H3C Big Data Technologies Co Ltd filed Critical New H3C Big Data Technologies Co Ltd
Priority to CN202010132023.3A
Publication of CN111476226A
Application granted
Publication of CN111476226B
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a text positioning method, a text positioning device, and a model training method for improving the localization of text lines in pictures. The invention provides a text positioning model based on a fully convolutional neural network that integrates residual networks, transposed convolution, feature fusion, batch normalization, and related techniques, enhancing the model's representational capability. By predicting a two-dimensional Gaussian distribution map of the text line regions, the provided method, model, and training method can locate text line regions in an image and accurately position horizontal and arbitrarily inclined text lines against complex backgrounds.

Description

Text positioning method and device and model training method
Technical Field
The invention relates to the technical field of artificial intelligence image-text processing, in particular to a text positioning method, a text positioning device and a model training method.
Background
Optical character recognition (OCR) is a technology that recognizes character information in an image through image processing, and is widely applied in fields such as certificate recognition, license plate recognition, and the digitization of paper documents. A complete OCR pipeline is generally divided into two steps: character positioning and character recognition. Character positioning means accurately locating the coordinates of characters in an image; character recognition is the process of identifying which characters appear in the located image region. The accuracy of character positioning directly affects the precision of subsequent character recognition. Depending on the recognition model used downstream, character positioning is generally divided into single-character positioning and text line positioning.
Text line detection can be built on a general object detection algorithm such as the Faster Region-based Convolutional Neural Network (Faster R-CNN). The basic steps of treating text detection as object detection are: input the image into a convolutional neural network to extract a feature map; feed the feature map into a Region Proposal Network (RPN) to generate candidate boxes; convert the candidate boxes into fixed-size feature maps through Region of Interest (ROI) pooling; and finally judge in turn whether each candidate box is a text region. Such a detection model must define candidate boxes of different sizes in advance to adapt to target areas of different sizes. Because the sizes and lengths of text lines in real picture samples vary widely, a predefined set of candidate boxes can hardly cover all cases, so text line localization accuracy is poor. For skewed text lines, the detection result may also include too many irrelevant regions.
The text line positioning algorithm based on the Connectionist Text Proposal Network (CTPN) predicts the vertical position of text based on the idea of differentiation in mathematics. The algorithm flow is: first extract the output of the 5th convolutional layer of the VGG16 network as a feature map; then extract features with a 3x3 sliding window and feed them into a bidirectional Long Short-Term Memory network (LSTM) that outputs 512-dimensional feature vectors; next obtain text box positions through classification and regression; and finally obtain the target box containing the text line through a text line construction algorithm.
Disclosure of Invention
The invention provides a text positioning method, a text positioning device and a model training method, which are used for improving the positioning effect of text lines in pictures.
Based on the embodiment of the invention, the invention provides a text positioning device comprising a down-sampling module, an up-sampling module, and an output layer module;
the down-sampling module consists of a trunk unit and N down-sampling units:
the trunk unit is composed of several convolutional layers and extracts low-level features from the input picture; the feature map output by the trunk unit serves as the input of the first down-sampling unit, and is also fused with the output feature map of the (M-1)-th up-sampling unit to form the input of the M-th up-sampling unit;
each down-sampling unit down-samples its input feature map, shrinking the output feature map's width and height proportionally relative to the input; except for the N-th down-sampling unit, the output of each down-sampling unit serves as the input of the next down-sampling unit, and the output of the last down-sampling unit serves as the input of the first up-sampling unit;
the up-sampling module consists of M up-sampling units:
each up-sampling unit up-samples its input feature map, enlarging the output feature map's width and height proportionally relative to the input; the enlargement ratio of an up-sampling unit matches the reduction ratio of a down-sampling unit; except for the up-sampling unit connected to the output layer module in the up-sampling path, the feature map output by each up-sampling unit is fused with the feature map of the same dimensions output in the down-sampling path (by the trunk unit or by any down-sampling unit other than the N-th) and the result serves as the input of the next up-sampling unit; the output of the M-th up-sampling unit serves as the input of the output layer module;
the output layer module is composed of several convolutional layers that gradually reduce the number of channels of the feature map; its output is a two-dimensional Gaussian distribution map.
Based on the embodiment of the present invention, further, the down-sampling unit includes a third convolution subunit near the input side, a second convolution subunit in the middle, and a first convolution subunit near the output side:
the first and second convolution subunits do not change the width and height of the feature map, while the third convolution subunit reduces both to 1/2;
the first convolution subunit comprises two paths: one contains three convolutional layers, the other contains one convolutional layer and an average pooling layer; the outputs of the two paths are fused by bitwise addition to give the output of the first convolution subunit;
the second convolution subunit comprises two paths: one contains three convolutional layers, the other contains no operator at all; the two paths are fused by bitwise addition to give the output of the second convolution subunit;
the third convolution subunit comprises two paths: one consists of a convolutional layer and a transposed convolutional layer, the other consists of a convolutional layer and a nearest-neighbor up-sampling layer; the two paths are fused by bitwise addition to give the output of the third convolution subunit;
the output of the third convolution subunit serves as the input of the second, the output of the second serves as the input of the first, and the output of the first convolution subunit is the output of the down-sampling unit.
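The second convolution subunit above follows the classic residual skip-connection pattern: a transform path added element-wise to an identity path. A minimal sketch with toy 1-D data and a stand-in transform (not the patent's actual three-convolution stack):

```python
def residual_block(x, transform):
    """Skip-connection pattern of the second convolution subunit: one path
    applies a transform (standing in for three convolutional layers), the
    other path is the identity, and the two are fused by element-wise
    (bitwise) addition."""
    fx = transform(x)
    assert len(fx) == len(x)  # both paths must keep identical dimensions
    return [a + b for a, b in zip(fx, x)]

# Toy 1-D "feature map" and a stand-in transform for the convolution path.
x = [1.0, 2.0, 3.0]
out = residual_block(x, lambda v: [0.1 * e for e in v])
# out is approximately [1.1, 2.2, 3.3]; gradients can always flow through
# the untouched identity path, which eases training of deep networks.
```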
Based on the embodiment of the invention, further, the up-sampling unit comprises two paths: one contains an upper and a lower convolutional layer with a transposed convolutional layer in between; the other contains a nearest-neighbor up-sampling layer and a convolutional layer. The convolutional layers in the up-sampling unit scale the number of channels of the feature map, while the transposed convolutional layer and the nearest-neighbor up-sampling layer expand its width and height; the feature map output by the up-sampling unit is twice the height and width of its input.
Based on the embodiment of the invention, further, the output layer module is composed of several convolutional layers that gradually reduce the number of channels of the feature map while introducing nonlinearity, which strengthens the ability to characterize feature combinations; the output of the last convolutional layer in the output layer module is processed by an activation function that maps values into the interval (0, 1), and the module then outputs the two-dimensional Gaussian distribution map.
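The patent does not name the activation function that maps outputs into (0, 1); the sigmoid below is an assumption chosen for illustration, since it is the standard choice for this mapping:

```python
import math

def sigmoid(x):
    """Map a raw score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Raw outputs of the last convolutional layer (illustrative values).
raw = [-3.0, 0.0, 3.0]
gaussian_map_row = [sigmoid(v) for v in raw]
# Every value now lies strictly between 0 and 1, as the Gaussian map requires.
```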
Based on the embodiment of the present invention, further, except for the last convolutional layer in the output layer module, the outputs of the other convolutional layers of the output layer module, and of all convolutional layers contained in the trunk unit, the down-sampling units, and the up-sampling units, are processed by batch normalization followed by an activation function.
Based on the embodiment of the invention, the invention also provides a text positioning method, which comprises the following steps:
extracting low-level features of the picture to be predicted using a trunk unit, and down-sampling the feature map output by the trunk unit step by step using several down-sampling units so as to shrink the feature map proportionally;
up-sampling the feature maps output by the trunk unit and the down-sampling units using several up-sampling units so as to enlarge the feature maps proportionally, fusing the same-dimension feature maps output in the down-sampling path layer by layer into the up-sampling path;
reducing the number of channels of the feature map output by the last up-sampling unit layer by layer using several convolutional layers, mapping the output values into the interval (0, 1) with an activation function at the last convolutional layer, and outputting a two-dimensional Gaussian prediction result map;
and determining the region where each text line is located from the brightness values of the pixels in the two-dimensional Gaussian prediction result map.
Based on the embodiment of the invention, further, a skip-connection structure from residual networks is introduced into the down-sampling unit, together with an additional average pooling path; in the up-sampling unit, two ways of enlarging the feature map, transposed convolution and nearest-neighbor up-sampling, are fused.
Based on the embodiment of the present invention, further, the step of determining the region where a text line is located from the brightness values of the pixels in the two-dimensional Gaussian prediction result map includes:
binarizing the two-dimensional Gaussian prediction result map according to a preset threshold, setting pixels greater than or equal to the threshold to 1 and pixels smaller than the threshold to 0, to obtain a binarization result map;
performing connected-domain analysis on the binarization result map to obtain the connected regions of text pixels;
and determining a minimum enclosing rectangle for each connected region to obtain the rectangle's four vertices, from which the text line region is determined.
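The steps above can be sketched in plain Python. This sketch uses BFS flood fill for the connected-domain analysis and returns axis-aligned boxes in place of the minimum enclosing rectangle (computing the true minimum-area rotated rectangle is omitted for brevity):

```python
from collections import deque

def locate_text_lines(pred, threshold=0.5):
    """Binarize a 2-D Gaussian prediction map, find 4-connected regions of
    text pixels, and return each region's axis-aligned bounding box as
    (min_x, min_y, max_x, max_y)."""
    h, w = len(pred), len(pred[0])
    # Step 1: binarize against the preset threshold.
    binary = [[1 if pred[y][x] >= threshold else 0 for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                # Step 2: BFS flood fill over one connected region.
                q, ys, xs = deque([(y, x)]), [], []
                seen[y][x] = True
                while q:
                    cy, cx = q.popleft()
                    ys.append(cy); xs.append(cx)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = cy + dy, cx + dx
                        if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                # Step 3: bounding box of the region (axis-aligned stand-in).
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes

pred = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0, 0.0],
    [0.0, 0.7, 0.9, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0, 0.7],
]
boxes = locate_text_lines(pred)
# Two bright regions are found: one 2x2 block and one 1x2 column.
```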
Based on the embodiment of the present invention, further, the down-sampling unit down-samples by a factor of 1/2, and the corresponding up-sampling unit up-samples by a factor of 2.
Based on the embodiment of the invention, the invention also provides a training method for the text positioning model, comprising the following steps:
initializing the parameters of the text positioning model, and setting the loss function adopted in the training stage and the optimization algorithm used in the optimization stage;
inputting a training sample picture into the text positioning model, where the trunk unit extracts the picture's low-level features, and several down-sampling units down-sample the feature map output by the trunk unit step by step to shrink it proportionally;
several up-sampling units in the text positioning model up-sample the feature maps output by the trunk unit and the down-sampling units to enlarge them proportionally, fusing the same-dimension feature maps output in the down-sampling path layer by layer into the up-sampling path;
several convolutional layers in the text positioning model reduce the number of channels of the feature map output by the up-sampling units layer by layer, mapping the output values into the interval (0, 1) with an activation function at the last convolutional layer to output a two-dimensional Gaussian distribution map;
and updating the parameters of the text positioning model with the model optimization algorithm according to the error between the two-dimensional Gaussian distribution map output by the model and the label map, then inputting a new training sample picture for the next iteration, and saving the model file once iteration terminates.
Based on the embodiment of the invention, further, during training the mean squared error is used as the loss function of the text positioning model, i.e., the loss is the sum of the squared differences between the two-dimensional Gaussian distribution map output by the model and the label map; in the back-propagation process, a gradient descent optimization algorithm or the Adam optimization algorithm is adopted.
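A minimal sketch of the loss computation (the mean rather than the raw sum of squared differences is used here, a common normalization that only rescales the loss by a constant):

```python
def mse_loss(pred, label):
    """Mean squared error between the predicted 2-D Gaussian map and the
    label map, both given as equal-sized nested lists of pixel values."""
    n = 0
    total = 0.0
    for prow, lrow in zip(pred, label):
        for p, l in zip(prow, lrow):
            total += (p - l) ** 2
            n += 1
    return total / n

pred  = [[0.9, 0.1], [0.2, 0.8]]
label = [[1.0, 0.0], [0.0, 1.0]]
loss = mse_loss(pred, label)  # (0.01 + 0.01 + 0.04 + 0.04) / 4 = 0.025
```

During training this scalar would be minimized by gradient descent or Adam, updating the model parameters each iteration.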
Based on the embodiment of the invention, the training method further comprises:
increasing the number of training sample pictures by having code randomly render text lines onto background pictures, and training the text positioning model on them.
Based on the embodiments above, an embodiment of the present invention further provides a text positioning apparatus comprising a processor and a non-transitory storage medium; the processor reads and executes instruction code from the non-transitory storage medium to implement the aforementioned text positioning method.
According to the technical scheme above, the invention provides a text positioning model based on a fully convolutional neural network that integrates residual networks, transposed convolution, feature fusion, batch normalization, and related techniques, enhancing the model's representational capability. By predicting a two-dimensional Gaussian distribution map of the text line regions, the provided method, model, and training method can locate text line regions in an image and accurately position horizontal and arbitrarily inclined text lines against complex backgrounds.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. It is obvious that the drawings described below cover only some embodiments of the present invention, and that those skilled in the art may derive other drawings from them.
FIG. 1 is a schematic diagram of a training sample and label graph provided in accordance with an embodiment of the present invention;
FIG. 2 is a diagram illustrating a structure of a text positioning model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a trunk unit in a downsampling module according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a down-sampling unit in a down-sampling module according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an upsampling unit in an upsampling module according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an output layer module according to an embodiment of the present invention;
FIG. 7 is a flowchart illustrating steps of a text positioning method according to an embodiment of the present invention;
fig. 8 is a schematic diagram of a hardware structure of a text positioning device according to an embodiment of the present invention.
Detailed Description
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the invention. As used in the examples and claims of the present invention, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The term "and/or" as used herein is meant to encompass any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used to describe various information in embodiments of the present invention, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of embodiments of the present invention. Depending on the context, moreover, the word "if" as used may be interpreted as "upon" or "when" or "in response to a determination".
The advent of the CTPN algorithm improved the performance of general object detection algorithms in the text line positioning task: CTPN locates a text region by predicting the vertical positions of target boxes, but for inclined text lines the accuracy of the constructed text line region is still not high.
The invention provides a new text line positioning model based on a fully convolutional neural network; the model structure integrates residual networks, transposed convolution, feature fusion, batch normalization, and related technical features, enhancing the model's representational capability. The model locates text line regions in an image by predicting a Gaussian map of those regions, and can accurately position horizontal and arbitrarily inclined text lines against complex backgrounds.
Before a supervised deep learning model is formally applied to an application scenario, a large number of training samples are required to train it so as to fix its parameters; once the model file is fixed, it can be deployed to equipment for text detection and positioning in that scenario.
The text line positioning model provided by the invention likewise requires a certain number of labeled training samples. Because the model is applied to locating text lines in pictures and outputs a two-dimensional Gaussian map, the label map produced by annotating a training sample is also a two-dimensional Gaussian map, with pixel values ranging from 0 to 1.
Fig. 1 is a schematic diagram of a training sample and its label map provided in an embodiment of the present invention. The left picture is a training sample input to the model; the right picture is the label map generated after annotating the text lines in the training sample. Each text line area in the label map is annotated as a two-dimensional Gaussian distribution: the brighter an area in the label map, the closer its pixel values are to 1, i.e., the brighter areas in the label map correspond to the areas where text lines lie in the original image.
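A label map of this kind can be rendered by placing a two-dimensional Gaussian over each annotated text line region. A minimal sketch for one region (the center coordinates and standard deviations are illustrative parameters, not values from the patent):

```python
import math

def gaussian_label(h, w, cy, cx, sigma_y, sigma_x):
    """Render one text line region as a 2-D Gaussian label map: pixel values
    peak at 1.0 at the region center (cy, cx) and fall off toward 0 at the
    edges, giving the 0-1 value range the label maps require."""
    return [[math.exp(-(((y - cy) ** 2) / (2 * sigma_y ** 2)
                        + ((x - cx) ** 2) / (2 * sigma_x ** 2)))
             for x in range(w)] for y in range(h)]

# A wide, flat Gaussian suits a horizontal text line (sigma_x > sigma_y).
label = gaussian_label(h=5, w=9, cy=2, cx=4, sigma_y=1.0, sigma_x=2.0)
# The center pixel is the brightest (value 1.0); all values lie in [0, 1].
```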
Fig. 2 is a schematic diagram of a text positioning model structure according to an embodiment of the present invention. The text positioning model provided by the invention can be realized in software or hardware, and each functional module or unit in the model can correspond to a functional chip or a programmable logic unit in a hardware device. The text positioning model provided by the invention can therefore also be called a text positioning device: when the model is solidified in or installed on a hardware device, that device becomes a text positioning device capable of achieving the purpose of the invention.
Taking the embodiment of fig. 2 as an example, the text positioning model 200 provided by the present invention includes 3 modules: a downsampling module 210, an upsampling module 220, and an output layer module 230.
The down-sampling module consists of a trunk unit (Stem) and N down-sampling units (DownBlock):
the trunk unit is composed of several convolutional layers and extracts low-level features from the input picture; the feature map output by the trunk unit serves as the input of the first down-sampling unit, and is also fused with the output feature map of the (M-1)-th up-sampling unit to form the input of the M-th up-sampling unit;
each down-sampling unit down-samples its input feature map, shrinking the output feature map's width and height proportionally relative to the input; except for the N-th down-sampling unit, the output of each down-sampling unit serves as the input of the next down-sampling unit, and the output of the last down-sampling unit serves as the input of the first up-sampling unit;
the up-sampling module consists of M up-sampling units (UpBlock):
each up-sampling unit up-samples its input feature map, enlarging the output feature map's width and height proportionally relative to the input; the enlargement ratio of an up-sampling unit matches the reduction ratio of a down-sampling unit. In the invention, the processing path of the down-sampling units is called the down-sampling path, and the processing path of the up-sampling units the up-sampling path. Except for the up-sampling unit connected to the output layer module, the feature map output by each up-sampling unit is fused with the feature map of the same dimensions output in the down-sampling path (by the trunk unit or by any down-sampling unit other than the N-th) and the result serves as the input of the next up-sampling unit; the output of the M-th up-sampling unit serves as the input of the output layer module;
the output layer module is composed of several convolutional layers that gradually reduce the number of channels of the feature map; its output is a two-dimensional Gaussian distribution map.
The fusion of two feature maps/feature vectors in the invention may proceed as follows: two feature maps of identical dimensions add their elements at corresponding positions (i.e., bitwise addition) to produce a new feature map, whose dimensions are unchanged by the fusion. Taking the feature maps output by the trunk unit 215 and the fourth up-sampling unit 224 in Fig. 2 as an example, the "+" in the figure denotes that the two feature maps are fused, and the fused feature map serves as the input of the fifth up-sampling unit 225.
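The bitwise-addition fusion described above can be sketched as:

```python
def fuse(a, b):
    """Fuse two feature maps of identical dimensions by element-wise
    (bitwise) addition; the fused map keeps the same dimensions."""
    assert len(a) == len(b) and len(a[0]) == len(b[0])
    return [[a[y][x] + b[y][x] for x in range(len(a[0]))] for y in range(len(a))]

# Toy stand-ins for the trunk-unit output and the 4th up-sampling unit output.
stem_out    = [[1.0, 2.0], [3.0, 4.0]]
upblock_out = [[0.5, 0.5], [0.5, 0.5]]
fused = fuse(stem_out, upblock_out)  # would feed the 5th up-sampling unit
```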
The invention does not limit the number of down-sampling units in the down-sampling path or the number of up-sampling units in the up-sampling path; they can be chosen according to the practical application scenario and the required output quality, as long as the matching logic between down-sampling and up-sampling units and the output requirements are satisfied.
Let b denote the number of samples input to the model and h and w the height and width of the image. Since the training sample pictures input to the model have 3 channels, the dimensions of the model input data can be written as (b, h, w, 3), and the model output is a two-dimensional Gaussian distribution map of dimensions (b, h, w, 1).
In the embodiment shown in Fig. 2, the down-sampling module contains 4 DownBlocks and the up-sampling module contains 5 UpBlocks. Input data of dimensions (b, h, w, 3), after being processed by the Stem and the 4 DownBlocks, yields a feature map of dimensions (b, h/32, w/32, 1024); the 5 UpBlocks then up-sample this feature map to (b, h, w, 32); finally the output layer module ConvBlock reduces the number of channels to 1, giving the two-dimensional Gaussian distribution map of dimensions (b, h, w, 1).
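The spatial dimension flow can be checked with simple arithmetic, assuming the Stem halves the spatial size once and each DownBlock halves it once more (five halvings in total, matching the h/32 x w/32 figure; channel counts are omitted):

```python
def shape_flow(h, w):
    """Trace the (height, width) of the feature map through the FIG. 2
    embodiment: Stem + DownBlocks 1-4 each halve the size, UpBlocks 1-5
    each double it back."""
    trace = [(h, w)]
    for _ in range(5):          # Stem + DownBlock 1-4
        h, w = h // 2, w // 2
        trace.append((h, w))
    for _ in range(5):          # UpBlock 1-5
        h, w = h * 2, w * 2
        trace.append((h, w))
    return trace

trace = shape_flow(256, 256)
# After five halvings the bottleneck is 256/32 = 8 on each side, and the
# five doublings restore the original 256 x 256 resolution.
```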
Fig. 3 is a schematic diagram of the structure of the trunk unit 215 in the down-sampling module 210 of the embodiment of Fig. 2. The trunk unit 215 is composed of 3 convolutional layers that extract low-level features from the image. Taking the first convolutional layer, Conv 32 (3x3, s=2), as an example: the number 32 after Conv denotes the number of convolution kernels used in the convolution operation, 3x3 denotes the kernel size, and s=2 denotes a stride of 2 as the kernel slides over the feature map. The notation in the other figures carries the same meaning and is not repeated.
Fig. 4 is a schematic structural diagram of the down-sampling units 211-214 in the down-sampling module 210 of the embodiment of Fig. 2. A down-sampling unit down-samples the feature map, reducing its height and width to 1/2 after each pass; n in DownBlocks 1-4 takes the values 64, 128, 256, and 512, respectively.
In the embodiment of the invention, when training a convolutional neural network with many layers, the training effect often suffers from attenuation during gradient back-propagation. To improve the efficiency of training the deep convolutional network, a skip-connection structure from residual networks is introduced into the down-sampling unit (a path with no convolutional layer, or with only one 1x1 convolution for raising or lowering the dimension); meanwhile, to reduce the potential loss of features during down-sampling, an additional average pooling path is added wherever the feature map size is reduced.
The down-sampling unit comprises three convolution subunits, namely a first convolution subunit, a second convolution subunit and a third convolution subunit from the output side. The first convolution subunit and the second convolution subunit do not change the width and the height of the feature map, and the third convolution subunit reduces the width and the height of the feature map to 1/2;
the first convolution subunit comprises two paths, one path comprises three convolution layers, the other path comprises one convolution layer and an average pooling layer, and the outputs of the two paths are subjected to bitwise addition and fusion to obtain the output of the first convolution subunit;
the second convolution subunit comprises two paths, one path comprises three convolution layers, the other path does not comprise any operator, and the two paths are subjected to bitwise addition and fusion to obtain the output of the second convolution subunit;
the third convolution subunit comprises two paths, one path is composed of a convolution layer and a transposed convolution layer, the other path is composed of a convolution layer and a nearest neighbor upsampling layer, and the two paths are subjected to bitwise addition and fusion to obtain the output of the third convolution subunit.
The output of the third convolution subunit is used as the input of the second convolution subunit, the output of the second convolution subunit is used as the input of the first convolution subunit, and the output of the first convolution subunit is the output of the downsampling unit.
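The data flow through the three convolution subunits can be illustrated with a minimal NumPy sketch. The learned convolutions are replaced here by placeholder operations (average pooling for the size-reducing paths, identity elsewhere), so the sketch shows only the third→second→first ordering and the bitwise-addition fusion of the two paths in each subunit, not actual filters:

```python
import numpy as np

def halve(x):
    """Stand-in for a size-reducing path: 2x2 average pooling with stride 2."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def down_sampling_unit(x):
    # Third subunit (input side): both paths halve H and W; outputs fused by addition.
    third = halve(x) + halve(x)
    # Second subunit: conv path (placeholder: identity) + skip path with no operator.
    second = third + third
    # First subunit: conv path + size-preserving pooling path (placeholder: identity).
    first = second + second
    return first

x = np.ones((8, 8))
y = down_sampling_unit(x)   # H and W halved once, by the third subunit only
```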
FIG. 5 is a schematic structural diagram of the up-sampling units 221-225 in the up-sampling module 220 in the embodiment of FIG. 2. The up-sampling unit is used for increasing the size of the feature map; the width and height of the feature map are doubled after each up-sampling. Since an excessive up-sampling factor makes the details in the final result coarse, this embodiment adopts 2× amplification, which preserves detail information after up-sampling to the greatest extent.
There are various up-sampling methods. Nearest neighbor up-sampling is fast, but it has difficulty restoring the detailed features of the picture. Transposed convolution up-sampling contains learnable parameters that can learn detail features during training, but it may introduce noise information. To obtain a better up-sampling result, the up-sampling unit fuses the two ways of increasing the feature map size, namely transposed convolution and nearest neighbor up-sampling, and the final effect is better than either up-sampling mode alone.
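The two up-sampling modes can be contrasted with a minimal NumPy sketch: nearest neighbor repetition, and the stride-2 zero-insertion stage that precedes the learned kernel in a transposed convolution (the kernel itself is omitted for brevity), with the two paths fused by bitwise addition as in the figure:

```python
import numpy as np

def nearest_upsample_2x(x):
    """Nearest-neighbor 2x upsampling: repeat each pixel along both axes."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def zero_insert_2x(x):
    """Stride-2 zero insertion ('bed of nails'), the first stage of a
    stride-2 transposed convolution before the learned kernel is applied."""
    h, w = x.shape
    out = np.zeros((2 * h, 2 * w), dtype=x.dtype)
    out[::2, ::2] = x
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])
fused = nearest_upsample_2x(x) + zero_insert_2x(x)  # bitwise-addition fusion
```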
The values of n used for calculating the number of convolution kernels in UpBlocks 1-5 in the embodiment of FIG. 2 are 512, 256, 128, 64 and 32, respectively. The value of n directly affects the number of parameters in the model, i.e. the complexity of the model; it is determined after weighing the complexity against the characterization capability of the model. The invention does not limit the value of n for the convolutional layers in the up-sampling unit; in practical applications it can be determined by balancing characterization capability against model complexity according to the actual situation. In this embodiment, each up-sampling unit UpBlock includes two paths: one path is configured with two convolutional layers and one transposed convolutional layer (ConvTrans), where the upper and lower layers are convolutional layers and the middle layer is the transposed convolutional layer; the other path includes a nearest neighbor up-sampling layer (denoted as Upsample in the figure) and a convolutional layer. The convolutional layers in the up-sampling unit are used for scaling the number of channels of the feature map, while the transposed convolution and nearest neighbor up-sampling are used for expanding the width and height of the feature map.
Fig. 6 is a schematic structural diagram of the output layer module 230 in the embodiment of fig. 2. The output layer module 230 is composed of a plurality of convolutional layers and is used for gradually reducing the number of channels of the feature map while introducing nonlinearity, increasing the feature combination characterization capability, and finally obtaining the output of the model. The output of the last convolutional layer in the output layer module 230 is processed by an activation function, such as a sigmoid function, which maps the output values into the (0, 1) interval before the two-dimensional Gaussian distribution map is output.
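The final sigmoid mapping can be sketched as follows (illustrative only; the logits here are arbitrary placeholder values):

```python
import numpy as np

def sigmoid(z):
    """Map raw convolution outputs into (0, 1), as the last layer of the
    output module does before emitting the two-dimensional Gaussian map."""
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([[-4.0, 0.0], [2.0, 6.0]])  # hypothetical raw outputs
gauss_map = sigmoid(logits)                   # every value now in (0, 1)
```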
It should be noted that, in the drawings above, except for the last convolutional layer in the output layer module 230, the outputs of all other convolutional layers of the output layer module, and of all convolutional layers contained in the trunk unit 215, the down-sampling units 211-214 and the up-sampling units 221-225, are processed by BN (Batch Normalization) and an activation function (e.g., the ReLU function).
Fig. 7 is a flowchart of the steps of a text positioning method provided by the present invention. The method is implemented on the basis of the text positioning model provided by the present invention to position text lines in a prediction map in an application scene. Built on a fully convolutional neural network, it integrates residual networks, transposed convolution, feature fusion, batch normalization and other methods to enhance the representation capability of the model; by predicting a Gaussian map of the text line region it can position text line regions in an image, achieving accurate positioning of horizontal and arbitrarily inclined text lines against complex backgrounds. After the text positioning model is trained, it can be applied in an application scene: a prediction map containing the text to be positioned is input into the model, the model outputs a two-dimensional Gaussian prediction result map, and the result map is post-processed to obtain the position regions of the text lines in the prediction map. The text positioning method comprises the following steps:
step 701, extracting low-level features of the prediction map using a main unit, and down-sampling the feature map output by the main unit step by step using a plurality of down-sampling units to reduce the size of the feature map proportionally;
step 702, a plurality of upsampling units are used for upsampling the feature maps output by the main unit and the downsampling units so as to proportionally increase the size of the feature maps, and the feature maps with the same dimensionality output in the downsampling paths are fused layer by layer in the upsampling paths;
step 703, reducing the number of channels of the feature map output by the last up-sampling unit layer by layer using a plurality of convolutional layers, mapping the output value to the (0, 1) interval with an activation function in the last convolutional layer, and outputting a two-dimensional Gaussian prediction result map;
step 704, determining the region of the text line according to the brightness values of the pixel points in the two-dimensional Gaussian prediction result map.
In step 701, in order to improve the efficiency of training the deep convolutional neural network, a skip connection structure from residual networks is introduced into the down-sampling unit. Meanwhile, in order to reduce potential feature loss during down-sampling, an additional average pooling path is added when the feature map size is reduced.
In an embodiment of the present invention, in step 702, in order to better restore the detail features of the picture while taking sampling speed into account, the two ways of increasing the feature map size, namely transposed convolution and nearest neighbor up-sampling, are fused in the up-sampling unit.
In an embodiment of the present invention, in order to avoid losing details of the original feature map through an excessive sampling factor, the down-sampling unit in step 701 down-samples by a factor of 1/2; accordingly, the up-sampling unit in step 702 up-samples by a factor of 2.
In the embodiment of the present invention, the outputs of all convolution layers used in step 701, step 702, and step 703 except the last convolution layer in step 703 are processed by batch normalization and activation functions. The purpose of the normalization is to facilitate model convergence.
In an embodiment of the present invention, in an application scenario, in order to obtain a bounding box of a text line in a two-dimensional gaussian prediction result graph, a result graph output by a model is post-processed, and detailed processing steps are as follows:
Step 7041, according to a preset threshold, set pixels in the two-dimensional Gaussian prediction result map that are greater than or equal to the threshold to 1 and pixels smaller than the threshold to 0, binarizing the two-dimensional Gaussian prediction result map into a binarization result map.
Step 7042, perform connected-domain analysis on the binarization result map to obtain the connected regions of text pixel points.
Step 7043, determine a minimum circumscribed rectangle for each connected region to obtain the four vertices of the rectangle, and determine the text line region from the four vertices.
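The post-processing steps can be sketched in NumPy as follows. The connected-domain analysis is a simple 4-neighbour flood fill, and an axis-aligned bounding box stands in for the minimum circumscribed rectangle for brevity (in practice a rotated rectangle, e.g. via OpenCV's cv2.minAreaRect, would handle inclined text lines):

```python
import numpy as np
from collections import deque

def locate_text_boxes(prob_map, threshold=0.5):
    """Binarize a Gaussian prediction map, label connected regions with a
    4-neighbour flood fill, and return one box per region as
    (row_min, col_min, row_max, col_max)."""
    binary = (prob_map >= threshold).astype(np.uint8)   # binarization
    h, w = binary.shape
    seen = np.zeros_like(binary, dtype=bool)
    boxes = []
    for r in range(h):                                  # connected-domain scan
        for c in range(w):
            if binary[r, c] and not seen[r, c]:
                queue = deque([(r, c)])
                seen[r, c] = True
                rmin = rmax = r
                cmin = cmax = c
                while queue:
                    y, x = queue.popleft()
                    rmin, rmax = min(rmin, y), max(rmax, y)
                    cmin, cmax = min(cmin, x), max(cmax, x)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                boxes.append((rmin, cmin, rmax, cmax))  # axis-aligned box
    return boxes

demo = np.zeros((6, 8))
demo[1:3, 1:4] = 0.9   # one text blob
demo[4, 5:7] = 0.8     # another blob
boxes = locate_text_boxes(demo)
```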
With these post-processing steps, no NMS processing is needed and no candidate box sizes need to be defined in advance; accurate positioning of text lines at any angle can be achieved by predicting the Gaussian map of the text line region, the threshold can be set flexibly, and the speed is higher.
The training process of the text positioning model belongs to the supervised learning category of deep learning, and the model is end-to-end (End-to-End). During training, the trainable parameters in the model are continuously updated according to the difference between the model's output and the expected output, so that the model's output approaches the expected output as closely as possible, completing the training of the model.
The following describes, by way of example, a training process of a text localization model provided by the present invention, including:
s1, initializing parameters of a text positioning model, and setting a loss function adopted in a training stage and an optimization algorithm used in an optimization stage;
in one embodiment of the invention, before model training starts, the parameters in the model are initialized by determining a distribution according to the number of input and output neurons in each layer and then drawing random values from that distribution. The weights of the convolutional layers in the model are drawn from a uniform distribution [-a, a], where a is determined by the number of input channels d_in and the number of output channels d_out of the layer, namely:

a = √(6 / (d_in + d_out))
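Assuming the bound is the standard Xavier/Glorot uniform bound a = sqrt(6 / (d_in + d_out)) — an assumption consistent with, but not spelled out by, the garbled equation above — the initialization can be sketched as:

```python
import numpy as np

def xavier_uniform(d_in, d_out, shape, rng=None):
    """Draw weights from U[-a, a] with a = sqrt(6 / (d_in + d_out)),
    the standard Xavier/Glorot bound assumed here."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = np.sqrt(6.0 / (d_in + d_out))
    return rng.uniform(-a, a, size=shape), a

# Hypothetical 3x3 conv layer with 64 input and 128 output channels.
w, a = xavier_uniform(64, 128, shape=(128, 64, 3, 3))
```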
the training stage adopts the Mean Square Error (MSE) as the loss function, i.e. the sum of the squared differences between the two-dimensional Gaussian distribution map output by the model and the label map, used to quantify the gap between the model's output and the expected output. Its expression is:

MSE = Σ_(i,j) (P(i,j) − T(i,j))²
wherein P denotes the prediction sample map, T denotes the label map, and (i, j) denotes the position coordinates of a pixel point in the feature map.
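The loss can be sketched as follows (the tiny maps here are illustrative values only):

```python
import numpy as np

def mse_loss(pred, label):
    """Sum of squared differences between the predicted Gaussian map P
    and the label map T over all pixel positions (i, j)."""
    return float(((pred - label) ** 2).sum())

P = np.array([[0.2, 0.8], [0.5, 0.1]])  # hypothetical model output
T = np.array([[0.0, 1.0], [0.5, 0.0]])  # hypothetical label map
loss = mse_loss(P, T)   # 0.04 + 0.04 + 0.0 + 0.01 = 0.09
```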
S2, inputting the prediction sample picture into a text positioning model, and extracting low-level features of the prediction sample picture by a main unit; downsampling the feature map output by the main unit step by a plurality of downsampling units to reduce the size of the feature map proportionally;
s3, a plurality of up-sampling units up-sample the feature maps output by the main unit and the down-sampling units to proportionally increase the size of the feature maps, and feature maps with the same dimensionality output in the down-sampling paths are fused layer by layer in the up-sampling paths;
s4, reducing the number of channels of the feature map output by the up-sampling unit layer by layer through the plurality of convolutional layers, mapping the output value to the (0, 1) interval with an activation function in the last convolutional layer, and outputting a two-dimensional Gaussian distribution map;
and S5, updating parameters of the text positioning model by using a model optimization algorithm according to the error between the two-dimensional Gaussian distribution graph and the label graph output by the text positioning model, then inputting a new training sample graph again for iteration, and storing the model file after the iteration is terminated.
In an embodiment of the present invention, a gradient descent optimization algorithm or the Adam optimization algorithm is adopted in the back-propagation process during training of the text positioning model. The Adam optimization algorithm is an extension of stochastic gradient descent: unlike traditional stochastic gradient descent, which maintains a single learning rate, Adam designs an independent adaptive learning rate for each parameter by computing first- and second-moment estimates of the gradients, converging faster and more stably. The learning rate in the training phase was set to 0.0001.
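A single Adam update can be sketched as follows, showing the first- and second-moment estimates with bias correction and the resulting per-parameter step size (default hyperparameters assumed; learning rate 0.0001 as above):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: biased moment estimates, bias correction, then a
    per-parameter effective step size lr * m_hat / (sqrt(v_hat) + eps)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

w = np.array([1.0])
g = np.array([0.5])
w1, m, v = adam_step(w, g, m=np.zeros(1), v=np.zeros(1), t=1)
# On the very first step the bias-corrected update magnitude is ~lr,
# regardless of the gradient's scale.
```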
Models based on deep learning mostly need large amounts of data for training, but large-scale data labeling is time-consuming and labor-intensive. To maximize the utilization of labeled data and reduce overfitting of the model, an embodiment of the present invention performs data amplification on the input sample data in the training phase, i.e., applies multiple transformations to the existing prediction sample maps to obtain more of them. The transformation methods may include random rotation, random brightness and contrast conversion, random scaling, random cropping, random noise addition and the like. After the model is trained with this larger number of training samples, its prediction effect is better.
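One of the amplification transforms listed above, random brightness and contrast conversion, can be sketched as follows (the jitter ranges are illustrative assumptions, not values from the patent):

```python
import numpy as np

def random_brightness_contrast(img, rng):
    """Random contrast (alpha) and brightness (beta) jitter;
    results are clipped back into the valid [0, 255] pixel range."""
    alpha = rng.uniform(0.8, 1.2)   # contrast factor (assumed range)
    beta = rng.uniform(-20, 20)     # brightness shift (assumed range)
    return np.clip(alpha * img.astype(float) + beta, 0, 255).astype(np.uint8)

rng = np.random.default_rng(42)
img = np.full((4, 4, 3), 128, dtype=np.uint8)   # toy mid-gray RGB patch
aug = random_brightness_contrast(img, rng)
```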
In one embodiment of the invention, prediction sample pictures and corresponding label pictures are generated by combining manual annotation with automatic generation: collected prediction sample pictures containing text line data are annotated manually, while a large number of pictures with complex backgrounds and no text lines are collected and text lines are randomly generated on these background pictures by code, thereby obtaining more standard training samples.
Fig. 8 is a schematic diagram of the hardware structure of a text positioning apparatus according to an embodiment of the present invention. The text positioning apparatus 800 includes a processor 801, such as a Central Processing Unit (CPU), an internal bus 802, and a non-transitory storage medium 804. The processor 801 and the non-transitory storage medium 804 communicate with each other via the internal bus 802. If the functions of the steps of the text positioning method provided by the present invention are formed into corresponding software modules and this software is installed in the non-transitory storage medium 804 of the text positioning apparatus 800, a text positioning apparatus implementing the text positioning method and model functions of the present invention is obtained: the processor 801 reads and executes the machine-executable instructions corresponding to the text positioning apparatus or text positioning model 805 stored in the non-transitory storage medium 804, thereby realizing the functions and effects of the text positioning method provided by the present invention.
The text line positioning method and text positioning model provided by the invention can detect text line regions in RGB three-channel color images on the basis of a fully convolutional neural network. In the dataset preparation stage, a large number of samples of text lines against complex backgrounds are randomly generated, and data amplification methods such as random rotation are combined to increase the number of samples. The structure of the model integrates residual networks, transposed convolution, feature fusion, batch normalization and other methods to enhance the characterization capability of the model. Compared with existing text line positioning models, it can detect and position text lines against complex backgrounds, efficiently detect inclined text lines, and offers higher accuracy and better robustness. The method can serve as the text detector in Optical Character Recognition (OCR): the detected text regions are fed into a character recognition model to realize character recognition.
The above description is only an example of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (13)

1. The text positioning device is characterized by comprising a down-sampling module, an up-sampling module and an output layer module;
the down-sampling module consists of a backbone unit and N down-sampling units:
a backbone unit, composed of a plurality of convolutional layers, for extracting low-level features of an input picture; the feature map output by the backbone unit serves as the input of the first down-sampling unit, and is fused with the output feature map of the (M-1)th up-sampling unit to serve as the input of the Mth up-sampling unit;
the down-sampling unit is used for down-sampling the input feature map; the width and height of the feature map output by each down-sampling unit are reduced in proportion to those of the input feature map; except for the Nth down-sampling unit, the output of each down-sampling unit serves as the input of the next down-sampling unit, and the output of the last down-sampling unit serves as the input of the first up-sampling unit;
the up-sampling module consists of M up-sampling units:
the up-sampling unit is used for up-sampling the input characteristic diagram; the size of the characteristic diagram output by each up-sampling unit is proportionally enlarged relative to the width and height of the input characteristic diagram; the amplification ratio of the up-sampling unit is the same as that of the down-sampling unit; except the up-sampling unit connected with the output layer module in the up-sampling path, the feature graph output by each up-sampling unit is fused with the feature graph with the same dimensionality output by the main unit and each down-sampling unit except the Nth down-sampling unit in the down-sampling path and then used as the input of the next up-sampling unit, and the output of the Mth up-sampling unit is used as the input of the output layer module;
the output layer module is composed of a plurality of convolution layers and is used for gradually reducing the number of channels of the characteristic diagram, and the output of the output layer module is a two-dimensional Gaussian distribution diagram.
2. The apparatus of claim 1, wherein the downsampling unit comprises a first convolution sub-unit near the output side, a third convolution sub-unit near the input side, and a second convolution sub-unit in between:
the first convolution subunit and the second convolution subunit do not change the width and the height of the feature map, and the third convolution subunit reduces the width and the height of the feature map to 1/2;
the first convolution subunit comprises two paths, one path comprises three convolution layers, the other path comprises one convolution layer and an average pooling layer, and the outputs of the two paths are subjected to bitwise addition and fusion to obtain the output of the first convolution subunit;
the second convolution subunit comprises two paths, one path comprises three convolution layers, the other path does not comprise any operator, and the two paths are subjected to bitwise addition and fusion to obtain the output of the second convolution subunit;
the third convolution subunit comprises two paths, one path consists of a convolution layer and a transposed convolution layer, the other path consists of a convolution layer and a nearest neighbor upsampling layer, and the two paths are subjected to bitwise addition and fusion to obtain the output of the third convolution subunit;
the output of the third convolution subunit is used as the input of the second convolution subunit, the output of the second convolution subunit is used as the input of the first convolution subunit, and the output of the first convolution subunit is the output of the downsampling unit.
3. The apparatus of claim 2,
the up-sampling unit comprises two paths: one path comprises two convolutional layers with a transposed convolutional layer between them; the other path comprises a nearest neighbor up-sampling layer and a convolutional layer; the convolutional layers in the up-sampling unit are used for scaling the number of channels of the feature map, and the transposed convolutional layer and the nearest neighbor up-sampling layer are used for expanding the width and height of the feature map;
the height and width of the characteristic diagram output by the up-sampling unit are increased by 2 times compared with the input.
4. The apparatus of claim 3,
the output layer module consists of a plurality of convolution layers, introduces nonlinearity while gradually reducing the number of channels of the characteristic diagram, and increases the characteristic combination characterization capability; and the output of the last convolution layer in the output layer module is subjected to activation function processing, and then the output value is mapped to the interval of (0, 1) and then a two-dimensional Gaussian distribution map is output.
5. The apparatus of claim 4,
except the last convolution layer in the output layer module, the outputs of other convolution layers of the output layer module and all convolution layers contained in the trunk unit, the down-sampling unit and the up-sampling unit are subjected to batch normalization and activation function processing.
6. A method for text localization, the method comprising:
extracting low-level features of the prediction map using a backbone unit, and down-sampling the feature map output by the backbone unit step by step using a plurality of down-sampling units to reduce the size of the feature map proportionally;
up-sampling the feature maps output by the backbone unit and the down-sampling units using a plurality of up-sampling units to proportionally increase the size of the feature maps, wherein feature maps of the same dimensionality output in the down-sampling path are fused layer by layer into the up-sampling path;
reducing the number of channels of the feature map output by the last up-sampling unit layer by using a plurality of convolutional layers, mapping an output value to a (0, 1) interval by using an activation function in the last convolutional layer, and outputting a two-dimensional Gaussian prediction result map;
and determining the area of the text line according to the brightness value of the pixel points in the two-dimensional Gaussian prediction result image.
7. The method of claim 6,
a jump connection structure in a residual error network is introduced into a down-sampling unit, and an additional average pooling path is introduced into the down-sampling unit;
two ways of increasing the size of the feature map are used for fusion in the upsampling unit, namely transposition convolution and nearest neighbor upsampling.
8. The method of claim 6, wherein the step of determining the region of the text line according to the brightness values of the pixels in the two-dimensional Gaussian prediction result map comprises:
according to a preset threshold value, setting pixels which are larger than or equal to the threshold value in the two-dimensional Gaussian prediction result image to be 1, setting pixels which are smaller than the threshold value to be 0, and binarizing the two-dimensional Gaussian prediction result image to obtain a binarization result image;
performing connected domain analysis on the binarization result graph to obtain a connected region of the text pixel points;
and determining a minimum circumscribed rectangle for each connected region to obtain the four vertices of the rectangle, and determining the text line region according to the four vertices.
9. The method according to claim 6, wherein the downsampling unit downsamples by a factor of 1/2, and the corresponding upsampling unit upsamples by a factor of 2.
10. A training method of a text positioning model is characterized in that the method is applied to the text positioning model and comprises the following steps:
initializing parameters of the text positioning model, and setting a loss function adopted in a training stage and an optimization algorithm used in an optimization stage;
inputting the prediction sample graph into a text positioning model, and extracting low-level features of the prediction sample graph by a main unit in the text positioning model; downsampling the feature map output by the main unit step by a plurality of downsampling units in the text positioning model to reduce the size of the feature map in proportion;
a plurality of up-sampling units in the text positioning model up-sample the feature maps output by the main unit and the down-sampling units to proportionally increase the size of the feature maps, and feature maps with the same dimensionality output in the down-sampling paths are fused layer by layer in the up-sampling paths;
reducing the number of channels of the characteristic diagram output by the up-sampling unit layer by a plurality of convolutional layers in the text positioning model, and mapping an output value to a (0, 1) interval by using an activation function in the last convolutional layer to output a two-dimensional Gaussian distribution diagram;
and updating parameters of the text positioning model by using a model optimization algorithm according to the error between the two-dimensional Gaussian distribution graph and the label graph output by the text positioning model, then inputting a new training sample graph again for iteration, and storing the model file after the iteration is terminated.
11. The method of claim 10, wherein the text positioning model is a text positioning model,
in the training process, the mean square error is used as a loss function of the text positioning model, namely the sum of squares of differences between a two-dimensional Gaussian distribution diagram output by the model and a label diagram is used as the loss function of the model;
in the back propagation process, a gradient descent optimization algorithm or an Adam optimization algorithm is adopted.
12. The method of claim 10, further comprising:
the number of the prediction sample images is increased in a mode that codes randomly generate text lines on the background image, and the text positioning model is trained.
13. A text positioning device, comprising: a processor and a non-transitory storage medium, the processor reading and executing instruction code from the non-transitory storage medium to implement the method of any one of claims 6 to 10.
CN202010132023.3A 2020-02-29 2020-02-29 Text positioning method and device and model training method Active CN111476226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010132023.3A CN111476226B (en) 2020-02-29 2020-02-29 Text positioning method and device and model training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010132023.3A CN111476226B (en) 2020-02-29 2020-02-29 Text positioning method and device and model training method

Publications (2)

Publication Number Publication Date
CN111476226A true CN111476226A (en) 2020-07-31
CN111476226B CN111476226B (en) 2022-08-30

Family

ID=71748053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010132023.3A Active CN111476226B (en) 2020-02-29 2020-02-29 Text positioning method and device and model training method

Country Status (1)

Country Link
CN (1) CN111476226B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112132142A (en) * 2020-09-27 2020-12-25 平安医疗健康管理股份有限公司 Text region determination method, text region determination device, computer equipment and storage medium
CN112967296A (en) * 2021-03-10 2021-06-15 重庆理工大学 Point cloud dynamic region graph convolution method, classification method and segmentation method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154145A (en) * 2018-01-24 2018-06-12 北京地平线机器人技术研发有限公司 The method and apparatus for detecting the position of the text in natural scene image
CN108805131A (en) * 2018-05-22 2018-11-13 北京旷视科技有限公司 Text line detection method, apparatus and system
US10262235B1 (en) * 2018-02-26 2019-04-16 Capital One Services, Llc Dual stage neural network pipeline systems and methods
US20190130204A1 (en) * 2017-10-31 2019-05-02 The University Of Florida Research Foundation, Incorporated Apparatus and method for detecting scene text in an image
CN109948607A (en) * 2019-02-21 2019-06-28 电子科技大学 Candidate frame based on deep learning deconvolution network generates and object detection method
CN110222680A (en) * 2019-05-19 2019-09-10 天津大学 A kind of domestic waste article outer packing Method for text detection
CN110322495A (en) * 2019-06-27 2019-10-11 电子科技大学 A kind of scene text dividing method based on Weakly supervised deep learning

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130204A1 (en) * 2017-10-31 2019-05-02 The University Of Florida Research Foundation, Incorporated Apparatus and method for detecting scene text in an image
CN108154145A (en) * 2018-01-24 2018-06-12 北京地平线机器人技术研发有限公司 The method and apparatus for detecting the position of the text in natural scene image
US10262235B1 (en) * 2018-02-26 2019-04-16 Capital One Services, Llc Dual stage neural network pipeline systems and methods
CN108805131A (en) * 2018-05-22 2018-11-13 北京旷视科技有限公司 Text line detection method, apparatus and system
CN109948607A (en) * 2019-02-21 2019-06-28 电子科技大学 Candidate frame based on deep learning deconvolution network generates and object detection method
CN110222680A (en) * 2019-05-19 2019-09-10 天津大学 A kind of domestic waste article outer packing Method for text detection
CN110322495A (en) * 2019-06-27 2019-10-11 电子科技大学 A kind of scene text dividing method based on Weakly supervised deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YIN SHI-HAO ET AL.: "Traffic sign recognition based on deep convolutional neural network", 《OPTOELECTRONICS LETTERS》 *
HE TONG, YAO JIAN: "Scene text detection based on fully convolutional networks" (基于全卷积网络的场景文本检测), 《黑龙江科技信息》 (Heilongjiang Science and Technology Information) *
LUO YAO ET AL.: "Text region localization method based on deep fully convolutional neural networks" (基于深度全卷积神经网络的文字区域定位方法), 《无线互联科技》 (Wireless Internet Technology) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112132142A (en) * 2020-09-27 2020-12-25 平安医疗健康管理股份有限公司 Text region determination method, text region determination device, computer equipment and storage medium
CN112967296A (en) * 2021-03-10 2021-06-15 重庆理工大学 Point cloud dynamic region graph convolution method, classification method and segmentation method
CN112967296B (en) * 2021-03-10 2022-11-15 重庆理工大学 Point cloud dynamic region graph convolution method, classification method and segmentation method

Also Published As

Publication number Publication date
CN111476226B (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN112966684B (en) Cooperative learning character recognition method under attention mechanism
US11657602B2 (en) Font identification from imagery
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN110428428B (en) Image semantic segmentation method, electronic equipment and readable storage medium
WO2023015743A1 (en) Lesion detection model training method, and method for recognizing lesion in image
CN111553397B (en) Cross-domain target detection method based on regional full convolution network and self-adaption
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
US20170124415A1 (en) Subcategory-aware convolutional neural networks for object detection
CN103049763B (en) Context-constraint-based target identification method
CN112116599B (en) Sputum smear tubercle bacillus semantic segmentation method and system based on weak supervised learning
CN106326858A (en) Road traffic sign automatic identification and management system based on deep learning
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN113159120A (en) Contraband detection method based on multi-scale cross-image weak supervision learning
CN112287941B (en) License plate recognition method based on automatic character region perception
CN114155527A (en) Scene text recognition method and device
CN112733614B (en) Pest image detection method with similar size enhanced identification
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
CN115131797B (en) Scene text detection method based on feature enhancement pyramid network
CN107506792B (en) Semi-supervised salient object detection method
CN116645592B (en) Crack detection method based on image processing and storage medium
CN111274964B (en) Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle
CN112488229A (en) Domain self-adaptive unsupervised target detection method based on feature separation and alignment
CN112541491A (en) End-to-end text detection and identification method based on image character region perception
CN112016512A (en) Remote sensing image small target detection method based on feedback type multi-scale training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant