CN111476226B - Text positioning method and device and model training method - Google Patents

Text positioning method and device and model training method

Info

Publication number
CN111476226B
Authority
CN
China
Prior art keywords
output
convolution
sampling
unit
sampling unit
Prior art date
Legal status
Active
Application number
CN202010132023.3A
Other languages
Chinese (zh)
Other versions
CN111476226A (en)
Inventor
尹世豪
Current Assignee
New H3C Big Data Technologies Co Ltd
Original Assignee
New H3C Big Data Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by New H3C Big Data Technologies Co Ltd
Priority claimed from CN202010132023.3A
Publication of CN111476226A
Application granted
Publication of CN111476226B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a text positioning method, a text positioning device and a model training method for improving the localization of text lines in pictures. The invention provides a text positioning model based on a fully convolutional neural network that integrates residual networks, transposed convolution, feature fusion, batch normalization and related techniques to enhance the representational capability of the model. The text positioning method, the model and the model training method locate text line regions in an image by predicting a two-dimensional Gaussian distribution map of those regions, and can accurately locate horizontal and arbitrarily inclined text lines against complex backgrounds.

Description

Text positioning method and device and model training method
Technical Field
The invention relates to the technical field of artificial intelligence image and text processing, and in particular to a text positioning method, a text positioning device and a model training method.
Background
Optical Character Recognition (OCR) is a technology that recognizes character information in an image through image processing, and it is widely applied in fields such as certificate recognition, license plate recognition and the digitization of paper documents. A complete OCR pipeline is generally divided into two steps: text localization and text recognition. Text localization means accurately determining the coordinate information of the text in the image, and text recognition is the process of identifying which characters appear in the localized image region. The accuracy of text localization directly affects the precision of the subsequent text recognition; depending on the recognition model used afterwards, localization is generally divided into single-character localization and text line localization.
One class of text line detection algorithms is based on general object detection algorithms such as the Faster Region-based Convolutional Neural Network (Faster RCNN). When text in an image is detected with such a convolutional-neural-network object detector, the basic steps are: input the image into a convolutional neural network to extract a feature map, input the feature map into a Region Proposal Network (RPN) to generate a number of candidate boxes, convert the candidate boxes into fixed-size feature maps through Region of Interest (ROI) pooling, and finally judge in turn whether each candidate box is a text region. Such a text detection model must define candidate boxes of different sizes in advance to adapt to target regions of different sizes. Since the sizes and lengths of text lines in real picture samples vary widely, a predefined set of candidate box sizes can hardly cover all cases, so the localization accuracy for text lines is poor. For skewed text lines, the detection result may also include too many irrelevant regions.
A text line localization algorithm based on the Connectionist Text Proposal Network (CTPN) predicts the vertical position of text using a differential idea from mathematics. The algorithm flow is as follows: first extract the output of the 5th convolutional layer of the VGG16 network as a feature map, then extract features with a 3x3 sliding window and feed them into a bidirectional Long Short-Term Memory network (LSTM) that outputs a 512-dimensional feature vector, then obtain the positions of text boxes through classification and regression, and finally obtain the target box containing a text line through a text line construction algorithm. This text detection model is relatively complex, and its detection performance on inclined text lines is poor.
Disclosure of Invention
The invention provides a text positioning method, a text positioning device and a model training method for improving the localization of text lines in pictures.
According to an embodiment of the invention, a text positioning device is provided, comprising a down-sampling module, an up-sampling module and an output layer module;
the down-sampling module consists of a backbone unit and N down-sampling units:
the backbone unit is composed of several convolution layers and extracts low-level features of the input picture; the feature map output by the backbone unit serves as the input of the first down-sampling unit, and is also fused with the feature map output by the (M-1)th up-sampling unit to serve as the input of the Mth up-sampling unit;
each down-sampling unit down-samples its input feature map; the width and height of the feature map output by each down-sampling unit are reduced in proportion to the input feature map; except for the Nth down-sampling unit, the output of each down-sampling unit serves as the input of the next down-sampling unit, and the output of the last down-sampling unit serves as the input of the first up-sampling unit;
the up-sampling module consists of M up-sampling units:
each up-sampling unit up-samples its input feature map; the width and height of the feature map output by each up-sampling unit are enlarged in proportion to the input feature map, and the enlargement ratio of an up-sampling unit matches the reduction ratio of a down-sampling unit; except for the up-sampling unit connected to the output layer module, the feature map output by each up-sampling unit in the up-sampling path is fused with the feature map of the same dimensions produced in the down-sampling path (by the backbone unit or by one of the down-sampling units other than the Nth) and then serves as the input of the next up-sampling unit; the output of the Mth up-sampling unit serves as the input of the output layer module;
the output layer module is composed of several convolution layers that gradually reduce the number of channels of the feature map, and its output is a two-dimensional Gaussian distribution map.
Further, according to an embodiment of the invention, the down-sampling unit comprises a first convolution subunit near the output side, a third convolution subunit near the input side and a second convolution subunit in between:
the first and second convolution subunits do not change the width and height of the feature map, and the third convolution subunit reduces the width and height of the feature map to 1/2;
the first convolution subunit comprises two paths, one with three convolution layers and the other with one convolution layer and an average pooling layer; the outputs of the two paths are fused by element-wise addition to give the output of the first convolution subunit;
the second convolution subunit comprises two paths, one with three convolution layers and the other with no operator at all; the two paths are fused by element-wise addition to give the output of the second convolution subunit;
the third convolution subunit comprises two paths, one consisting of a convolution layer and a transposed convolution layer and the other consisting of a convolution layer and a nearest-neighbor up-sampling layer; the two paths are fused by element-wise addition to give the output of the third convolution subunit;
the output of the third convolution subunit serves as the input of the second convolution subunit, the output of the second convolution subunit serves as the input of the first convolution subunit, and the output of the first convolution subunit is the output of the down-sampling unit.
Further, according to an embodiment of the invention, the up-sampling unit comprises two paths: one path consists of two convolution layers with a transposed convolution layer between them, and the other path comprises a nearest-neighbor up-sampling layer and a convolution layer; the convolution layers in the up-sampling unit scale the number of channels of the feature map, while the transposed convolution layer and the nearest-neighbor up-sampling layer enlarge its width and height; the height and width of the feature map output by the up-sampling unit are doubled relative to the input.
Further, according to an embodiment of the invention, the output layer module is composed of several convolution layers that introduce nonlinearity while gradually reducing the number of channels of the feature map, increasing the capability to characterize feature combinations; the output of the last convolution layer in the output layer module is processed by an activation function that maps the output values into the interval (0, 1), after which the two-dimensional Gaussian distribution map is output.
Further, according to an embodiment of the invention, except for the last convolution layer in the output layer module, the outputs of the other convolution layers of the output layer module and of all convolution layers contained in the backbone unit, the down-sampling units and the up-sampling units are processed by batch normalization and an activation function.
According to an embodiment of the invention, a text positioning method is also provided, comprising:
extracting low-level features of the prediction image using a backbone unit, and down-sampling the feature map output by the backbone unit step by step using several down-sampling units to reduce the size of the feature map proportionally;
up-sampling the feature maps output by the backbone unit and the down-sampling units using several up-sampling units to enlarge the size of the feature map proportionally, fusing the same-dimension feature maps output in the down-sampling path layer by layer in the up-sampling path;
reducing the number of channels of the feature map output by the last up-sampling unit layer by layer using several convolution layers, mapping the output values into the interval (0, 1) with an activation function in the last convolution layer, and outputting a two-dimensional Gaussian prediction result map;
and determining the region where each text line is located according to the brightness values of the pixels in the two-dimensional Gaussian prediction result map.
Further, according to an embodiment of the invention, a skip connection structure from residual networks is introduced into the down-sampling unit, together with an additional average pooling path; in the up-sampling unit, two ways of enlarging the feature map, transposed convolution and nearest-neighbor up-sampling, are fused.
Further, according to an embodiment of the invention, determining the region where a text line is located from the brightness values of the pixels in the two-dimensional Gaussian prediction result map comprises:
setting pixels of the two-dimensional Gaussian prediction result map that are greater than or equal to a preset threshold to 1 and pixels smaller than the threshold to 0, thereby binarizing the two-dimensional Gaussian prediction result map to obtain a binarized result map;
performing connected-component analysis on the binarized result map to obtain the connected regions of text pixels;
and determining a minimum bounding rectangle for each connected region to obtain the four vertices of the rectangle, and determining the text line region from those four vertices.
Further, according to an embodiment of the invention, the down-sampling unit down-samples by a factor of 1/2, and the corresponding up-sampling unit up-samples by a factor of 2.
According to an embodiment of the invention, a training method for the text positioning model is also provided, applied to the text positioning model and comprising:
initializing the parameters of the text positioning model, and setting the loss function used in the training stage and the optimization algorithm used in the optimization stage;
inputting a sample image into the text positioning model, where the backbone unit in the text positioning model extracts low-level features of the sample image, and several down-sampling units in the text positioning model down-sample the feature map output by the backbone unit step by step to reduce the size of the feature map proportionally;
several up-sampling units in the text positioning model up-sample the feature maps output by the backbone unit and the down-sampling units to enlarge the size of the feature map proportionally, and the same-dimension feature maps output in the down-sampling path are fused layer by layer in the up-sampling path;
several convolution layers in the text positioning model reduce the number of channels of the feature map output by the up-sampling units layer by layer, the output values are mapped into the interval (0, 1) with an activation function in the last convolution layer, and a two-dimensional Gaussian distribution map is output;
and updating the parameters of the text positioning model with a model optimization algorithm according to the error between the two-dimensional Gaussian distribution map output by the text positioning model and the label map, then inputting a new training sample image for the next iteration, and saving the model file after the iterations terminate.
Further, according to an embodiment of the invention, the mean square error is used as the loss function of the text positioning model during training, that is, the squared differences between the two-dimensional Gaussian distribution map output by the model and the label map are summed to form the loss; a gradient descent optimization algorithm or the Adam optimization algorithm is used during back-propagation.
Further, according to an embodiment of the invention, the training method further comprises:
increasing the number of training sample images by programmatically generating random text lines on background pictures, and training the text positioning model with them.
According to an embodiment of the invention, a text positioning apparatus is also provided, comprising a processor and a non-transitory storage medium, the processor reading and executing instruction code from the non-transitory storage medium to implement the aforementioned text positioning method.
According to the technical scheme, the text positioning model provided by the invention is based on a fully convolutional neural network and integrates residual networks, transposed convolution, feature fusion, batch normalization and related techniques to enhance the representational capability of the model. The text positioning method, the model and the model training method provided by the invention locate text line regions in an image by predicting a two-dimensional Gaussian distribution map of those regions, and can accurately locate horizontal and arbitrarily inclined text lines against complex backgrounds.
Drawings
In order to more clearly illustrate the embodiments of the present invention and the technical solutions in the prior art, the drawings used in their description are briefly introduced below. It is obvious that the drawings described below cover only some embodiments of the present invention, and those skilled in the art may derive other drawings from them.
FIG. 1 is a schematic diagram of a training sample and a label graph provided in accordance with an embodiment of the present invention;
FIG. 2 is a diagram illustrating a structure of a text positioning model according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a backbone unit in a down-sampling module according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a down-sampling unit in a down-sampling module according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an up-sampling unit in an up-sampling module according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an output layer module according to an embodiment of the present invention;
FIG. 7 is a flowchart illustrating steps of a text positioning method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a hardware structure of a text positioning device according to an embodiment of the present invention.
Detailed Description
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the invention. As used in the examples and claims of the present invention, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. As used herein, the term "and/or" is intended to encompass any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used to describe various information in embodiments of the present invention, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information and, similarly, the second information may also be referred to as first information, without departing from the scope of embodiments of the present invention. Depending on the context, the word "if" as used here may be interpreted as "when", "upon" or "in response to determining".
With the rapid development of deep learning in recent years, computer vision technology has seen a series of major breakthroughs; in the OCR field in particular, the accuracy of text detection and recognition has improved greatly. General object detection algorithms such as Faster RCNN, YOLO and SSD are often not effective when applied to text detection, because text lines in pictures are usually long, thin regions with large aspect ratios, and a predefined set of candidate box sizes can hardly cope with real-world samples. The advent of the CTPN algorithm improved the performance of general object detection in the text line localization task; it locates text regions by predicting the vertical positions of target boxes, but for tilted text lines the accuracy of the text line regions it constructs is low.
The invention provides a new text line positioning model based on a fully convolutional neural network. The model structure integrates residual networks, transposed convolution, feature fusion, batch normalization and related techniques, enhancing the representational capability of the model. The model locates text line regions in an image by predicting a Gaussian map of those regions, and can accurately locate horizontal and arbitrarily inclined text lines against complex backgrounds.
Before a supervised deep learning model is formally applied in a deployment scenario, a large number of training samples are required to train the model and fix its parameters; once the model file is fixed, it can be deployed to a device for text detection and localization in that scenario.
The text line positioning model provided by the invention also needs a certain number of labeled training samples. Because the model is applied to localizing text lines in pictures and outputs a two-dimensional Gaussian map, the label map generated by annotating a training sample is also a two-dimensional Gaussian map, with pixel values in the range 0 to 1.
FIG. 1 is a schematic diagram of a training sample and its label map according to an embodiment of the present invention. The left image is a training sample fed into the model, and the right image is the label map generated after annotating the text lines in the training sample. Each text line region in the label map is labeled as a two-dimensional Gaussian distribution; the brighter a region in the label map, the closer its pixel values are to 1, i.e., the bright regions in the label map correspond to the regions where the text lines are located in the original image.
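For illustration only, the following sketch shows one way such a two-dimensional Gaussian label map could be generated for axis-aligned text line boxes. The exact parameterization used in the patent (Gaussian center, standard deviations, handling of overlaps) is not spelled out, so the choices below are assumptions.

```python
import numpy as np

def gaussian_label_map(height, width, boxes, sigma_scale=0.25):
    """Build a label map in [0, 1] with one 2D Gaussian per text line box.

    boxes: list of (x0, y0, x1, y1) axis-aligned text line rectangles.
    sigma_scale: assumed ratio between box size and Gaussian std (not from the patent).
    """
    label = np.zeros((height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for x0, y0, x1, y1 in boxes:
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        sx = max((x1 - x0) * sigma_scale, 1.0)
        sy = max((y1 - y0) * sigma_scale, 1.0)
        g = np.exp(-(((xs - cx) ** 2) / (2 * sx ** 2) + ((ys - cy) ** 2) / (2 * sy ** 2)))
        label = np.maximum(label, g)   # keep the brighter value where boxes overlap
    return label

# Example: a 100x200 label map containing one text line box
lbl = gaussian_label_map(100, 200, [(20, 30, 180, 60)])
```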
FIG. 2 is a schematic diagram of a text positioning model structure according to an embodiment of the present invention. The text positioning model provided by the invention can be implemented in software or hardware, and each functional module or unit in the model can correspond to a functional chip or a programmable logic unit in a hardware device. The text positioning model provided by the invention may therefore also be called a text positioning device: once the model is solidified in or installed on a hardware device, that device becomes a text positioning device capable of achieving the purpose of the invention.
Taking the embodiment of fig. 2 as an example, the text positioning model 200 provided by the present invention includes 3 modules: a downsampling module 210, an upsampling module 220, and an output layer module 230.
The down-sampling module consists of a backbone unit Stem and N down-sampling units DownBlock:
the backbone unit is composed of several convolution layers and extracts low-level features of the input picture; the feature map output by the backbone unit serves as the input of the first down-sampling unit, and is also fused with the feature map output by the (M-1)th up-sampling unit to serve as the input of the Mth up-sampling unit;
each down-sampling unit down-samples its input feature map; the width and height of the feature map output by each down-sampling unit are reduced in proportion to the input feature map; except for the Nth down-sampling unit, the output of each down-sampling unit serves as the input of the next down-sampling unit, and the output of the last down-sampling unit serves as the input of the first up-sampling unit;
the up-sampling module consists of M up-sampling units UpBlock:
each up-sampling unit up-samples its input feature map; the width and height of the feature map output by each up-sampling unit are enlarged in proportion to the input feature map, and the enlargement ratio of an up-sampling unit matches the reduction ratio of a down-sampling unit. In the invention, the processing path through the down-sampling units is called the down-sampling path and the processing path through the up-sampling units is called the up-sampling path. Except for the up-sampling unit connected to the output layer module, the feature map output by each up-sampling unit is fused with the feature map of the same dimensions produced in the down-sampling path (by the backbone unit or by one of the down-sampling units other than the Nth) and then serves as the input of the next up-sampling unit; the output of the Mth up-sampling unit serves as the input of the output layer module;
the output layer module is composed of several convolution layers that gradually reduce the number of channels of the feature map, and its output is a two-dimensional Gaussian distribution map.
The fusion of two feature maps/feature vectors in the invention can be performed as follows: two feature maps of identical dimensions are added element by element at corresponding positions (i.e., element-wise addition) to obtain a new feature map, and the dimensions of the fused feature map are unchanged. Taking the feature maps output by the backbone unit 215 and the fourth up-sampling unit 224 in FIG. 2 as an example, the "+" in the figure indicates that the feature map output by the backbone unit is fused with the feature map output by the fourth up-sampling unit, and the fused feature map serves as the input of the fifth up-sampling unit 225.
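A minimal PyTorch sketch of this fusion, i.e. element-wise addition of two same-shape feature maps; the shapes below are arbitrary examples, not values from the patent:

```python
import torch

down_feat = torch.randn(1, 64, 128, 128)   # e.g. a backbone / down-sampling path output
up_feat = torch.randn(1, 64, 128, 128)     # up-sampling unit output with identical dimensions
fused = down_feat + up_feat                # the "+" in FIG. 2: element-wise addition, shape unchanged
```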
The invention does not limit the number of down-sampling units in the down-sampling path or the number of up-sampling units in the up-sampling path; they can be chosen according to the actual application scenario and the required output, as long as the matching logic and output requirements between the down-sampling units and the up-sampling units are satisfied.
Assuming that b represents the number of samples input to the model (the batch size), h and w represent the height and width of the image, and the training sample pictures have 3 channels, the dimension of the model input data can be represented as (b, h, w, 3), and the model output is a two-dimensional Gaussian distribution map of dimension (b, h, w, 1).
In the embodiment shown in FIG. 2, the down-sampling module includes 4 DownBlocks and the up-sampling module includes 5 UpBlocks. Input data of dimension (b, h, w, 3) is processed by the Stem and the 4 DownBlocks into a feature map of dimension (b, h/32, w/32, 1024); this feature map is then up-sampled by the 5 UpBlocks to size (b, h, w, 32), and finally the output layer module ConvBlock reduces the number of channels to 1, giving the two-dimensional Gaussian distribution map of dimension (b, h, w, 1).
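The wiring and shape flow of this embodiment can be checked with the following PyTorch sketch. The internal blocks here are deliberately simplified stand-ins (a single strided or transposed convolution plus batch normalization), not the patent's Stem/DownBlock/UpBlock designs; only the skip fusions and the size/channel progression quoted above are taken from the text, and the 64-channel Stem output is an assumption. Note that PyTorch uses (b, C, H, W) ordering rather than the (b, h, w, C) notation above.

```python
import torch
import torch.nn as nn

def down(c_in, c_out):   # stand-in for Stem / DownBlock: halves H and W
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

def up(c_in, c_out):     # stand-in for UpBlock: doubles H and W
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 2, stride=2),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class TextLocNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = down(3, 64)                      # (b, 64,  h/2,  w/2)  - channel width assumed
        self.db = nn.ModuleList([down(64, 128),      # (b, 128, h/4,  w/4)
                                 down(128, 256),     # (b, 256, h/8,  w/8)
                                 down(256, 512),     # (b, 512, h/16, w/16)
                                 down(512, 1024)])   # (b, 1024, h/32, w/32) as quoted in the text
        self.ub = nn.ModuleList([up(1024, 512), up(512, 256), up(256, 128),
                                 up(128, 64), up(64, 32)])
        self.head = nn.Sequential(nn.Conv2d(32, 1, 1), nn.Sigmoid())  # ConvBlock stand-in

    def forward(self, x):
        skips = [self.stem(x)]
        for blk in self.db:
            skips.append(blk(skips[-1]))
        y = skips[-1]                                 # output of the 4th DownBlock
        for i, blk in enumerate(self.ub):
            y = blk(y)
            if i < 4:                                 # fuse with the same-size down-path feature map
                y = y + skips[3 - i]
        return self.head(y)                           # (b, 1, h, w) two-dimensional Gaussian map

out = TextLocNetSketch()(torch.randn(1, 3, 256, 256))
print(out.shape)   # torch.Size([1, 1, 256, 256])
```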
FIG. 3 is a schematic structural diagram of the backbone unit 215 in the down-sampling module 210 of the embodiment of FIG. 2. The backbone unit 215 consists of 3 convolution layers and extracts low-level features from the image. Taking the first convolution layer, Conv 32 (3x3, s=2), as an example: the number 32 after Conv is the number of convolution kernels used in the convolution operation, 3x3 is the kernel size, and s=2 means the stride with which the kernel slides over the feature map is 2. The annotations in the other figures have the same meaning and are not repeated.
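Read literally, the first Stem layer is a 32-kernel 3x3 convolution with stride 2, followed (per the note on batch normalization below) by BN and an activation. A sketch of such a backbone unit follows; the configuration of the remaining two convolution layers (kernel counts, strides) is an assumption rather than a reading of FIG. 3.

```python
import torch.nn as nn

def conv_bn_relu(c_in, c_out, k, s=1):
    # every convolution except the model's last one is followed by BN and an activation
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, stride=s, padding=k // 2),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

stem = nn.Sequential(
    conv_bn_relu(3, 32, 3, s=2),   # "Conv 32 (3x3, s=2)" from FIG. 3
    conv_bn_relu(32, 32, 3),       # remaining two layers: assumed 3x3, stride 1
    conv_bn_relu(32, 64, 3),
)
```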
FIG. 4 is a schematic structural diagram of the down-sampling units 211-214 in the down-sampling module 210 of the embodiment of FIG. 2. A down-sampling unit down-samples the feature map: after each down-sampling the height and width of the feature map are halved. The values of n in DownBlocks 1-4 are 64, 128, 256 and 512, respectively.
In the embodiment of the invention, when a convolutional neural network with many layers is trained, the training effect often suffers because gradients attenuate during back-propagation. To improve the efficiency of training a deep convolutional neural network, the skip connection structure from residual networks (i.e., the path in the figure that contains no convolution layer, or only a 1x1 convolution for raising or lowering the channel dimension) is introduced into the down-sampling unit. At the same time, to reduce the potential loss of features during down-sampling, an additional average pooling path is added when the feature map size is reduced.
The down-sampling unit comprises three convolution subunits, called the first, second and third convolution subunits counting from the output side. The first and second convolution subunits do not change the width and height of the feature map, and the third convolution subunit reduces the width and height of the feature map to 1/2;
the first convolution subunit comprises two paths, one with three convolution layers and the other with one convolution layer and an average pooling layer; the outputs of the two paths are fused by element-wise addition to give the output of the first convolution subunit;
the second convolution subunit comprises two paths, one with three convolution layers and the other with no operator at all; the two paths are fused by element-wise addition to give the output of the second convolution subunit;
the third convolution subunit comprises two paths, one consisting of a convolution layer and a transposed convolution layer and the other consisting of a convolution layer and a nearest-neighbor up-sampling layer; the two paths are fused by element-wise addition to give the output of the third convolution subunit.
The output of the third convolution subunit serves as the input of the second convolution subunit, the output of the second convolution subunit serves as the input of the first convolution subunit, and the output of the first convolution subunit is the output of the down-sampling unit.
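As one concrete illustration, the second convolution subunit is a standard residual block: a three-convolution path added element-wise to an operator-free identity path. A minimal sketch follows; the kernel sizes and bottleneck width are assumptions, and FIG. 4 of the patent gives the actual configuration.

```python
import torch
import torch.nn as nn

class SecondConvSubunit(nn.Module):
    """Identity-skip residual block: output = F(x) + x, feature map size unchanged."""
    def __init__(self, channels, bottleneck):
        super().__init__()
        self.branch = nn.Sequential(   # the path with three convolution layers
            nn.Conv2d(channels, bottleneck, 1), nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1), nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, 1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.branch(x) + x      # element-wise addition with the operator-free path

y = SecondConvSubunit(128, 64)(torch.randn(1, 128, 64, 64))   # shape preserved: (1, 128, 64, 64)
```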
FIG. 5 is a schematic structural diagram of the up-sampling units 221-225 in the up-sampling module 220 of the embodiment of FIG. 2. An up-sampling unit enlarges the feature map: after each up-sampling the width and height of the feature map are doubled. Because an excessively large up-sampling factor makes the details in the final result coarse, this embodiment uses 2x enlargement, which preserves the detail information after up-sampling as much as possible.
There are various ways to up-sample. Nearest-neighbor up-sampling is fast but has difficulty restoring the detailed features of the picture. Transposed-convolution up-sampling contains learnable parameters that can learn detail features during training, but may introduce noise. To obtain a better up-sampling result, the invention fuses these two ways of enlarging the feature map, transposed convolution and nearest-neighbor up-sampling, inside the up-sampling unit; the final effect is better than either up-sampling method alone.
In the embodiment of FIG. 2, the values of n used for the number of convolution kernels in UpBlocks 1-5 are 512, 256, 128, 64 and 32, respectively. The value of n directly affects the number of parameters, i.e., the complexity, of the model, and is chosen by trading complexity against representational capability. The invention does not limit the value of n for the convolution layers in the up-sampling unit; in practice it can be determined by balancing representational capability and model complexity for the situation at hand. In this embodiment, each up-sampling unit UpBlock includes two paths: one path is configured with two convolution layers and one transposed convolution layer (ConvTrans), the upper and lower layers being convolution layers and the middle layer the transposed convolution; the other path includes a nearest-neighbor up-sampling layer (denoted Upsample in the figure) and a convolution layer. The convolution layers in the up-sampling unit scale the number of channels of the feature map, while the transposed convolution and nearest-neighbor up-sampling enlarge its width and height.
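A sketch of such an up-sampling unit in PyTorch follows. The two-path structure, the 2x enlargement and the output width n follow the description; the kernel sizes and the placement of the channel scaling are assumptions.

```python
import torch
import torch.nn as nn

class UpBlockSketch(nn.Module):
    """Doubles H and W; fuses a transposed-convolution path with a nearest-neighbor path."""
    def __init__(self, c_in, n):
        super().__init__()
        self.trans_path = nn.Sequential(             # conv -> transposed conv (2x) -> conv
            nn.Conv2d(c_in, n, 1), nn.BatchNorm2d(n), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(n, n, 2, stride=2), nn.BatchNorm2d(n), nn.ReLU(inplace=True),
            nn.Conv2d(n, n, 3, padding=1), nn.BatchNorm2d(n), nn.ReLU(inplace=True),
        )
        self.nn_path = nn.Sequential(                # nearest-neighbor upsample (2x) -> conv
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(c_in, n, 1), nn.BatchNorm2d(n), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.trans_path(x) + self.nn_path(x)  # element-wise fusion of the two paths

y = UpBlockSketch(1024, 512)(torch.randn(1, 1024, 8, 8))   # -> (1, 512, 16, 16)
```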
FIG. 6 is a schematic structural diagram of the output layer module 230 in the embodiment of FIG. 2. The output layer module 230 consists of several convolution layers that gradually reduce the number of channels of the feature map while introducing nonlinearity, increasing the capability to characterize feature combinations, and finally produce the output of the model. The output of the last convolution layer in the output layer module 230 is processed by an activation function, such as the sigmoid function, which maps the output values into the interval (0, 1); the two-dimensional Gaussian distribution map is then output.
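A sketch of such an output layer module: the 32-to-1 channel reduction and the final sigmoid follow the text, while the number of layers and the intermediate channel widths are assumptions (FIG. 6 gives the actual configuration).

```python
import torch
import torch.nn as nn

# Output layer sketch: gradually reduce channels 32 -> 1; intermediate widths are assumed.
conv_block = nn.Sequential(
    nn.Conv2d(32, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(inplace=True),
    nn.Conv2d(16, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(inplace=True),
    nn.Conv2d(8, 1, 1),              # last convolution: no BN / ReLU afterwards
    nn.Sigmoid(),                    # maps the output values into (0, 1)
)
gauss_map = conv_block(torch.randn(1, 32, 256, 256))   # -> (1, 1, 256, 256)
```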
The text positioning model provided by the invention fuses the feature maps of corresponding size from the down-sampling path during up-sampling, further improving the reuse of feature maps. It should be noted that, in the figures above, except for the last convolution layer in the output layer module 230, the outputs of the other convolution layers of the output layer module and of all convolution layers contained in the backbone unit 215, the down-sampling units 211-214 and the up-sampling units 221-225 are processed by BN (Batch Normalization) and an activation function (e.g., the ReLU function). The BN layer is a common module in convolutional neural networks, used to normalize the output of each convolution layer in the model and thereby promote model convergence. To keep the model structure diagrams simple and readable, the batch normalization layer and activation function are not drawn after each convolution layer.
FIG. 7 is a flowchart of the steps of the text positioning method provided by the present invention. The method is implemented on the basis of the text positioning model provided by the invention and localizes text lines in a prediction image in an application scenario. The method is based on a fully convolutional neural network and integrates residual networks, transposed convolution, feature fusion, batch normalization and related techniques to enhance the representational capability of the model; it locates text line regions in an image by predicting a Gaussian map of those regions, and can accurately locate horizontal and arbitrarily inclined text lines against complex backgrounds. After the text positioning model has been trained, it can be applied in an application scenario: a prediction image containing the text to be localized is input into the model, the model outputs a two-dimensional Gaussian prediction result map, and the result map is then post-processed to obtain the positions of the text lines in the prediction image. The text positioning method comprises the following steps:
Step 701: extract low-level features of the prediction image with the backbone unit, and down-sample the feature map output by the backbone unit step by step with several down-sampling units to reduce the size of the feature map proportionally;
Step 702: up-sample the feature maps output by the backbone unit and the down-sampling units with several up-sampling units to enlarge the size of the feature map proportionally, fusing the same-dimension feature maps output in the down-sampling path layer by layer in the up-sampling path;
Step 703: reduce the number of channels of the feature map output by the last up-sampling unit layer by layer with several convolution layers, map the output values into the interval (0, 1) with an activation function in the last convolution layer, and output a two-dimensional Gaussian prediction result map;
Step 704: determine the regions where the text lines are located according to the brightness values of the pixels in the two-dimensional Gaussian prediction result map.
In step 701, in order to improve the efficiency of training the deep convolutional neural network, the skip connection structure from residual networks is introduced into the down-sampling unit; at the same time, to reduce the potential loss of features during down-sampling, an additional average pooling path is added when the feature map size is reduced.
In step 702, in one embodiment of the invention, in order to better restore the detail features of the picture while keeping the sampling speed acceptable, the two ways of enlarging the feature map, transposed convolution and nearest-neighbor up-sampling, are fused inside the up-sampling unit.
In one embodiment of the invention, to avoid losing detail from the original feature map through an excessively large sampling factor, the down-sampling units in step 701 down-sample by a factor of 1/2 and, correspondingly, the up-sampling units in step 702 up-sample by a factor of 2.
In the embodiment of the present invention, the outputs of all convolution layers used in step 701, step 702, and step 703 except the last convolution layer in step 703 are processed by batch normalization and activation functions. The purpose of the normalization is to facilitate model convergence.
In an embodiment of the present invention, in an application scenario, in order to obtain the bounding boxes of the text lines from the two-dimensional Gaussian prediction result map, the result map output by the model is post-processed as follows (see the sketch after these steps):
Step 7041: according to a preset threshold, set pixels of the two-dimensional Gaussian prediction result map that are greater than or equal to the threshold to 1 and pixels smaller than the threshold to 0, binarizing the map to obtain a binarized result map.
Step 7042: perform connected-component analysis on the binarized result map to obtain the connected regions of text pixels.
Step 7043: determine a minimum bounding rectangle for each connected region to obtain the four vertices of the rectangle, and determine the text line region from those four vertices.
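A sketch of these three post-processing steps with OpenCV; the threshold value and the function choices are illustrative, not prescribed by the patent.

```python
import cv2
import numpy as np

def locate_text_lines(gauss_map, threshold=0.5):
    """gauss_map: HxW float array in [0, 1] predicted by the model; threshold is illustrative."""
    # Step 7041: binarize with a preset threshold
    binary = (gauss_map >= threshold).astype(np.uint8)
    # Step 7042: connected-component analysis on the binarized map
    num_labels, labels = cv2.connectedComponents(binary)
    boxes = []
    for lbl in range(1, num_labels):                      # label 0 is the background
        ys, xs = np.where(labels == lbl)
        pts = np.column_stack([xs, ys]).astype(np.float32)
        # Step 7043: minimum bounding rectangle -> four vertices of the text line region
        rect = cv2.minAreaRect(pts)
        boxes.append(cv2.boxPoints(rect))
    return boxes
```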
With these post-processing steps, no Non-Maximum Suppression (NMS) is needed and no candidate box sizes need to be defined in advance; accurate localization of text lines at any angle is achieved by predicting the Gaussian map of the text line regions, the threshold can be set flexibly, and processing is faster.
The training process of the text positioning model falls under supervised learning in deep learning, and the model is an end-to-end (End-to-End) model. During training, the trainable parameters in the model are continuously updated according to the difference between the model output and the expected output, so that the model output approaches the expected output as closely as possible, completing the training of the model.
The following describes a training process of the text positioning model provided by the present invention by way of example, including:
S1: initialize the parameters of the text positioning model, and set the loss function used in the training stage and the optimization algorithm used in the optimization stage;
In one embodiment of the invention, before training starts, the parameters in the model are initialized by determining a distribution from the number of input and output neurons in each layer and then drawing random values from that distribution. The weights of each convolution layer in the model are drawn from a uniform distribution [-a, a], where a is determined by the number of input channels d_in and the number of output channels d_out of that layer.
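The patent gives the exact formula for a only as an image, which is not reproduced here. The description (uniform in [-a, a], with a determined by d_in and d_out) matches the widely used Xavier/Glorot uniform scheme, so the sketch below assumes a = sqrt(6 / (d_in + d_out)); treat that constant as an assumption, not the patent's formula.

```python
import math
import torch.nn as nn

def init_conv_weights(model):
    # Assumed Xavier/Glorot-style rule: a = sqrt(6 / (d_in + d_out)), weights ~ U[-a, a]
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
            d_in, d_out = m.in_channels, m.out_channels
            a = math.sqrt(6.0 / (d_in + d_out))
            nn.init.uniform_(m.weight, -a, a)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
```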
The training stage adopts the Mean Square Error (MSE) as the loss function, i.e., the squared differences between the two-dimensional Gaussian distribution map output by the model and the label map are summed to quantify the gap between the model output and the expected output:

Loss = Σ_(i,j) (P(i,j) − T(i,j))²

where P denotes the prediction map output by the model, T denotes the label map, and (i, j) are the coordinates of a pixel in the feature map.
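A minimal PyTorch sketch of this loss; the mean is taken here for numerical convenience, whereas the text sums the squared differences, which differs only by a constant factor.

```python
import torch

def gaussian_map_loss(pred, target):
    """pred, target: (b, 1, h, w) tensors; squared pixel-wise differences between
    the predicted Gaussian map P and the label map T, averaged here."""
    return ((pred - target) ** 2).mean()
```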
S2: a sample image is input into the text positioning model, and the backbone unit extracts its low-level features; several down-sampling units down-sample the feature map output by the backbone unit step by step to reduce the size of the feature map proportionally;
S3: several up-sampling units up-sample the feature maps output by the backbone unit and the down-sampling units to enlarge the size of the feature map proportionally, and the same-dimension feature maps output in the down-sampling path are fused layer by layer in the up-sampling path;
S4: several convolution layers reduce the number of channels of the feature map output by the up-sampling units layer by layer, the output values are mapped into the interval (0, 1) with an activation function in the last convolution layer, and a two-dimensional Gaussian distribution map is output;
S5: the parameters of the text positioning model are updated with a model optimization algorithm according to the error between the two-dimensional Gaussian distribution map output by the model and the label map, a new training sample image is then input for the next iteration, and the model file is saved after the iterations terminate.
In an embodiment of the present invention, a gradient descent optimization algorithm or the Adam optimization algorithm is used for back-propagation during training of the text positioning model. The Adam optimization algorithm is an extension of stochastic gradient descent: unlike traditional stochastic gradient descent, which uses a single learning rate, Adam computes first- and second-moment estimates of the gradient and designs an independent adaptive learning rate for each parameter, so convergence is faster and more stable. The learning rate in the training phase is set to 0.0001.
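A minimal sketch of such a training loop with the Adam optimizer at that learning rate; the tiny one-convolution model and the random tensors are stand-ins for the real network, data loader and label maps.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Sigmoid())   # stand-in for the text positioning model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)            # learning rate 0.0001 as in the text

for step in range(10):                                               # toy iteration loop
    images = torch.rand(2, 3, 64, 64)                                # stand-in training batch
    labels = torch.rand(2, 1, 64, 64)                                # stand-in Gaussian label maps
    loss = ((model(images) - labels) ** 2).mean()                    # MSE between output and label map
    optimizer.zero_grad()
    loss.backward()                                                  # back-propagation
    optimizer.step()                                                 # parameter update
```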
Deep-learning-based models mostly need a large amount of data for training, but large-scale data labeling is time-consuming and labor-intensive. To make the most of the labeled data and reduce overfitting of the model, an embodiment of the present invention applies data augmentation to the input samples during the training phase, i.e., the existing sample images are transformed in multiple ways to obtain more samples. The transformations may include random rotation, random brightness and contrast changes, random scaling, random cropping, adding random noise, and so on. With the larger number of training samples obtained in this way, the trained model predicts better.
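As an illustration only, a sketch of two of these augmentations: a random rotation applied identically to the sample and its Gaussian label map, plus a brightness/contrast change on the sample. The parameter ranges are assumptions.

```python
import cv2
import numpy as np

def augment(image, label_map, rng=np.random):
    """image: HxWx3 uint8 sample; label_map: HxW float32 Gaussian label map."""
    h, w = image.shape[:2]
    angle = rng.uniform(-15, 15)                       # rotation range is illustrative
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    image = cv2.warpAffine(image, m, (w, h))           # same geometric transform for sample...
    label_map = cv2.warpAffine(label_map, m, (w, h))   # ...and for its label map
    alpha = rng.uniform(0.8, 1.2)                      # contrast factor
    beta = rng.uniform(-20, 20)                        # brightness offset
    image = np.clip(alpha * image.astype(np.float32) + beta, 0, 255).astype(np.uint8)
    return image, label_map
```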
In one embodiment of the invention, sample images and their label maps are produced by combining manual annotation with automatic generation: collected sample pictures containing text lines are annotated manually, while at the same time a large number of pictures with complex backgrounds and no text lines are collected and text lines are randomly generated on these background pictures by code, yielding more standard training samples.
FIG. 8 is a schematic diagram of the hardware structure of a text positioning apparatus according to an embodiment of the present invention. The text positioning apparatus 800 includes a processor 801, such as a Central Processing Unit (CPU), an internal bus 802 and a non-transitory storage medium 804. The processor 801 and the non-transitory storage medium 804 can communicate with each other via the internal bus 802. If the functions of the steps of the text positioning method provided by the present invention are implemented as corresponding software modules and the software formed by these modules is installed in the non-transitory storage medium 804 of the text positioning apparatus 800, a text positioning apparatus implementing the text positioning method and model functions of the present invention is obtained: the processor 801 reads and executes the machine-executable instructions corresponding to the text positioning device or text positioning model 805 stored in the non-transitory storage medium 804, thereby realizing the functions and effects of the text positioning method provided by the present invention.
The text line positioning method and the text positioning model provided by the invention can detect text line regions in RGB three-channel color images on the basis of a fully convolutional neural network. In the dataset construction stage, a large number of text line samples against complex backgrounds are generated randomly, and data augmentation methods such as random rotation are combined with them to increase the number of samples. The structure of the model integrates residual networks, transposed convolution, feature fusion, batch normalization and related methods to enhance its representational capability. Compared with existing text line positioning models, it can detect and localize text lines against complex backgrounds, detects inclined text lines efficiently, and offers higher accuracy and better robustness. The method can serve as the text detector in Optical Character Recognition (OCR): the detected text regions are fed into a character recognition model to realize character recognition.
The above description is only an example of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (12)

1. A text positioning device, characterized by comprising a down-sampling module, an up-sampling module and an output layer module;
the down-sampling module consists of a backbone unit and N down-sampling units:
the backbone unit is composed of several convolution layers and extracts low-level features of the input picture; the feature map output by the backbone unit serves as the input of the first down-sampling unit, and is also fused with the feature map output by the (M-1)th up-sampling unit to serve as the input of the Mth up-sampling unit;
each down-sampling unit down-samples its input feature map; the width and height of the feature map output by each down-sampling unit are reduced in proportion to the input feature map; except for the Nth down-sampling unit, the output of each down-sampling unit serves as the input of the next down-sampling unit, and the output of the last down-sampling unit serves as the input of the first up-sampling unit;
the up-sampling module consists of M up-sampling units:
each up-sampling unit up-samples its input feature map; the width and height of the feature map output by each up-sampling unit are enlarged in proportion to the input feature map, and the enlargement ratio of an up-sampling unit matches the reduction ratio of a down-sampling unit; except for the up-sampling unit connected to the output layer module, the feature map output by each up-sampling unit in the up-sampling path is fused with the feature map of the same dimensions produced in the down-sampling path (by the backbone unit or by one of the down-sampling units other than the Nth) and then serves as the input of the next up-sampling unit; the output of the Mth up-sampling unit serves as the input of the output layer module;
the output layer module is composed of several convolution layers that gradually reduce the number of channels of the feature map, and its output is a two-dimensional Gaussian distribution map;
the down-sampling unit comprises a first convolution subunit near the output side, a third convolution subunit near the input side and a second convolution subunit in between:
the first and second convolution subunits do not change the width and height of the feature map, and the third convolution subunit reduces the width and height of the feature map to 1/2;
the first convolution subunit comprises two paths, one with three convolution layers and the other with one convolution layer and an average pooling layer; the outputs of the two paths are fused by element-wise addition to give the output of the first convolution subunit;
the second convolution subunit comprises two paths, one with three convolution layers and the other with no operator at all; the two paths are fused by element-wise addition to give the output of the second convolution subunit;
the third convolution subunit comprises two paths, one consisting of a convolution layer and a transposed convolution layer and the other consisting of a convolution layer and a nearest-neighbor up-sampling layer; the two paths are fused by element-wise addition to give the output of the third convolution subunit;
the output of the third convolution subunit serves as the input of the second convolution subunit, the output of the second convolution subunit serves as the input of the first convolution subunit, and the output of the first convolution subunit is the output of the down-sampling unit.
2. The device of claim 1, wherein
the up-sampling unit comprises two paths: one path consists of two convolution layers with a transposed convolution layer between them, and the other path comprises a nearest-neighbor up-sampling layer and a convolution layer; the convolution layers in the up-sampling unit scale the number of channels of the feature map, while the transposed convolution layer and the nearest-neighbor up-sampling layer enlarge its width and height;
the height and width of the feature map output by the up-sampling unit are doubled relative to the input.
3. The device of claim 2, wherein
the output layer module is composed of several convolution layers that introduce nonlinearity while gradually reducing the number of channels of the feature map, increasing the capability to characterize feature combinations; the output of the last convolution layer in the output layer module is processed by an activation function that maps the output values into the interval (0, 1), after which the two-dimensional Gaussian distribution map is output.
4. The device of claim 3, wherein
except for the last convolution layer in the output layer module, the outputs of the other convolution layers of the output layer module and of all convolution layers contained in the backbone unit, the down-sampling units and the up-sampling units are processed by batch normalization and an activation function.
5. A method for locating text, the method comprising:
extracting low-level features of the prediction map using a skeleton unit, and down-sampling the feature map output by the skeleton unit using a plurality of down-sampling units step by step to reduce the size of the feature map proportionally;
the feature maps output by the main unit and the down-sampling units are up-sampled by using a plurality of up-sampling units so as to proportionally increase the size of the feature maps, and the feature maps with the same dimensionality output in the down-sampling paths are fused layer by layer in the up-sampling paths;
the down-sampling unit is used for down-sampling the input feature map; the feature diagram size output by each down-sampling unit is reduced proportionally to the width and height of the input feature diagram, except for the Nth down-sampling unit, the output of each down-sampling unit is used as the input of the next down-sampling unit, and the output of the last down-sampling unit is used as the input of the first up-sampling unit;
the up-sampling unit is used for up-sampling the input characteristic diagram; the size of the characteristic diagram output by each up-sampling unit is proportionally enlarged relative to the width and height of the input characteristic diagram; the amplification ratio of the up-sampling unit is the same as that of the down-sampling unit; except the up-sampling unit connected with the output layer module in the up-sampling path, the feature graph output by each up-sampling unit is fused with the feature graph with the same dimensionality output by the main unit and each down-sampling unit except the Nth down-sampling unit in the down-sampling path and then used as the input of the next up-sampling unit, and the output of the Mth up-sampling unit is used as the input of the output layer module;
the down-sampling unit comprises a first convolution subunit close to the output side, a third convolution subunit close to the input side, and a second convolution subunit located between them:
the first convolution subunit and the second convolution subunit do not change the width and height of the feature map, while the third convolution subunit halves the width and height of the feature map;
the first convolution subunit comprises two paths: one path comprises three convolution layers, the other comprises one convolution layer and an average pooling layer, and the outputs of the two paths are fused by element-wise addition to give the output of the first convolution subunit;
the second convolution subunit comprises two paths: one path comprises three convolution layers, the other contains no operation (an identity shortcut), and the outputs of the two paths are fused by element-wise addition to give the output of the second convolution subunit;
the third convolution subunit comprises two paths: one path consists of a convolution layer and a transposed convolution layer, the other consists of a convolution layer and a nearest neighbor upsampling layer, and the outputs of the two paths are fused by element-wise addition to give the output of the third convolution subunit;
the output of the third convolution subunit is used as the input of the second convolution subunit, the output of the second convolution subunit is used as the input of the first convolution subunit, and the output of the first convolution subunit is the output of the downsampling unit;
reducing the number of channels of the feature map output by the last up-sampling unit layer by layer using a plurality of convolution layers, mapping the output values to the interval (0, 1) with an activation function after the last convolution layer, and outputting a two-dimensional Gaussian prediction result map;
and determining the text line regions according to the brightness values of the pixels in the two-dimensional Gaussian prediction result map.
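The first and second convolution subunits of the down-sampling unit described in claim 5 are residual-style blocks; a sketch of both follows (the width-and-height-halving third subunit is not shown). Kernel sizes and channel counts are assumptions, and batch normalization and activations are omitted for brevity.

```python
import torch
import torch.nn as nn

class SecondConvSubunit(nn.Module):
    """Three convolutions on one path, an identity shortcut on the other,
    fused by element-wise addition; width and height are unchanged."""
    def __init__(self, ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=1),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),
            nn.Conv2d(ch, ch, kernel_size=1),
        )

    def forward(self, x):
        return self.convs(x) + x  # identity shortcut

class FirstConvSubunit(nn.Module):
    """Same three-convolution path, but the shortcut carries a 1x1 convolution
    followed by a stride-1 average pooling layer, so spatial size is still unchanged."""
    def __init__(self, ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=1),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),
            nn.Conv2d(ch, ch, kernel_size=1),
        )
        self.shortcut = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=1),
            nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
        )

    def forward(self, x):
        return self.convs(x) + self.shortcut(x)

# Both blocks preserve the 64x64 spatial size of a 32-channel feature map.
x = torch.randn(1, 32, 64, 64)
assert FirstConvSubunit(32)(x).shape == SecondConvSubunit(32)(x).shape == x.shape
```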
6. The method of claim 5,
a skip connection structure from residual networks is introduced into the down-sampling unit, together with an additional average pooling path;
the up-sampling unit enlarges the feature map size through two paths, transposed convolution and nearest neighbor upsampling, whose outputs are fused.
7. The method of claim 5, wherein the step of determining the region of the text line according to the brightness values of the pixels in the two-dimensional Gaussian prediction result map comprises:
binarizing the two-dimensional Gaussian prediction result map according to a preset threshold, setting pixels greater than or equal to the threshold to 1 and pixels smaller than the threshold to 0, to obtain a binarization result map;
performing connected-domain analysis on the binarization result map to obtain connected regions of text pixels;
and determining a minimum enclosing rectangle for each connected region, obtaining the four vertices of the rectangle, and determining the text line region from the four vertices.
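A post-processing sketch for claim 7 using OpenCV and NumPy: threshold the Gaussian map, binarize it, run connected-component analysis, and take the minimum-area rectangle of each region to obtain four vertices. The threshold value of 0.5 and the specific OpenCV routines are assumptions; the claim only requires a preset threshold and a minimum enclosing rectangle.

```python
import cv2
import numpy as np

def text_line_boxes(gaussian_map, threshold=0.5):
    """From the predicted two-dimensional Gaussian map to text-line quadrilaterals."""
    # 1. Binarize: pixels >= threshold become 1, the rest become 0.
    binary = (gaussian_map >= threshold).astype(np.uint8)

    # 2. Connected-component analysis on the binarization result map.
    num_labels, labels = cv2.connectedComponents(binary)

    boxes = []
    for label in range(1, num_labels):                # label 0 is the background
        ys, xs = np.where(labels == label)
        points = np.stack([xs, ys], axis=1).astype(np.float32)
        # 3. Minimum-area enclosing rectangle and its four vertices.
        rect = cv2.minAreaRect(points)
        boxes.append(cv2.boxPoints(rect))             # 4 x 2 array of corner coordinates
    return boxes
```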
8. The method according to claim 5, wherein the down-sampling unit reduces the width and height of the feature map to 1/2, and the corresponding up-sampling unit enlarges them by a factor of 2.
9. A training method for a text positioning model, characterized in that the method is applied to the text positioning model and comprises the following steps:
initializing parameters of the text positioning model, and setting a loss function adopted in a training stage and an optimization algorithm used in an optimization stage;
inputting a training sample image into the text positioning model, where the backbone unit in the text positioning model extracts low-level features of the training sample image; a plurality of down-sampling units in the text positioning model progressively down-sample the feature map output by the backbone unit to proportionally reduce the size of the feature map;
a plurality of up-sampling units in the text positioning model up-sample the feature maps output by the backbone unit and the down-sampling units to proportionally increase the size of the feature maps, and the feature maps of the same dimensions output in the down-sampling path are fused layer by layer in the up-sampling path;
the down-sampling unit is used for down-sampling the input feature map; the size of the feature map output by each down-sampling unit is reduced proportionally in width and height relative to the input feature map; except for the Nth down-sampling unit, the output of each down-sampling unit serves as the input of the next down-sampling unit, and the output of the last down-sampling unit serves as the input of the first up-sampling unit;
the up-sampling unit is used for up-sampling the input feature map; the size of the feature map output by each up-sampling unit is enlarged proportionally in width and height relative to the input feature map, and the enlargement ratio of the up-sampling unit is the same as the reduction ratio of the down-sampling unit; except for the up-sampling unit connected to the output layer module in the up-sampling path, the feature map output by each up-sampling unit is fused with the feature map of the same dimensions output by the backbone unit or by a down-sampling unit other than the Nth down-sampling unit in the down-sampling path, and the fused result serves as the input of the next up-sampling unit; the output of the Mth up-sampling unit serves as the input of the output layer module;
the down-sampling unit comprises a first convolution subunit close to the output side, a third convolution subunit close to the input side, and a second convolution subunit located between them:
the first convolution subunit and the second convolution subunit do not change the width and height of the feature map, while the third convolution subunit halves the width and height of the feature map;
the first convolution subunit comprises two paths: one path comprises three convolution layers, the other comprises one convolution layer and an average pooling layer, and the outputs of the two paths are fused by element-wise addition to give the output of the first convolution subunit;
the second convolution subunit comprises two paths: one path comprises three convolution layers, the other contains no operation (an identity shortcut), and the outputs of the two paths are fused by element-wise addition to give the output of the second convolution subunit;
the third convolution subunit comprises two paths: one path consists of a convolution layer and a transposed convolution layer, the other consists of a convolution layer and a nearest neighbor upsampling layer, and the outputs of the two paths are fused by element-wise addition to give the output of the third convolution subunit;
the output of the third convolution subunit is used as the input of a second convolution subunit, the output of the second convolution subunit is used as the input of a first convolution subunit, and the output of the first convolution subunit is the output of the downsampling unit;
a plurality of convolution layers in the text positioning model reduce the number of channels of the feature map output by the last up-sampling unit layer by layer, and the output values are mapped to the interval (0, 1) with an activation function after the last convolution layer to output a two-dimensional Gaussian distribution map;
and updating the parameters of the text positioning model with a model optimization algorithm according to the error between the two-dimensional Gaussian distribution map output by the text positioning model and the label map, then inputting a new training sample image for the next iteration, and storing the model file after the iterations terminate.
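To make the overall wiring of claims 5 and 9 concrete, the sketch below assembles a backbone, N = 3 down-sampling stages and M = 4 up-sampling stages with layer-by-layer fusion of same-sized encoder features, ending in an output head. The individual units are deliberately reduced to single strided or transposed convolutions, and N, M and the channel counts are assumptions, so only the data flow, not the patented block design, is shown.

```python
import torch
import torch.nn as nn

class TextLocator(nn.Module):
    """Encoder-decoder wiring only: a backbone, N = 3 down-sampling stages, M = 4
    up-sampling stages with skip fusion, and an output head. The real units in the
    patent are replaced by single (transposed) convolutions here."""
    def __init__(self, channels=(32, 64, 128, 256)):
        super().__init__()
        c0, c1, c2, c3 = channels
        self.backbone = nn.Conv2d(3, c0, 3, stride=2, padding=1)      # low-level features
        self.down = nn.ModuleList(
            nn.Conv2d(ci, co, 3, stride=2, padding=1)                 # halves H and W
            for ci, co in [(c0, c1), (c1, c2), (c2, c3)]
        )
        self.up = nn.ModuleList(
            nn.ConvTranspose2d(ci, co, 2, stride=2)                   # doubles H and W
            for ci, co in [(c3, c2), (c2, c1), (c1, c0), (c0, c0)]
        )
        self.head = nn.Sequential(nn.Conv2d(c0, 1, 1), nn.Sigmoid())  # map into (0, 1)

    def forward(self, x):
        feats = [self.backbone(x)]
        for d in self.down:
            feats.append(d(feats[-1]))            # outputs of down-sampling units 1..N
        y = feats[-1]                             # the Nth output feeds the first up-sampling unit
        skips = list(reversed(feats[:-1]))        # same-sized encoder features, Nth output excluded
        for i, u in enumerate(self.up):
            y = u(y)
            if i < len(skips):                    # the last up-sampling unit is not fused
                y = y + skips[i]
        return self.head(y)

# Shape check with this placeholder wiring: a 3x256x256 image gives a 1x256x256 map.
assert TextLocator()(torch.randn(1, 3, 256, 256)).shape == (1, 1, 256, 256)
```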
10. The method of claim 9, characterized in that,
during training, the mean squared error is used as the loss function of the text positioning model, that is, the sum of the squared differences between the two-dimensional Gaussian distribution map output by the model and the label map serves as the loss function of the model;
during back propagation, a gradient descent optimization algorithm or the Adam optimization algorithm is adopted.
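A training-loop sketch matching claims 9 and 10: mean squared error against the Gaussian label map, Adam as the optimizer, iteration over training samples, and saving the model file at the end. The epoch count, learning rate, data loader, and checkpoint path are assumptions.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3, checkpoint="text_locator.pt"):
    """MSE loss against the Gaussian label map, Adam optimizer, iterate, then save."""
    criterion = nn.MSELoss()   # mean of squared differences; the claim words it as a sum,
                               # which differs only by a constant scale factor
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for image, label_map in loader:            # label_map: the two-dimensional Gaussian target
            pred = model(image)
            loss = criterion(pred, label_map)
            optimizer.zero_grad()
            loss.backward()                        # back-propagate the error
            optimizer.step()                       # update the model parameters
    torch.save(model.state_dict(), checkpoint)     # store the model file after iteration ends
```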
11. The method of claim 9, further comprising:
the number of training sample images is increased by having code randomly generate text lines on background images, and the text positioning model is trained with the augmented samples.
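One way to realize the augmentation of claim 11 is to paint random text lines onto background images with Pillow, as in the sketch below. The font, character set, line count, and placement ranges are all assumptions, and the returned positions would still need to be turned into the two-dimensional Gaussian label maps.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def synthesize_sample(background_path, lines=3):
    """Draw randomly placed text lines on a background image and return the image
    together with the rough line positions (for building the Gaussian label map)."""
    image = Image.open(background_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    width, height = image.size
    font = ImageFont.load_default()                # a .ttf via ImageFont.truetype would look better
    placed = []
    for _ in range(lines):
        text = "".join(random.choices("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",
                                      k=random.randint(5, 12)))
        x = random.randint(0, max(1, width - 150))
        y = random.randint(0, max(1, height - 20))
        draw.text((x, y), text, fill=(0, 0, 0), font=font)
        placed.append((x, y, text))
    return image, placed
```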
12. A text positioning device, the text positioning device comprising: a processor and a non-transitory storage medium, the processor reading and executing instruction code from the non-transitory storage medium to implement the method of any one of claims 5 to 8.
CN202010132023.3A 2020-02-29 2020-02-29 Text positioning method and device and model training method Active CN111476226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010132023.3A CN111476226B (en) 2020-02-29 2020-02-29 Text positioning method and device and model training method

Publications (2)

Publication Number Publication Date
CN111476226A CN111476226A (en) 2020-07-31
CN111476226B true CN111476226B (en) 2022-08-30

Family

ID=71748053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010132023.3A Active CN111476226B (en) 2020-02-29 2020-02-29 Text positioning method and device and model training method

Country Status (1)

Country Link
CN (1) CN111476226B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112132142A (en) * 2020-09-27 2020-12-25 平安医疗健康管理股份有限公司 Text region determination method, text region determination device, computer equipment and storage medium
CN112967296B (en) * 2021-03-10 2022-11-15 重庆理工大学 Point cloud dynamic region graph convolution method, classification method and segmentation method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10679085B2 (en) * 2017-10-31 2020-06-09 University Of Florida Research Foundation, Incorporated Apparatus and method for detecting scene text in an image

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154145A (en) * 2018-01-24 2018-06-12 北京地平线机器人技术研发有限公司 The method and apparatus for detecting the position of the text in natural scene image
US10262235B1 (en) * 2018-02-26 2019-04-16 Capital One Services, Llc Dual stage neural network pipeline systems and methods
CN108805131A (en) * 2018-05-22 2018-11-13 北京旷视科技有限公司 Text line detection method, apparatus and system
CN109948607A (en) * 2019-02-21 2019-06-28 电子科技大学 Candidate frame based on deep learning deconvolution network generates and object detection method
CN110222680A (en) * 2019-05-19 2019-09-10 天津大学 A kind of domestic waste article outer packing Method for text detection
CN110322495A (en) * 2019-06-27 2019-10-11 电子科技大学 A kind of scene text dividing method based on Weakly supervised deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Traffic sign recognition based on deep convolutional neural network; YIN Shi-hao et al.; OPTOELECTRONICS LETTERS; 2017-11-01; Vol. 13, No. 6; full text *
Scene text detection based on fully convolutional networks; He Tong, Yao Jian; Heilongjiang Science and Technology Information; 2016-06-15, No. 17; full text *
Text region localization method based on deep fully convolutional neural networks; Luo Yao et al.; Wireless Internet Technology; 2016-12-31, No. 23; full text *

Similar Documents

Publication Publication Date Title
US11657602B2 (en) Font identification from imagery
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
US9965719B2 (en) Subcategory-aware convolutional neural networks for object detection
WO2019192397A1 (en) End-to-end recognition method for scene text in any shape
CN111553397B (en) Cross-domain target detection method based on regional full convolution network and self-adaption
CN107133622B (en) Word segmentation method and device
US20180025249A1 (en) Object Detection System and Object Detection Method
CN112116599B (en) Sputum smear tubercle bacillus semantic segmentation method and system based on weak supervised learning
CN111488826A (en) Text recognition method and device, electronic equipment and storage medium
CN113785305A (en) Method, device and equipment for detecting inclined characters
CN110569782A (en) Target detection method based on deep learning
CN112541491B (en) End-to-end text detection and recognition method based on image character region perception
CN114155527A (en) Scene text recognition method and device
CN111932577B (en) Text detection method, electronic device and computer readable medium
CN116645592B (en) Crack detection method based on image processing and storage medium
CN112016512A (en) Remote sensing image small target detection method based on feedback type multi-scale training
CN111476226B (en) Text positioning method and device and model training method
KR20210143401A (en) Object detecting system and method
CN113343989A (en) Target detection method and system based on self-adaption of foreground selection domain
CN111274964A (en) Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle
WO2022219402A1 (en) Semantically accurate super-resolution generative adversarial networks
CN112241736A (en) Text detection method and device
CN112365451A (en) Method, device and equipment for determining image quality grade and computer readable medium
CN114708591A (en) Document image Chinese character detection method based on single character connection
CN114170625A (en) Context-aware and noise-robust pedestrian searching method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant