Disclosure of Invention
In order to solve the problem of scene text recognition in pictures with noisy background information, the invention aims to provide a scene text recognition method based on fine character segmentation, which can effectively recognize irregular text, such as curved and slanted text, in pictures with noisy background information.
The technical scheme provided by the invention is a scene text recognition method based on fine character segmentation, which is realized by a processor executing program instructions and comprises the following steps:
receiving an input picture containing scene text with a specified size;
processing the input picture into a text segmentation map of the same size by using a character segmentation network, based on a fully convolutional structure, in the scene text recognition network, wherein the text segmentation map contains, at each pixel, character distribution feature information of the input picture;
and obtaining a text recognition result of the input picture according to the text segmentation map by using an attention-based text recognition network in the scene text recognition network.
Preferably, the character segmentation network is a fully convolutional network based on ResNet.
Preferably, before producing its output, the character segmentation network fuses the lower-layer output feature maps of the downsampling stage with the higher-layer output feature maps, the latter being upsampled to the same size as the lower-layer feature maps before the fusion.
Preferably, the method for obtaining the text recognition result of the input picture by the text recognition network includes: obtaining a feature map V of the text segmentation map through a feature extractor, and then recognizing the feature map V by using an attention-based encoder-decoder structure.
Preferably, the feature map V comprises feature vectors obtained by maximum pooling along the text extension direction of the input picture.
Preferably, the encoder-decoder structure based on the attention mechanism includes: an encoder consisting of two layers of LSTM, and a decoder consisting of two layers of LSTM; the encoder and the decoder do not share parameters. Further preferably, the encoder receives one column of the feature map V at each time step and then performs maximum pooling in the direction orthogonal to the text extension direction; after W time steps, W being the width of the feature map V, the final hidden state of the second LSTM layer of the encoder is output to the decoder; this final hidden state is regarded as a fixed-size representation of the input image and embodies the holistic features of the input picture. Further preferably, the decoder receives the holistic features output by the encoder at its time step 0. The START token, i.e. a "START" token with a fixed encoding, is then input to the LSTM at step 1. Starting at step 2, the output of the previous step is fed back to the LSTM input until the decoder receives the END token, i.e. a fixed encoded "END" token, after which it terminates and outputs the result.
Preferably, the method by which the text recognition network implements the attention mechanism is that the information of the regions adjacent to each position of the text segmentation map is taken into account according to the following mathematical model and participates in the decoding of the decoder:
$e_{ij} = \tanh\Big(W_v v_{ij} + W_h h'_t + \sum_{(p,q)\in N_{ij}} W_{p-i,\,q-j}\, v_{pq}\Big)$

$\alpha_{ij} = \dfrac{\exp(e_{ij})}{\sum_{p,q}\exp(e_{pq})}$

$g_t = \sum_{i,j} \alpha_{ij}\, v_{ij}$

wherein $v_{ij}$ represents the local feature at position (i, j) in the text segmentation map V; $N_{ij}$ is the set of the local features of the eight neighboring points around position (i, j), i.e. the other 8 points within the 3 × 3 range centered on (i, j); $h'_t$ is the hidden information of the decoder at time step t, used as the information that guides decoding; $W_v$, $W_h$ and $W_*$ are linear transformation matrices, related to the respective subscripts, that need to be updated and trained; $\alpha_{ij}$ is the attention weight at position (i, j); and $g_t$ is the weighted sum of the local features, which is regarded as the glimpse in the attention mechanism.
Preferably, in the training phase, in the loss function of the scene text recognition network, the character segmentation loss function is a binary (two-class) cross-entropy loss function, and the text recognition loss function is a cross-entropy loss function based on the real character labels.
The invention effectively improves the accuracy of text recognition by combining the tasks of semantic segmentation and text recognition on the text picture. By segmenting the text picture, the foreground (characters) and the background of the picture are finely separated, background information is filtered out, and its interference with the text recognition process is reduced, thereby reducing the difficulty of text recognition. In the text recognition stage, a two-dimensional attention mechanism converts the information of the picture into an attention weight matrix and can automatically locate the features of the relevant region at each prediction step, which improves the recognition effect and solves the problem of poor recognition of curved or slanted text. The method can recognize not only standard horizontal scene text but also scene text under various conditions such as curving and slanting, so the whole system has strong practicability.
Detailed Description
It is first noted that, unless specifically stated otherwise, the term "model" as used herein in relation to deep-learning network models, text recognition network models, and the like refers to a set of mathematical algorithms implemented by sequences of computer program instructions which, through various configuration parameters and defined input data, are read by a processor and used to process computer data so as to achieve a specified technical effect. In the following embodiments of the inventive concept disclosed herein, the specific implementation code can be realized by means of common general knowledge in the art once the specific concept is understood, and such implementation details are therefore not repeated.
Referring to fig. 1, the scene text recognition method based on fine character segmentation in the various embodiments of the present invention relies on a neural network model for scene text recognition, hereinafter referred to as the scene text recognition network, which includes two main network structures. The first part is a fully convolutional network (FCN) based on ResNet, which finely segments the input picture and obtains character distribution features at the picture size, namely the character segmentation feature information; it is referred to in the invention as the character segmentation network. The second part is a character recognition network which recognizes the segmented picture carrying the character segmentation feature information, and realizes scene text recognition through an encoder-decoder structure and an attention mechanism. In the first part, the invention adopts a ResNet-based FCN structure to segment the picture: the input is first subjected to multi-layer convolutional downsampling of the ResNet structure to obtain an intermediate feature, and the feature map is then restored by deconvolution upsampling to a text segmentation map (Segment map) with the same size as the input image, where the text segmentation map contains the predicted character information at each pixel of the picture. In the second part, the text segmentation map obtained through this segmentation idea is input into the text recognition network for recognition; in the text recognition network, feature information is first extracted by a CNN and then recognized by an encoder-decoder structure based on an attention mechanism. It should be noted that, unlike other segmentation-based text recognition methods in the prior art, the main idea of the present invention is to use pixel-level character segmentation information to perform supervised training of the character segmentation network, so that the information of the text can be extracted more accurately while the background information of the picture is ignored as much as possible, which greatly reduces the difficulty of text recognition for the text recognition network.
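As a purely illustrative sketch of this two-part structure (not the literal implementation of the embodiment), the composition of the two networks could look as follows in Python/PyTorch; the class name SceneTextRecognitionNet and the assumption that the two sub-networks are supplied as separate modules are hypothetical:

```python
import torch
import torch.nn as nn

class SceneTextRecognitionNet(nn.Module):
    """Sketch: a ResNet-based FCN produces a per-pixel text segmentation map,
    and an attention-based recognizer reads the text from that map."""

    def __init__(self, seg_net: nn.Module, recog_net: nn.Module):
        super().__init__()
        self.seg_net = seg_net      # character segmentation network (first part)
        self.recog_net = recog_net  # attention-based text recognition network (second part)

    def forward(self, image: torch.Tensor):
        # image: (N, 3, 64, 192) -- RGB input picture resized to 192 x 64
        seg_map = self.seg_net(image)    # (N, 1, 64, 192) per-pixel character probability
        text = self.recog_net(seg_map)   # character predictions for the scene text
        return seg_map, text
```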
Referring to fig. 2 to 5, the embodiment is a scene text recognition method based on fine character segmentation, implemented by a system of the scene text recognition network executed in a processor. The processor receives a scene text picture and, by running the implementation program of the scene text recognition network, the entire system uses a fully convolutional network (FCN) based on ResNet to finely segment the input picture, then inputs the result obtained by the segmentation into a text recognition model based on an encoder-decoder structure for recognition, and obtains the text recognition result of the scene text picture.
Each neural network module configured in the scene text recognition network of the embodiment, and its working principle, are explained below through the specific network structure and processing process.
In the first part of the scene text recognition network, as an example, the text picture is segmented in the present embodiment using a fully convolutional network (FCN) based on ResNet34. Firstly, any input picture is scaled by a preprocessing step into a picture with a uniform size of 192 × 64, namely a width W of 192 and a height H of 64, and the input picture is expanded over the three RGB channels to obtain a digitized representation of the input picture (Input Image), a tensor of 192 × 64 × 3. Referring to fig. 3, the ResNet34-based FCN of the present embodiment adopts the ResNet34 backbone structure in the downsampling process, comprising four residual layers, and feature maps of 48 × 16 × 64, 24 × 8 × 128, 12 × 4 × 256 and 6 × 2 × 512 are obtained from layer1 to layer4, respectively; the 6 × 2 × 512 feature map is the intermediate feature of the invention, whose spatial size has been reduced to a low dimension while the channel dimension is retained, i.e. it is not reduced to a one-dimensional vector. Then, the text segmentation map with the size of 192 × 64 × 1 is obtained by sequentially performing upsampling in a deconvolution manner; its width and height are the same as those of the input picture, and the number of channels is reduced to 1. In order to take into account the more detailed information of the earlier residual processing, during the upsampling process the character segmentation network successively adds in the output information of conv, layer1, layer2 and layer3. Specifically, in this embodiment, the output of layer4 is upsampled by a factor of two and then added to and fused with the output of layer3 to obtain a new output; this output is in turn upsampled by a factor of two and fused with the output of layer2, and so on. The upsampling used here is specifically a deconvolution with a stride of 2 and a convolution kernel size of 3 × 3. Finally, the fused feature map with the size of 96 × 32 × 64 is upsampled by a factor of two to obtain the 192 × 64 × 1 output, namely the text segmentation map of the invention. The number of channels of the output text segmentation map is 1, i.e. the number of classes is 1, which is used to distinguish the characters from the background in the text picture, thereby reducing the interference of background information in the recognition process.
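A minimal sketch of this downsampling/upsampling-with-fusion scheme, assuming a standard torchvision ResNet-34 backbone; the decoder layer names (up4 … up0) and the use of a sigmoid to turn the single output channel into a per-pixel character probability are assumptions, and the sizes in the comments follow the W × H × C convention of this description:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

def up(in_ch: int, out_ch: int) -> nn.Module:
    # 2x deconvolution: 3 x 3 kernel, stride 2 (output_padding keeps exact doubling)
    return nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
                              padding=1, output_padding=1)

class CharSegmentationFCN(nn.Module):
    """Sketch of the ResNet34-based FCN: downsample with the ResNet34 backbone,
    then upsample with 2x deconvolutions, fusing (adding) the outputs of conv,
    layer1, layer2 and layer3 along the way. Sizes in comments are W x H x C."""

    def __init__(self):
        super().__init__()
        r = resnet34()
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu)  # 192x64x3 -> 96x32x64
        self.pool = r.maxpool                              # -> 48x16x64
        self.layer1, self.layer2 = r.layer1, r.layer2      # -> 48x16x64, 24x8x128
        self.layer3, self.layer4 = r.layer3, r.layer4      # -> 12x4x256, 6x2x512
        self.up4 = up(512, 256)  # 6x2   -> 12x4,  fused with layer3
        self.up3 = up(256, 128)  # 12x4  -> 24x8,  fused with layer2
        self.up2 = up(128, 64)   # 24x8  -> 48x16, fused with layer1
        self.up1 = up(64, 64)    # 48x16 -> 96x32, fused with the stem output
        self.up0 = up(64, 1)     # 96x32 -> 192x64, single-channel segmentation map

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        c0 = self.stem(x)                # 96 x 32 x 64
        c1 = self.layer1(self.pool(c0))  # 48 x 16 x 64
        c2 = self.layer2(c1)             # 24 x 8  x 128
        c3 = self.layer3(c2)             # 12 x 4  x 256
        c4 = self.layer4(c3)             # 6  x 2  x 512 (intermediate feature)
        f = self.up4(c4) + c3
        f = self.up3(f) + c2
        f = self.up2(f) + c1
        f = self.up1(f) + c0
        return torch.sigmoid(self.up0(f))  # 192 x 64 x 1 text segmentation map
```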
In the second part of the scene text recognition network, exemplarily, referring to fig. 4, the segmentation result represented by the obtained text segmentation map is input into a feature extractor, also based on ResNet34, for feature extraction. For each residual block, the projection shortcut (implemented by a 1 × 1 convolution) is used if the input and output sizes differ, and the identity shortcut is used if the sizes are the same. The convolution kernels are all of size 3 × 3. In addition to two 2 × 2 max pooling layers, this embodiment also uses a 2 × 1 max pooling layer with a dimension of 1 in the horizontal direction and 2 in the vertical direction, i.e. pooling more strongly along the vertical axis, in order to retain more information in the horizontal direction; for text extending in the horizontal direction of the picture, this helps to recognize characters with narrow shapes such as "i" and "l". The feature extractor yields a feature map V, with $V \in \mathbb{R}^{W \times H \times D}$; in the present embodiment the feature map V is a tensor of size 48 × 8 × 512, with a width W of 48 and a height H of 8, and it is used to extract the overall features of the whole image. At the same time, as the input content of the 2D attention network of the present embodiment, the extracted feature map V of the picture is input into the encoder-decoder structure based on the 2D attention mechanism for recognition.
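The pooling schedule implied above can be sketched as follows, assuming the two 2 × 2 poolings and the single 2 × 1 pooling are interleaved with the residual blocks (not shown); note that PyTorch kernel sizes are given as (height, width):

```python
import torch
import torch.nn as nn

# Pooling layers of the feature extractor: kernel (2, 1) halves the vertical
# axis only, keeping horizontal resolution for narrow characters such as "i" and "l".
pool_2x2_a = nn.MaxPool2d(kernel_size=2, stride=2)          # W/2, H/2
pool_2x2_b = nn.MaxPool2d(kernel_size=2, stride=2)          # W/2, H/2
pool_2x1 = nn.MaxPool2d(kernel_size=(2, 1), stride=(2, 1))  # H/2 only

x = torch.zeros(1, 1, 64, 192)  # (N, C, H, W): the 192 x 64 text segmentation map
y = pool_2x1(pool_2x2_b(pool_2x2_a(x)))
print(y.shape)  # torch.Size([1, 1, 8, 48]) -> the 48 x 8 spatial size of V
```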
In the second part of the scene text recognition network, exemplarily, the encoder is composed of two layers of LSTM, each layer having a hidden state size of 512. At each time step, the LSTM encoder receives one column of the feature map V along the horizontal width W, which in this example is a tensor of size 8 × 512, and then performs maximum pooling along the vertical axis. After W steps, W being the width of the feature map V, the final hidden state $h_W$ of the second LSTM layer is treated as a fixed-size representation of the input image, embodies the holistic features of the input picture, and is provided to the decoder.
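A minimal sketch of such an encoder, assuming the feature map V is supplied as an (N, W, H, D) tensor; the class name HolisticEncoder is hypothetical:

```python
import torch
import torch.nn as nn

class HolisticEncoder(nn.Module):
    """Sketch of the encoder: at each of the W time steps, one column of the
    feature map V (H x D) is max-pooled over the vertical axis and fed to a
    2-layer LSTM; the final hidden state of the second layer is h_W."""

    def __init__(self, feat_dim: int = 512, hidden: int = 512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (N, W, H, D), e.g. (N, 48, 8, 512)
        cols = v.max(dim=2).values     # vertical max pooling -> (N, W, D)
        _, (h_n, _) = self.lstm(cols)  # h_n: (num_layers, N, hidden)
        return h_n[-1]                 # holistic feature h_W: (N, hidden)
```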
In the second part of the scene text recognition network, exemplarily, the decoder is another 2-layer LSTM model, each layer having a hidden state size of 512. The encoder and decoder do not share parameters. Initially, the holistic feature $h_W$ output by the encoder LSTM is fed to the decoder LSTM at time step 0. The "START" token is then input to the LSTM at step 1. Starting at step 2, the output of the previous step is fed to the LSTM until the "END" token is received. All LSTM inputs are represented by one-hot vectors and then linearly transformed by ψ(). During training, the inputs of the decoder LSTM are replaced with the real character sequence. The output is calculated by the following transformation:
$y_t = \varphi(h'_t, g_t) = \operatorname{softmax}(W_o[h'_t; g_t])$ (1)
wherein $h'_t$ is the current hidden state, $g_t$ is the output of the attention module, and $W_o$ is a linear transformation matrix that embeds the features into a 94-class output space. These 94 classes correspond to 10 digits, 52 uppercase and lowercase English letters, 31 punctuation marks, and an "END" symbol, respectively. It will be readily appreciated that, in some embodiments, if Chinese characters are involved, the classification can be expanded from 94 classes to a larger output space, for example by adding one-hot encodings of 7000 Chinese characters as inputs to the decoder.
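A hypothetical sketch of one decoder step implementing equation (1); the attention module is passed in as a callable (glimpse_fn), and the use of a single linear layer for the one-hot embedding ψ() is an assumption:

```python
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    """Sketch of one decoder step for equation (1): y_t = softmax(W_o [h'_t ; g_t]),
    where h'_t is the hidden state of the 2-layer decoder LSTM and g_t is the
    glimpse returned by the attention module."""

    def __init__(self, num_classes: int = 94, hidden: int = 512, feat_dim: int = 512):
        super().__init__()
        self.embed = nn.Linear(num_classes, hidden)           # psi(): one-hot -> dense input
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.w_o = nn.Linear(hidden + feat_dim, num_classes)  # W_o

    def forward(self, prev_onehot, state, glimpse_fn):
        # prev_onehot: (N, num_classes) one-hot of the previous output (or "START")
        x = self.embed(prev_onehot).unsqueeze(1)  # (N, 1, hidden)
        out, state = self.lstm(x, state)          # out: (N, 1, hidden)
        h_t = out[:, -1]                          # h'_t: (N, hidden)
        g_t = glimpse_fn(h_t)                     # attention glimpse g_t: (N, feat_dim)
        y_t = torch.softmax(self.w_o(torch.cat([h_t, g_t], dim=1)), dim=1)
        return y_t, state
```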
In the second part of the scene text recognition network, referring to fig. 5, the 2D attention module used in this embodiment employs a robust 2D attention mechanism that takes the information of adjacent regions into account, with the following formulas:
$e_{ij} = \tanh\Big(W_v v_{ij} + W_h h'_t + \sum_{(p,q)\in N_{ij}} W_{p-i,\,q-j}\, v_{pq}\Big)$ (2)

$\alpha_{ij} = \dfrac{\exp(e_{ij})}{\sum_{p,q}\exp(e_{pq})}$ (3)

$g_t = \sum_{i,j} \alpha_{ij}\, v_{ij}$ (4)

wherein $v_{ij}$ represents the local feature at position (i, j) in the feature map V; $N_{ij}$ is the set of the local features of the eight neighboring points around position (i, j), i.e. the other 8 points within the 3 × 3 range centered on (i, j); $h'_t$ is the hidden information of the decoder LSTM at time step t, used as the information that guides decoding; $W_v$, $W_h$ and $W_*$ are linear transformation matrices, related to the respective subscripts, that need to be updated and trained; $\alpha_{ij}$ is the attention weight at position (i, j); and $g_t$ is the weighted sum of the local features, which is regarded as the glimpse in the attention mechanism.
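A minimal sketch of this 2D attention module following equations (2) to (4); using a full 3 × 3 convolution for the neighborhood term (which also covers the center point, whereas the formula sums over the other 8 points only) is a simplifying assumption:

```python
import torch
import torch.nn as nn

class TwoDAttention(nn.Module):
    """Sketch of the 2D attention of equations (2)-(4): local features, their
    3 x 3 neighborhood and the decoder hidden state are combined into a score
    e_ij, normalized by softmax into weights alpha_ij, and used to form g_t."""

    def __init__(self, feat_dim: int = 512, hidden: int = 512):
        super().__init__()
        self.w_v = nn.Conv2d(feat_dim, 1, kernel_size=1)             # W_v
        self.w_n = nn.Conv2d(feat_dim, 1, kernel_size=3, padding=1)  # neighborhood weights W_*
        self.w_h = nn.Linear(hidden, 1)                               # W_h

    def forward(self, v: torch.Tensor, h_t: torch.Tensor) -> torch.Tensor:
        # v: (N, D, H, W) feature map V; h_t: (N, hidden) decoder hidden state h'_t
        n, d, h, w = v.shape
        e = torch.tanh(self.w_v(v) + self.w_n(v)                      # eq. (2)
                       + self.w_h(h_t).view(n, 1, 1, 1))
        alpha = torch.softmax(e.view(n, -1), dim=1).view(n, 1, h, w)  # eq. (3)
        g_t = (alpha * v).sum(dim=(2, 3))                             # eq. (4): (N, D)
        return g_t
```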
During training, the input of the decoder is the word vector obtained after the real label is embedded (embedding); the embedding operation maps a character into a word-vector space to obtain a vector, which is used as the input. At test time, because the real label information is unknown, the output of the decoder at the previous step is taken as the output of that step and, after embedding, is used as the input of the current step. Back propagation is only involved during the training phase.
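This training/testing difference can be sketched as follows; the helper name and the linear one-hot embedding are hypothetical:

```python
import torch
import torch.nn as nn

embed = nn.Linear(94, 512)  # psi(): linear embedding of a one-hot character vector

def next_decoder_input(training: bool, prev_gt_onehot: torch.Tensor,
                       prev_probs: torch.Tensor) -> torch.Tensor:
    """During training, the real character of the previous position is embedded
    and fed to the decoder (teacher forcing); at test time the label is unknown,
    so the previous prediction is taken instead and embedded in the same way."""
    if training:
        char = prev_gt_onehot                      # one-hot real label
    else:
        idx = prev_probs.argmax(dim=1)             # greedy previous prediction
        char = nn.functional.one_hot(idx, num_classes=94).float()
    return embed(char)
```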
Exemplarily, the present embodiment provides implementation steps of a scene text recognition method based on fine character segmentation, and the specific process includes the following steps 100 to 400:
and step 100, making character segmentation labels for scene text pictures of the training data set.
Exemplarily, the embodiment selects the synthetic scene text data set SynthText to create a training set suitable for this embodiment of the method. In the prior art, the synthetic scene text data set SynthText only provides the scene text pictures and the bounding box information of each embedded character, and before training, weak labels for character segmentation are made directly from these bounding boxes. The invention considers that the bounding box of a character contains not only the character but also a considerable amount of background area information, and that the weak semantic segmentation labels made from the bounding boxes are therefore very rough. In this embodiment, the character segmentation labels are instead made from the mask of each character obtained during the SynthText synthesis process, and the size of each label is consistent with the size of the corresponding text picture. Since the output size of the character segmentation of the neural network model in this embodiment is 192 × 64, the character segmentation label is also scaled to the size of 192 × 64 when labeling the text picture.
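A sketch of how such labels might be produced, assuming the per-character masks from the SynthText synthesis are available as binary arrays of the same size as the picture; the function name is hypothetical:

```python
import numpy as np
from PIL import Image

def make_segmentation_label(char_masks, out_size=(192, 64)):
    """Merge the per-character masks into one binary character/background label,
    then scale it to the 192 x 64 output size of the character segmentation network."""
    # char_masks: list of H x W arrays, one per character, same size as the picture
    label = np.zeros_like(char_masks[0], dtype=np.uint8)
    for m in char_masks:
        label |= (m > 0).astype(np.uint8)                    # union of all character regions
    label_img = Image.fromarray(label * 255)
    label_img = label_img.resize(out_size, Image.NEAREST)    # keep the label binary
    return (np.asarray(label_img) > 127).astype(np.float32)  # 1 = character, 0 = background
```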
Step 200, preprocessing the scene text pictures.
In order to make the size of the picture input to the model 192 × 64, the picture is resized to 192 × 64 using bilinear interpolation. The data enhancement used in training is random cropping and changing the brightness, contrast, saturation and hue of the image. It should be noted that, in order to keep the character segmentation labels consistent after spatial data enhancement operations such as random cropping are performed on the text picture, this embodiment also adaptively adjusts the character segmentation labels using the data enhancement library Augmentor; adaptive adjustment means that, after a data enhancement operation is performed on the text picture, the same operation, such as rotating by the same angle, must be performed on the character segmentation label at the same time.
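A minimal preprocessing sketch; the specific jitter amplitudes are assumptions, and the corresponding adjustment of the segmentation label for spatial augmentations is omitted here:

```python
from PIL import Image
from torchvision import transforms

def resize_to_model_input(img: Image.Image) -> Image.Image:
    # Bilinear resize to the 192 x 64 model input size (PIL size is (width, height)).
    return img.resize((192, 64), Image.BILINEAR)

# Photometric augmentations used during training (brightness, contrast, saturation, hue);
# spatial augmentations such as random cropping must also be applied to the label.
color_jitter = transforms.ColorJitter(brightness=0.2, contrast=0.2,
                                      saturation=0.2, hue=0.1)
```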
Step 300, training the neural network model.
In this embodiment, the loss function used in training the neural network model includes two parts, one part is the loss of character segmentation, and the other part is the loss of text recognition.
Specifically, the output of the character segmentation network is a matrix of the same size as the input image, and the value of each position represents the probability that the current position is a character region. The loss function used is a two-class cross entropy loss function:
$loss_{segment} = -\big(y\log(x) + (1-y)\log(1-x)\big)$ (5)
wherein x is the predicted value and y is the label serving as the true value, taking the value 0 or 1.
The output of the decoder LSTM passes through a fully connected layer whose output dimension is equal to the number of character classes; a softmax operation is then performed to convert the output vector into a probability distribution over the characters, and the character with the largest probability value is regarded as the prediction result of this step. By analogy, the prediction results obtained over the successive time steps are all of the characters of the scene text. The recognition loss function uses a cross-entropy loss function:

$loss_{pred} = -\sum_{i} gt_i \log(y_i)$ (6)

wherein y is the predicted 94-dimensional vector and gt is the real character label. The final loss function is:
$Loss = loss_{pred} + \alpha \cdot loss_{segment}$ (7)
wherein α is a weighting coefficient; exemplarily, α is set to 1 in the present embodiment.
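A sketch of this combined objective, assuming the segmentation network outputs per-pixel probabilities and the recognizer outputs pre-softmax scores for each time step; the helper name total_loss is hypothetical:

```python
import torch
import torch.nn.functional as F

def total_loss(seg_pred, seg_label, char_logits, char_labels, alpha: float = 1.0):
    """Sketch of equations (5)-(7): binary cross-entropy on the pixel-level
    segmentation map plus cross-entropy on the predicted character distributions,
    weighted by alpha (1 in this embodiment)."""
    # seg_pred / seg_label: (N, 1, 64, 192) probabilities and 0/1 labels
    loss_segment = F.binary_cross_entropy(seg_pred, seg_label)                      # eq. (5)
    # char_logits: (N, T, 94) pre-softmax scores; char_labels: (N, T) class indices
    loss_pred = F.cross_entropy(char_logits.flatten(0, 1), char_labels.flatten())   # eq. (6)
    return loss_pred + alpha * loss_segment                                         # eq. (7)
```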
ADADELTA is chosen as the optimizer to calculate the gradients and perform back propagation. The training batch size is set to 56; one epoch requires 129276 iterations, and 6 epochs are trained in total.
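A hypothetical training loop along these lines, reusing the total_loss sketch above; the data loader is assumed to yield batches of 56 pictures together with their segmentation labels and character labels:

```python
import torch
import torch.nn as nn

def train(model: nn.Module, loader, epochs: int = 6):
    """Training loop sketch: ADADELTA computes the gradients and performs back propagation."""
    optimizer = torch.optim.Adadelta(model.parameters())
    for _ in range(epochs):
        for images, seg_labels, char_labels in loader:
            optimizer.zero_grad()
            seg_pred, char_logits = model(images)  # forward pass
            loss = total_loss(seg_pred, seg_labels, char_logits, char_labels)
            loss.backward()                        # back propagation
            optimizer.step()
```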
Step 400, model application.
After the training process, a plurality of models are obtained, and the optimal model (the one with the smallest loss function value) is selected for application. At this stage the picture data does not need data enhancement; the image only needs to be resized to 192 × 64 and normalized to serve as the input of the model. The parameters of the whole network model are fixed, so the image data only needs to be input and propagated forward. The segmentation map is obtained first and input into the feature extractor and the encoder; it is then automatically passed into the decoding network for decoding, and the character recognition result can be obtained directly from the whole model. When a large number of scene text pictures need to be tested, all the pictures can be integrated into one data file; for example, when the RGB values of all the pictures are stored in a data table, an lmdb-format file can be used so that all the pictures can conveniently be read at one time.
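A hypothetical inference sketch along the lines of this step; the alphabet argument (an assumed list mapping the 94 class indices to characters and "END") and the assumption that the model returns per-step character distributions are placeholders:

```python
import torch
from PIL import Image
from torchvision.transforms import functional as TF

@torch.no_grad()
def recognize(model, image_path: str, alphabet: list) -> str:
    """Inference sketch: resize to 192 x 64, normalize to [0, 1], run one forward
    pass through the fixed network, and read the most probable character per step."""
    img = Image.open(image_path).convert("RGB").resize((192, 64), Image.BILINEAR)
    x = TF.to_tensor(img).unsqueeze(0)  # (1, 3, 64, 192), values in [0, 1]
    model.eval()
    _, char_probs = model(x)            # (1, T, 94) per-step character distributions
    indices = char_probs.argmax(dim=-1).squeeze(0).tolist()
    chars = [alphabet[i] for i in indices]
    return "".join(c for c in chars if c != "END")
```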
The implementation steps above are only exemplary; the time at which each step is carried out depends on its preconditions rather than on the order in which the steps are listed. For example, the making of the character segmentation labels in step 100 can be done in advance and does not have to take place immediately before the preprocessing of step 200.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the modules is only one logical division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In another aspect, the shown or discussed couplings or direct couplings or communication connections between each other may be through interfaces, indirect couplings or communication connections of devices or units, such as calls to external neural network units, and may be in a local, remote or mixed resource configuration form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing device, or each module may exist alone physically, or two or more modules are integrated into one processing device. The integrated module can be realized in a form of hardware or a form of a software functional unit.
The integrated module, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program codes.
The scene text recognition method based on fine character segmentation provided by the invention has been described in detail above. Specific examples have been used herein to explain the principle and implementation of the invention, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.