CN111291629A - Method and device for recognizing text in image, computer equipment and computer storage medium - Google Patents

Method and device for recognizing text in image, computer equipment and computer storage medium

Info

Publication number
CN111291629A
Authority
CN
China
Prior art keywords
text
image
sample image
printing
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010051888.7A
Other languages
Chinese (zh)
Inventor
杨紫崴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ping An Medical Health Technology Service Co Ltd
Original Assignee
Ping An Medical and Healthcare Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Medical and Healthcare Management Co Ltd filed Critical Ping An Medical and Healthcare Management Co Ltd
Priority to CN202010051888.7A priority Critical patent/CN111291629A/en
Publication of CN111291629A publication Critical patent/CN111291629A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The application discloses a method and a device for recognizing text in an image, a computer device, and a computer storage medium, which relate to the technical field of text recognition. The method comprises the following steps: acquiring a text sample image of a needle-like printing font after scenarization processing; inputting the text sample images of the needle-like printing font as training data into network models of different architectures for training, to obtain a text region detection model and a text recognition model; when an image text detection request is received, inputting the image requested to be detected into the text region detection model, and determining the position information of the text region corresponding to the image; and inputting the position information of the text region corresponding to the image together with the image requested to be detected into the text recognition model, to obtain the text information in the image.

Description

Method and device for recognizing text in image, computer equipment and computer storage medium
Technical Field
The present invention relates to the field of text recognition technologies, and in particular, to a method and an apparatus for recognizing text in an image, a computer device, and a computer storage medium.
Background
At present, OCR recognition technology can recognize characters in pictures well and is applied in various fields such as certificate recognition and bill recognition, replacing manual entry to a great extent and saving much of the effort it requires. However, a large amount of labeled data is an essential part of the model training process in OCR (optical character recognition) technology, and collecting it requires considerable manpower, material resources, and time.
One existing approach uses an algorithm to generate character sample data that simulates real scenes in order to augment the labeled data, which can to a certain extent reach the scale of labeled data required for model training. However, the character sample data generated by such scene-simulating algorithms usually has continuous strokes, and it is difficult to cover character samples from certain specific scenes, such as characters printed by a pin (dot-matrix) printer, whose strokes are composed of individual dots. As a result, the character sample data used in the model training process lacks diversity, the trained model cannot fit the actual scene well, and the accuracy of text recognition is affected.
Disclosure of Invention
In view of the above, the present invention provides a method, an apparatus, a computer device and a computer storage medium for recognizing a text in an image, and mainly aims to solve the problem that the accuracy of text recognition is low because a trained model cannot well fit an actual scene due to the lack of diversity of text samples in the training process of the existing text recognition model.
According to an aspect of the present invention, there is provided a method for recognizing text in an image, the method comprising:
acquiring a text sample image of a needle-like printing font after scene processing;
respectively inputting the character sample images of the needle-like printing fonts serving as training data into network models of different architectures for training to obtain a text region detection model and a text recognition model;
when an image text detection request is received, inputting the image requested to be detected into the text region detection model, and determining the position information of the text region corresponding to the image;
and inputting the position information of the text region corresponding to the image and the image requested to be detected into the text recognition model together to obtain the text information in the image.
Further, the acquiring of the text sample image of the needle-like printing font after the scenarization processing specifically includes:
acquiring a printing sample image generated by using a printing mode, and setting an attribute value corresponding to the printing sample image;
and performing scene processing on the printing sample image by changing the attribute value corresponding to the pixel in the printing sample image to obtain a character sample image of the needle-like printing font.
Further, the obtaining a text sample image of a needle-like printed character by changing the attribute value corresponding to the pixel in the print sample image and performing a scenization process on the print sample image specifically includes:
determining an optimal threshold value for dividing corresponding color attribute values of pixels in the printing sample image by using a maximum inter-class variance method;
performing binarization processing on the printing sample image by taking the optimal threshold value as a dividing basis to obtain background pixels and foreground pixels of the printing sample image after binarization processing;
dividing background pixels of the print sample image after the binarization processing into a plurality of background parts according to a preset proportion;
and performing scene processing on the printing sample image according to the pixel value of the corresponding parameter of each background part to obtain a character sample image of the needle-like printing character body.
Further, the determining an optimal threshold value for dividing the color attribute values corresponding to the pixels in the print sample image by using the maximum inter-class variance method specifically includes:
dividing the color attribute values corresponding to pixels in the printing sample image into two groups by using an assumed gray value, and calculating the inter-class variance, wherein one group of color attribute values is greater than the assumed gray value, and the other group of color attribute values is not greater than the assumed gray value;
and determining the assumed gray value at the maximum value of the inter-class variance as the optimal threshold value of the color attribute value by changing the assumed gray value.
Further, the performing a scenization process on the print sample image according to the pixel value of the parameter corresponding to each background portion to obtain a text sample image of the needle-like print body specifically includes:
obtaining a character sample image of a pin-like printing font after a scene with increased contrast by adjusting the pixel value of the corresponding contrast of each background part, so that the character sample image covers scenes with different contrasts;
the fuzzy processing is carried out on the pixel values of the corresponding parameters of each background part, so that the character sample image of the pin-like printing font with the increased fuzzy effect is obtained, and the scene with the fuzzy effect is covered by the character sample image.
Further, the step of inputting the character sample image with the needle-like printing font as training data into network models of different architectures for training respectively to obtain a text region detection model and a text recognition model specifically includes:
marking the position information of the text area in the character sample image of the needle-like printing font, and inputting the marked position information into a first network model for training to obtain a text area detection model;
and labeling the text information in the text area in the character sample image of the needle-like printing font, and inputting the labeled text information into a second network model for training to obtain a text recognition model.
Further, the first network model includes a multilayer structure, and the method includes the steps of labeling the position information of the text region in the text sample image of the needle-like printing font, inputting the labeled position information into the first network model for training to obtain a text region detection model, and specifically includes:
extracting image area characteristics corresponding to the character sample image of the needle-like printing font through the convolution layer of the first network model;
generating horizontal text sequence characteristics according to image region characteristics corresponding to the character sample images through a decoding layer of the first network model;
and determining a text region in the text sample image according to the horizontal text sequence characteristics through a prediction layer of the first network model, and processing the text region to obtain a candidate text line.
According to another aspect of the present invention, there is provided an apparatus for recognizing text in an image, the apparatus comprising:
the acquiring unit is used for acquiring a character sample image of the pin-like printing font after the scene processing;
the training unit is used for inputting the character sample images of the needle-like printing fonts serving as training data into network models of different architectures respectively for training to obtain a text region detection model and a text recognition model;
the determining unit is used for inputting the image requested to be detected into the text region detection model when an image text detection request is received, and determining the position information of the text region corresponding to the image;
and the identification unit is used for inputting the position information of the text region corresponding to the image and the image requested to be detected into the text identification model together to obtain the text information in the image.
Further, the acquisition unit includes:
the device comprises a setting module, a processing module and a display module, wherein the setting module is used for acquiring a printing sample image generated by a printing mode and setting an attribute value corresponding to the printing sample image;
and the processing module is used for performing scene processing on the printing sample image by changing the attribute value corresponding to the pixel in the printing sample image to obtain the character sample image of the needle-like printing character body.
Further, the processing module comprises:
the determining submodule is used for determining an optimal threshold value for dividing the corresponding color attribute values of the pixels in the printing sample image by using a maximum inter-class variance method;
the first processing submodule is used for carrying out binarization processing on the printing sample image by taking the optimal threshold value as a dividing basis to obtain background pixels and foreground pixels of the printing sample image after the binarization processing;
the dividing submodule is used for dividing the background pixels of the print sample image after the binarization processing into a plurality of background parts according to a preset proportion;
and the second processing submodule is used for performing scene processing on the printing sample image according to the pixel value of the corresponding parameter of each background part to obtain a character sample image of the needle-like printing character body.
Further, the determining sub-module is specifically configured to divide the color attribute values corresponding to the pixels in the print sample image into two groups by using an assumed gray value and calculate the inter-class variance, wherein one group of color attribute values is greater than the assumed gray value, and the other group of color attribute values is not greater than the assumed gray value;
the determining sub-module is specifically further configured to determine, as the optimal threshold of the color attribute value, the assumed gray value at the time of the maximum value of the inter-class variance by changing the assumed gray value.
Further, the second processing sub-module is specifically configured to obtain a text sample image of a pin-like printing font after a scene with increased contrast by adjusting pixel values of the corresponding contrasts of the background portions, so that the text sample image covers scenes with different contrasts;
the second processing sub-module is specifically configured to perform blurring processing on the pixel values of the parameters corresponding to the background portions to obtain a text sample image with increased blurring effect and similar to a pin print font, so that the text sample image covers a scene with blurring effect.
Further, the training unit comprises:
the first training module is used for marking the position information of the text area in the character sample image of the needle-like printing font and inputting the marked position information into a first network model for training to obtain a text area detection model;
and the second training module is used for labeling the text information in the text area in the character sample image of the needle-like printing font and inputting the labeled text information into a second network model for training to obtain a text recognition model.
Further, the first network model includes a multi-layer structure,
the first training module is specifically configured to extract image area features corresponding to the text sample image of the pin-like printing font through the convolution layer of the first network model;
the first training module is specifically further configured to generate a horizontal text sequence feature according to an image region feature corresponding to a text sample image through a decoding layer of the first network model;
the first training module is specifically further configured to determine a text region in the text sample image according to the horizontal text sequence feature through a prediction layer of the first network model, and process the text region to obtain a candidate text line.
According to yet another aspect of the invention, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the method for recognition of text in an image when executing the computer program.
According to a further aspect of the invention, a computer storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for recognition of text in an image.
By means of the above technical scheme, the present invention provides a method and an apparatus for recognizing text in an image. A text sample image of a needle-like printing font after scenarization processing is acquired, and because the scenarized text sample image covers richer picture features, the trained text region detection model and text recognition model have stronger scene recognition capability, so that text information in images of different scenes can be recognized in the process of recognizing text in an image. Compared with prior-art methods for recognizing text in an image, the sample data collected from the actual scene is augmented without consuming a large amount of labor to collect samples, which simplifies the sample collection process and saves the time needed to label sample data; moreover, the model trained with the augmented sample data fits the actual scene well, which improves the accuracy of recognizing text in the image.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic flow chart illustrating a method for recognizing text in an image according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating another method for recognizing text in an image according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram illustrating an apparatus for recognizing text in an image according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram illustrating another apparatus for recognizing text in an image according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the invention provides a method for recognizing a text in an image, which can enable a trained model to well fit an actual scene and improve the accuracy of text recognition in the image, and as shown in fig. 1, the method comprises the following steps:
101. and acquiring a text sample image of the needle-like printing font after the scene processing.
The text sample image of the print-like font may be a text sample image generated by a printing manner, such as an invoice image, a document image, and the like.
In the process of acquiring print-like text sample images, a conventional bill image is usually selected as the print-like text sample image. In order to enrich the diversity of the text sample images, different shooting devices can be used to shoot the conventional bill images under different shooting backgrounds, lighting, brightness, shooting angles, and so on, thereby generating text sample images that combine different variations of background, lighting, brightness, etc., so that the text sample images match actual application scenarios in the subsequent training process.
It can be understood that a certain degree of scenarization can be added to the text sample images by adjusting the shooting scene when shooting the bill images. Text sample images with different background colors can also be selected, and gray-scale processing can be applied to the background color of a text sample image to adjust the contrast between its foreground and background, so that the text sample images cover scenes with different contrasts; the gray-scaled image background can further be blurred so that the text sample images cover scenes with a blurring effect. Of course, processing such as adding noise or reducing scale may also be performed, and the manner of scenarization processing is not limited here.
102. And respectively inputting the character sample images of the needle-like printing fonts serving as training data into network models of different architectures for training to obtain a text region detection model and a text recognition model.
The network model for training the text region detection model may use the open-source Connectionist Text Proposal Network (CTPN) framework from "Detecting Text in Natural Image with Connectionist Text Proposal Network". The specific process of training the text region detection model may be as follows: first, the training data is prepared, namely the text sample images of the needle-like printing font and the annotation data corresponding to the text sample images, in which the coordinate information of the text regions in the images is recorded. Before the training data is input into the CTPN network, the coordinate information of each text region in the annotation data is converted into small anchors with a width of 8; by splitting a text region into a set of small text regions and predicting the information in each small region, the accuracy of text region detection can be greatly improved. The CTPN network structure adopts the form CNN + BLSTM + RPN: the CNN is used to extract the spatial features of receptive fields (a receptive field is the region of the input image to which a given node responds after convolution with the convolution kernels); the BLSTM generates horizontal text sequence features based on the spatial features of the receptive fields; and the RPN comprises two parts, anchor classification and bounding-box regression, where the anchor classification determines whether each region is a text region, and a group of vertical strip-shaped candidate text lines is obtained after the bounding-box regression processing.
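As an illustrative, non-limiting example, the following Python sketch shows how an annotated text region might be converted into width-8 anchors before being fed to a CTPN-style detector, as described above. The [x_min, y_min, x_max, y_max] box format and the function name are assumptions for illustration only.

def split_into_anchors(box, anchor_width=8):
    # box: [x_min, y_min, x_max, y_max] of one annotated text region (format assumed)
    x_min, y_min, x_max, y_max = box
    anchors = []
    x = x_min
    while x < x_max:
        anchors.append([x, y_min, min(x + anchor_width, x_max), y_max])
        x += anchor_width
    return anchors

# Example: a 100-pixel-wide ground-truth text line becomes 13 small width-8 anchors.
print(split_into_anchors([40, 120, 140, 152]))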
It should be noted that, the text region output by the pre-trained text region detection model is not directly the text region in the target recognition image, but is a set of candidate text lines in vertical stripes forming the text region in the target recognition image, and the text region in the target recognition image and the position information of the text region may be determined by connecting the set of candidate text lines in vertical stripes to form the text region by using a text line construction algorithm.
The network model for training the text recognition model may adopt the CRNN algorithm from "An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition". After the text sample image of the needle-like printing font and the position information of the text regions annotated in the text sample image pass through the text recognition model, a text recognition result corresponding to each text region in the text sample image is output. The specific process of training the CRNN model may be as follows: first, the training data is stored in label form, using the text sample images of the needle-like printing font and the text information of the text regions in the text sample images. The CRNN network structure adopts the form CNN + RNN + CTC, where the CNN is used to extract the spatial features of receptive fields in the image, the RNN predicts the label distribution of each frame in the image based on the spatial features of the receptive fields, and the CTC integrates the label distributions of all frames into a final label sequence. For example, the input picture is resized to W × 32, and the predicted value output by the text recognition model represents the text information corresponding to a text region in the target recognition image.
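For illustration only, a minimal PyTorch sketch of such a CRNN-style recognizer (CNN + bidirectional LSTM, trained with a CTC loss) is given below; the layer sizes and channel counts are common choices and are assumptions here, not values fixed by the present application.

import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes, img_h=32):
        super().__init__()
        # Convolutional feature extractor: input is a grayscale W x 32 crop.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),    # H/2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),  # H/4
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),                                     # H/8, keep width
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),                                     # H/16
        )
        # Recurrent layer predicts a label distribution for every horizontal step.
        self.rnn = nn.LSTM(256 * (img_h // 16), 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)  # num_classes includes the CTC blank

    def forward(self, x):            # x: (N, 1, 32, W)
        f = self.cnn(x)              # (N, C, H', W')
        n, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(n, w, c * h)  # one feature vector per column
        seq, _ = self.rnn(f)
        return self.fc(seq)          # (N, W', num_classes)

During training, the (N, W', num_classes) output would be log-softmaxed, transposed to (T, N, C), and passed to torch.nn.CTCLoss together with the labeled text of each text region.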
It should be noted that the training data used for training the text region detection model and the text recognition model has rich features of the printed image, so that the trained text region detection model and the trained text recognition model can more fully cover the application scene of the image with printed fonts, and the detection effect of the text region in the image and the recognition effect of the text information in the text region are improved.
103. When an image text detection request is received, inputting the image requested to be detected into the text region detection model, and determining the position information of the text region corresponding to the image.
It can be understood that, through the text region detection model, each image has a corresponding output file, and the output file stores the position information of all candidate text boxes in the image together with a label indicating whether each candidate text box belongs to a text region, where a candidate text box is equivalent to one of the vertical strip-shaped boxes split from a text region.
Specifically, in the process of determining the position information of the text region corresponding to the image, the series of candidate text boxes output by the text region detection model may be recorded in a text document. When generating, based on the text line construction algorithm, the position information of the text region corresponding to the image from the candidate text lines of the image, the label of whether each candidate text box belongs to a text region is taken into account; according to these labels, the series of candidate text boxes is connected into a large text region, thereby forming the text region corresponding to the image and determining the position information of the text region corresponding to the image.
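A rough sketch of this text-line construction idea, chaining neighbouring candidate text boxes that were classified as text into one text region, is shown below; the gap and overlap thresholds are illustrative assumptions rather than values from the present application.

def build_text_lines(boxes, max_gap=16, min_v_overlap=0.7):
    """boxes: list of [x_min, y_min, x_max, y_max] predicted as text."""
    boxes = sorted(boxes, key=lambda b: b[0])          # left-to-right
    lines, current = [], []
    for box in boxes:
        if not current:
            current = [box]
            continue
        prev = current[-1]
        close = box[0] - prev[2] <= max_gap            # horizontal proximity
        overlap = min(prev[3], box[3]) - max(prev[1], box[1])
        ratio = overlap / max(1, min(prev[3] - prev[1], box[3] - box[1]))
        if close and ratio >= min_v_overlap:           # enough vertical overlap -> same line
            current.append(box)
        else:
            lines.append(current)
            current = [box]
    if current:
        lines.append(current)
    # Merge each chain into one bounding box = the text region's position information.
    return [[min(b[0] for b in l), min(b[1] for b in l),
             max(b[2] for b in l), max(b[3] for b in l)] for l in lines]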
104. And inputting the position information of the text region corresponding to the image and the image requested to be detected into the text recognition model together to obtain the text information in the image.
It can be understood that the trained text recognition model has the capability of recognizing text information in a text region, and in the process of training the text recognition model, the text sample image of the needle-like printing font and the label of the text information in the text region in the text sample image are used, and parameters of the text recognition model are continuously adjusted through forward propagation and reverse deviation correction, so that the text information in the text region in the printing font image can be accurately recognized through the image of the text recognition model.
According to the method for recognizing the text in the image, the text sample image of the needle-like printing font after the scene processing is obtained, and the text sample image after the scene processing is covered with richer picture characteristics, so that the text region detection model and the text recognition model obtained through training have higher scene recognition capability, and therefore text information in different scene images can be recognized in the process of recognizing the text in the image. Compared with the method for recognizing the text in the image in the prior art, the method has the advantages that the sample data collected in the actual scene is expanded, a large amount of labor cost is not needed to be consumed to collect the sample, the sample collection process is simplified, the labeling time of the sample data is saved, the model trained by the expanded sample data can be used for well fitting the actual scene, and the accuracy of recognizing the text in the image is improved.
The embodiment of the invention provides another method for recognizing texts in images, so that a trained model can well fit an actual scene, and the accuracy of text recognition in the images is improved, as shown in fig. 2, the method comprises the following steps:
201. acquiring a printing sample image generated by using a printing mode, and setting an attribute value corresponding to the printing sample image.
Generally, when the text information in a pin-printed text image is recognized by a text recognition model, the pin-printed text cannot be recognized well, because the text in the pin-printed image consists of strokes composed of individual dots, whereas the text images generated by ordinary text generation algorithms have continuous strokes.
In order to better identify the text information in the printed text image, the printed sample image generated by using the printing mode can be acquired as the training data of the text identification model, and the diversity of the training data can be enriched by setting the attribute value corresponding to the printed sample image.
Specifically, when setting the attribute values corresponding to the print sample image, the background image of the print sample image can be selected at random and its color mean recorded as Bcolor; the color of the text information in the print sample image can be selected at random and recorded as Tcolor; and the size and spacing of the text font in the print sample image can be selected at random. By setting the attribute values corresponding to the print sample image in this way, the print sample image better matches pin-printed text images in actual application scenarios, so that the generated print sample images have a better training effect in the subsequent model training process.
For example, for the font selection in the print sample image, fonts such as SimSun (Song), FangSong, and SimHei may be set, and one of these fonts is selected at random each time a print sample image is generated; the same applies to the font color and the rotation angle of the print sample image, where a random value within a certain interval is used to set the corresponding attribute value of the print sample image.
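Purely as an illustration of this attribute randomisation, a possible Pillow-based sketch is given below; the font file names, value ranges, and rendering details are assumptions and not part of the described scheme.

import random
import numpy as np
from PIL import Image, ImageDraw, ImageFont

FONTS = ["SimSun.ttf", "FangSong.ttf", "SimHei.ttf"]   # Song, FangSong, HeiTi (assumed file names)

def make_print_sample(text, backgrounds):
    bg = Image.open(random.choice(backgrounds)).convert("RGB")      # random background image
    bcolor = tuple(int(v) for v in np.array(bg).reshape(-1, 3).mean(axis=0))  # background colour mean
    tcolor = tuple(random.randint(0, 255) for _ in range(3))         # random text colour Tcolor
    font = ImageFont.truetype(random.choice(FONTS), size=random.randint(18, 36))
    spacing = random.randint(0, 6)                                    # random line spacing
    angle = random.uniform(-3, 3)                                     # small random rotation
    draw = ImageDraw.Draw(bg)
    draw.text((10, 10), text, fill=tcolor, font=font, spacing=spacing)
    return bg.rotate(angle, expand=True, fillcolor=bcolor), bcolor, tcolor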
202. And performing scene processing on the print sample image by changing the attribute values corresponding to the pixels in the print sample image to obtain a text sample image of the needle-like printing font.
For the embodiment of the present invention, specifically, attribute values corresponding to pixels in a print sample image are changed, an optimal threshold value for dividing color attribute values corresponding to pixels in the print sample image is determined by using a maximum inter-class variance method, and binarization processing is performed on the print sample image by using the optimal threshold value as a dividing basis to obtain background pixels and foreground pixels of the print sample image after binarization processing.
Specifically, an assumed gray value can be used to divide the color attribute values corresponding to the pixels in the print sample image into two groups, and the inter-class variance is calculated, where one group of color attribute values is greater than the assumed gray value and the other group is not greater than the assumed gray value; by changing the assumed gray value, the assumed gray value at which the inter-class variance reaches its maximum is determined as the optimal threshold of the color attribute values.
It should be noted that, in the above, the optimal threshold value of the color attribute value corresponding to the pixel in the divided print sample image is determined by using the maximum inter-class variance method, and black and white may also be directly selected as the color attribute value corresponding to the pixel in the divided print sample image, for example, the print sample image is subjected to binarization processing to obtain the binarization mask images of the image foreground (white) and the image background (black), respectively.
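The following sketch illustrates the maximum inter-class variance (Otsu) search and the subsequent binarisation; the file name is a placeholder, and the use of THRESH_BINARY_INV (making dark text the white foreground) is an assumption about the image polarity.

import cv2
import numpy as np

gray = cv2.cvtColor(cv2.imread("print_sample.png"), cv2.COLOR_BGR2GRAY)  # placeholder file name

def otsu_threshold(gray):
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(256):                                    # the "assumed gray value"
        w0, w1 = hist[:t + 1].sum(), hist[t + 1:].sum()     # pixels <= t vs > t
        if w0 == 0 or w1 == 0:
            continue
        m0 = (hist[:t + 1] * np.arange(t + 1)).sum() / w0
        m1 = (hist[t + 1:] * np.arange(t + 1, 256)).sum() / w1
        var_between = (w0 / total) * (w1 / total) * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

t = otsu_threshold(gray)
# Binarise: here dark text becomes the white foreground and light paper the black background.
_, mask = cv2.threshold(gray, t, 255, cv2.THRESH_BINARY_INV)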
In order to change the background color of the print image and obtain broken, pin-printed strokes, after the print sample image has been binarized using the optimal threshold of the color attribute, the pixels on the mask are traversed row by row from the upper-left corner and the total number N of pixels in each run of continuous white pixels is recorded; the segment length at which a breakpoint is needed is denoted M and the currently traversed pixel is denoted P, and the region on the print sample image corresponding to pixels with P/(2M) > M is set to the background color Bcolor.
For the embodiment of the present invention, the print sample image is specifically subjected to a scenarization process to obtain a text sample image of the needle-like printed character, the background pixels of the print sample image after the binarization process are divided into a plurality of background portions according to a preset proportion, and the print sample image is subjected to a scenarization process for the pixel values of the parameters corresponding to the background portions to obtain a text sample image of the needle-like printed character.
For example, M pixels that become the background color on the original image may be divided into three parts according to a ratio of 1:4:1, and the set colors of the three parts are (Bcolor + Tcolor)/2, Bcolor, (Bcolor + Tcolor)/2, respectively.
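A heuristic sketch of this stroke-breaking step is given below. Because the breakpoint condition is only loosely specified above, the "break every other M-pixel segment" interpretation and the way the 1:4:1 colouring is applied are assumptions made for illustration.

import numpy as np

def break_strokes(img, mask, bcolor, tcolor, m=4):
    """img: HxWx3 print sample; mask: HxW, 255 where the stroke (foreground) is."""
    out = img.copy()
    mid = ((np.array(bcolor) + np.array(tcolor)) // 2).astype(np.uint8)   # (Bcolor + Tcolor) / 2
    h, w = mask.shape
    for y in range(h):
        x = 0
        while x < w:
            if mask[y, x] == 255:
                run_start = x
                while x < w and mask[y, x] == 255:
                    x += 1
                n = x - run_start                       # length N of this white run
                for p in range(n):
                    if (p % (2 * m)) >= m:              # every other M-pixel segment becomes a break
                        k = p % (2 * m) - m             # position inside the M-pixel break
                        third = max(1, m // 6)          # 1:4:1 split of the break
                        colour = mid if (k < third or k >= m - third) else np.array(bcolor)
                        out[y, run_start + p] = colour
            else:
                x += 1
    return out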
In order to change the contrast of the printing sample image, the character sample image of the similar needle printing font after the scene with increased contrast can be obtained by adjusting the pixel value of the corresponding contrast of each background part, so that the character sample image covers the scenes with different contrasts;
the formula for specifically adjusting the contrast of the image of the print sample is as follows:
g(x)=alpha*f(x)+beta
wherein, alpha: random value of 0.2-0.8, beta: random values of 0.2-0.8;
by multiplying the print sample image by alpha and adding beta, a print sample image with different contrast can be obtained, so that the generated print sample image can cover more contrast scenes.
In order to change the blurring effect of the printing sample image, the blurring processing is carried out on the pixel values of the corresponding parameters of each background part to obtain the character sample image of the pin-like printing font after the blurring effect is increased, so that the character sample image covers the scene of the blurring effect.
Specifically, the method for adjusting the blurring degree of the print sample image may include motion blurring and gaussian blurring, and for increasing the motion blurring effect of the print sample image, the print sample image may be subjected to the following operations:
firstly, determining a transformation matrix:
M=cv2.getRotationMatrix2D(center,angle,scale)
M = [  α    β    (1 − α) · center.x − β · center.y
      −β    α    β · center.x + (1 − α) · center.y ]
where
α = scale · cos(angle)
β = scale · sin(angle)
then, obtaining a convolution kernel of the motion blur through affine transformation:
cv2.warpAffine(src,dst,M,dsize)
kernel(x,y)=src(M11*x+M12*y+M13,M21*x+M22*y+M23)
then performing convolution operation;
cv2.filter2D(src,dst,ddepth,kernel,anchor=(-1,-1))
dst(x,y)=Σkernel(x′,y′)*src(x+x′-anchor.x,y+y′-anchor.y)
Finally, after the print sample image is convolved with this motion-blur kernel, the motion blur effect is added, where the blur angle and blur degree of the motion blur can be generated randomly (via random) to ensure the diversity of the print sample images.
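Putting the above calls together, a possible motion-blur sketch is shown below; the kernel-building strategy (rotating a horizontal line kernel) and the degree/angle ranges are illustrative assumptions.

import random
import cv2
import numpy as np

def motion_blur(img, degree=None, angle=None):
    degree = degree or random.randint(5, 15)                   # blur length
    angle = angle if angle is not None else random.uniform(0, 360)
    kernel = np.zeros((degree, degree), dtype=np.float32)
    kernel[degree // 2, :] = 1.0                               # horizontal line kernel
    M = cv2.getRotationMatrix2D((degree / 2, degree / 2), angle, 1.0)
    kernel = cv2.warpAffine(kernel, M, (degree, degree))       # rotate the line to the blur angle
    kernel /= max(kernel.sum(), 1e-6)                          # normalise so brightness is preserved
    return cv2.filter2D(img, -1, kernel)

blurred = motion_blur(cv2.imread("print_sample.png"))          # placeholder file name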
For increasing the gaussian blur effect of a print sample image, the print image sample may be subjected to the following operations:
The Gaussian kernel is first determined by:
cv2.getGaussianKernel(ksize, sigma)
G_i = α · exp( −(i − (ksize − 1)/2)² / (2 · sigma²) )
where i = 0, …, ksize − 1, and α is a scale factor chosen so that Σ_i G_i = 1.
Finally, after the convolution operation is carried out on the printing sample image and the Gaussian kernel, the effect of Gaussian blur can be increased.
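An equivalent sketch for the Gaussian-blur branch is given below; the ksize and sigma ranges are assumptions.

import random
import cv2

img = cv2.imread("print_sample.png")                  # placeholder file name
ksize = random.choice([3, 5, 7])                      # odd kernel size
sigma = random.uniform(0.5, 2.0)

k1d = cv2.getGaussianKernel(ksize, sigma)             # 1-D kernel, sums to 1
kernel = k1d @ k1d.T                                  # separable 2-D Gaussian kernel
gauss_a = cv2.filter2D(img, -1, kernel)               # explicit convolution

gauss_b = cv2.GaussianBlur(img, (ksize, ksize), sigma)  # equivalent built-in call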
203. And marking the position information of the text area in the character sample image of the needle-like printing font, and inputting the marked position information into a first network model for training to obtain a text area detection model.
In order to facilitate the definition of the boundary of the text region, different regions may exist in the text sample image of the pin-like printing font, for example, the text region, the picture region, the blank region, and the like, and the non-text region is not a target region for text region detection, so that the text region needs to be labeled.
The first network model can adopt a CTPN network frame and comprises a 3-layer structure, the first layer is a convolution structure, namely a CNN structure, and spatial information of a receptive field can be learned by extracting image region characteristics corresponding to a text sample image through a convolution layer; the second layer is a decoding layer, namely a BLSTM structure, and generates horizontal text sequence characteristics according to image area characteristics corresponding to character sample images through the decoding layer, so that the sequence characteristics of horizontal texts can be well dealt with; and the third layer is a prediction layer, namely an RPN structure, determines a text region in the text sample image according to the horizontal text sequence characteristics through the prediction layer, and processes the text region to obtain candidate text lines.
Specifically, the prediction layer of the first network model comprises a classification part and a regression part, and in the process of determining the text region in the text sample image according to the horizontal text sequence characteristics through the prediction layer of the network model and processing the text region to obtain candidate text lines, the classification part of the prediction layer of the network model can classify each region in the text sample image according to the horizontal text sequence characteristics to determine the text region in the text sample image; and performing frame regression processing on the text region in the character sample image through a regression part of a prediction layer of the network model to obtain candidate text lines.
In the specific implementation process, in the convolutional layer part, CTPN may select the feature maps of conv5 in the VGG model as the final image features, where the size of the feature maps is H × W × C. Then, because of the sequential relationship among texts, a 3 × 3 sliding window can be used at the decoding layer to extract the 3 × 3 area around each point on the feature maps as the feature vector representation of that point, at which time the size of the feature becomes H × W × 9C; each row is then used as a sequence (of length W) with the height used as the batch_size, the sequences are fed into a 128-dimensional Bi-LSTM, and the output of the decoding layer is W × H × 256. Finally, the output of the decoding layer is fed into the prediction layer, which comprises two parts, anchor classification and bounding-box regression: the anchor classification determines whether each region in the image is a text region, and a group of vertical strip-shaped candidate text lines, each carrying a label of whether it is a text region, is obtained after the bounding-box regression processing.
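For illustration, a simplified PyTorch sketch of this data flow (VGG conv5 features, a 3 × 3 sliding window, a 128-unit bidirectional LSTM, and the classification/regression heads) is given below; the number of anchors per position (k = 10) and other sizes are common CTPN choices assumed here, not values prescribed by the present application.

import torch
import torch.nn as nn
import torchvision

class CTPNSketch(nn.Module):
    def __init__(self, k_anchors=10):
        super().__init__()
        vgg = torchvision.models.vgg16(weights=None)
        self.backbone = vgg.features[:-1]                 # conv5_3 features, C = 512
        self.rnn = nn.LSTM(512 * 9, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, 512)
        self.cls = nn.Linear(512, 2 * k_anchors)          # text / non-text score per anchor
        self.reg = nn.Linear(512, 2 * k_anchors)          # vertical offsets per anchor

    def forward(self, x):                                 # x: (N, 3, H, W)
        f = self.backbone(x)                              # (N, 512, H', W')
        n, c, h, w = f.shape
        f = nn.functional.unfold(f, 3, padding=1)         # 3x3 window -> (N, 512*9, H'*W')
        f = f.transpose(1, 2).reshape(n * h, w, c * 9)    # each row becomes a sequence of length W'
        seq, _ = self.rnn(f)                              # (N*H', W', 256)
        seq = torch.relu(self.fc(seq))
        return self.cls(seq), self.reg(seq)               # per-position anchor scores / offsets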
Further, in order to ensure the accuracy of the prediction of the trained text region detection model, the preset loss function can perform parameter adjustment on the multilayer structure in the text region detection model based on the deviation between the result output by the text region detection model and the data labeled by the real text region. For the embodiment of the invention, the pre-trained loss function mainly comprises 3 parts, wherein the first part is a loss function for detecting whether the Anchor is a text region; the second part is a loss function for detecting regression of the anchor's y-coordinate offset; the third part is the loss function of the x-coordinate offset regression used to detect Anchor.
204. And labeling the text information in the text area in the character sample image of the needle-like printing font, and inputting the labeled text information into a second network model for training to obtain a text recognition model.
The second network model can adopt a CRNN network architecture and comprises 3 layers of structures, the first layer is a convolution structure, namely a CNN structure, and the spatial information of the receptive field can be learned by extracting image region characteristics corresponding to the text sample image through the convolution layer; the second layer structure is a circulation layer, namely an RNN structure, and the label distribution of each frame in the image is predicted through the circulation layer according to the image area characteristics corresponding to the character sample image; the third layer structure is a transcription layer, namely a CTC structure, the label distribution of each frame in the image is integrated and the like through the transcription layer to form a final label sequence, and a text recognition result corresponding to each text region in the character sample image is output.
In the specific implementation process, in the convolutional layer part, the feature sequence of the input text sample image can be extracted automatically; the vectors in the extracted feature sequence are generated from left to right on the feature map, and each feature vector represents the features over a certain width of the image. The recurrent layer part can be built with an RNN, which predicts the label distribution (a probability list over the real results) of each feature vector in the feature sequence; the error of the recurrent layer is back-propagated and finally converted into a feature sequence that is fed back to the convolutional layer, and a custom network layer can be defined to act as the bridge connecting the convolutional layer and the recurrent layer. In the transcription layer part, a CTC model, usually connected at the last layer of the RNN for sequence learning and training, can be used to integrate all possible results of the predicted label sequence into a final result. For an input sequence of length T, the last layer of the RNN outputs a softmax vector at each sample point, representing the prediction probability at that point (T is generally far larger than the length of the final label sequence); after the probabilities of all sample points are passed to the CTC model, the most probable labels are output, and the final sequence label is obtained by removing blanks and de-duplicating.
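The blank-removal and de-duplication step can be illustrated with a minimal greedy CTC decoder such as the sketch below; the alphabet and the blank index are assumptions.

import numpy as np

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"   # index 0 is reserved for the CTC blank

def ctc_greedy_decode(log_probs, blank=0):
    # log_probs: (T, num_classes) output of the recognition model for one image
    best = log_probs.argmax(axis=1)                  # most probable label at each time step
    out, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:             # drop blanks, collapse repeated labels
            out.append(ALPHABET[idx - 1])
        prev = idx
    return "".join(out)

# e.g. per-step argmax [blank, 5, 5, blank, 12, 12] decodes to "4b"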
205. When an image text detection request is received, inputting the image requested to be detected into the text region detection model, and determining the position information of the text region corresponding to the image.
It can be understood that each printed sample image has a corresponding output file through the text region detection model, the output file stores the position information of all candidate text lines in the image and whether the candidate text lines are labels of the text regions, the candidate text lines are equivalent to vertical strip lines split from the text regions, the candidate text lines are connected to form the text regions in the image based on a text line construction algorithm, and the position information of the text regions corresponding to the image is determined by combining the position information of each candidate text line.
206. And inputting the position information of the text region corresponding to the image and the image requested to be detected into the text recognition model together to obtain the text information in the image.
Further, as a specific implementation of the method shown in fig. 1, an embodiment of the present invention provides an apparatus for recognizing a text in an image, where as shown in fig. 3, the apparatus includes: an acquisition unit 31, a training unit 32, a determination unit 33, and a recognition unit 34.
An acquiring unit 31, which may be configured to acquire a text sample image of the pin-like print font after the scenarization processing;
the training unit 32 may be configured to input the text sample images with the print-like font as training data into network models with different architectures, respectively, for training, so as to obtain a text region detection model and a text recognition model;
a determining unit 33, configured to, when receiving an image text detection request, input the image requested to be detected into the text region detection model, and determine position information of a text region corresponding to the image;
the identifying unit 34 may be configured to input the position information of the text region corresponding to the image and the image requested to be detected to the text recognition model together, so as to obtain text information in the image.
According to the device for recognizing text in an image provided by the embodiment of the invention, a text sample image of a needle-like printing font after scenarization processing is acquired, and because the scenarized text sample image covers richer picture features, the text region detection model and the text recognition model obtained by training have stronger scene recognition capability, so that text information in images of different scenes can be recognized in the process of recognizing text in an image. Compared with prior-art methods for recognizing text in an image, the sample data collected from the actual scene is augmented without consuming a large amount of labor to collect samples, which simplifies the sample collection process and saves the time needed to label sample data; moreover, the model trained with the augmented sample data fits the actual scene well, which improves the accuracy of recognizing text in the image.
As a further description of the device for recognizing a text in an image shown in fig. 3, fig. 4 is a schematic structural diagram of another device for recognizing a text in an image according to an embodiment of the present invention, and as shown in fig. 4, the obtaining unit 31 includes:
a setting module 311, configured to obtain a print sample image generated by using a printing method, and set an attribute value corresponding to the print sample image;
the processing module 312 may be configured to perform a scenization process on the print sample image by changing the attribute value corresponding to the pixel in the print sample image, so as to obtain a text sample image of the needle-like print body.
Further, the processing module 312 includes:
a determining sub-module 3121 configured to determine an optimal threshold for dividing the color attribute values corresponding to the pixels in the print sample image by using a maximum inter-class variance method;
the first processing sub-module 3122 is configured to perform binarization processing on the print sample image by using the optimal threshold as a division basis, so as to obtain background pixels and foreground pixels of the print sample image after binarization processing;
the dividing submodule 3123 may be configured to divide the background pixels of the print sample image after the binarization processing into a plurality of background portions according to a preset ratio;
the second processing sub-module 3124 may be configured to perform scene processing on the print sample image according to the pixel values of the parameters corresponding to each background portion, so as to obtain a text sample image of the needle-like print body.
Further, the determining sub-module 3121 may be specifically configured to divide the color attribute values corresponding to the pixels in the print sample image into two groups by using an assumed gray value and calculate the inter-class variance, where one group of color attribute values is greater than the assumed gray value and the other group is not greater than the assumed gray value;
the determining sub-module 3121 may be further configured to determine, as the optimal threshold of the color attribute value, the assumed gray value when the inter-class variance is maximum by changing the assumed gray value.
Further, the second processing sub-module 3124 may be specifically configured to obtain a text sample image of a quasi-pin print font after a scene with increased contrast by adjusting pixel values of corresponding contrasts of the background portions, so that the text sample image covers scenes with different contrasts;
the second processing sub-module 3124 may be further configured to perform blurring processing on the pixel values of the parameters corresponding to the background portions to obtain a text sample image with increased blurring effect and similar to the pin print font, so that the text sample image covers a scene with blurring effect.
Further, the training unit 32 includes:
the first training module 321 may be configured to label the position information of the text region in the text sample image of the print-like font and input the labeled position information into a first network model for training to obtain a text region detection model;
the second training module 322 may be configured to label text information in a text region in the text sample image of the print-like font and input the labeled text information to the second network model for training, so as to obtain a text recognition model.
Further, the first network model includes a multi-layer structure,
the first training module 321 may be specifically configured to extract, through the convolution layer of the first network model, image region features corresponding to the text sample image of the pin-printing-like font;
the first training module 321 may be further specifically configured to generate, by using a decoding layer of the first network model, a horizontal text sequence feature according to an image region feature corresponding to a text sample image;
the first training module 321 may be further configured to determine a text region in the text sample image according to the horizontal text sequence feature through a prediction layer of the first network model, and process the text region to obtain a candidate text line.
It should be noted that other corresponding descriptions of the functional units related to the device for recognizing a text in an image provided in this embodiment may refer to the corresponding descriptions in fig. 1 and fig. 2, and are not described herein again.
Based on the above methods shown in fig. 1 and fig. 2, correspondingly, the present embodiment further provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the method for recognizing text in the image shown in fig. 1 and fig. 2.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the implementation scenarios of the present application.
Based on the method shown in fig. 1 and fig. 2 and the virtual device embodiment shown in fig. 3 and fig. 4, in order to achieve the above object, an embodiment of the present application further provides a computer device, which may specifically be a personal computer, a server, a network device, and the like, where the entity device includes a storage medium and a processor; a storage medium for storing a computer program; a processor for executing a computer program to implement the method for recognizing text in images as shown in fig. 1 and fig. 2.
Optionally, the computer device may also include a user interface, a network interface, a camera, Radio Frequency (RF) circuitry, sensors, audio circuitry, a WI-FI module, and so forth. The user interface may include a Display screen (Display), an input unit such as a keypad (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a bluetooth interface, WI-FI interface), etc.
Those skilled in the art will appreciate that the physical device structure of the text recognition device in the image provided in the present embodiment does not constitute a limitation to the physical device, and may include more or less components, or combine some components, or arrange different components.
The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the computer device described above, supporting the operation of information handling programs and other software and/or programs. The network communication module is used for realizing communication among components in the storage medium and other hardware and software in the entity device.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general-purpose hardware platform, or by hardware. Compared with the prior art, the method and apparatus expand the sample data collected from the actual scene, so that a large amount of labor is not required to collect samples; the sample collection process is simplified, the time for labeling sample data is saved, the model trained on the expanded sample data can fit the actual scene well, and the accuracy of text recognition in images is improved.
Those skilled in the art will appreciate that the drawings are merely schematic diagrams of a preferred implementation scenario, and that the modules or processes in the drawings are not necessarily required to practice the present application. Those skilled in the art will also appreciate that the modules in the devices of the implementation scenario may be distributed among the devices as described, or, with corresponding changes, may be located in one or more devices different from those of the present implementation scenario. The modules of the above implementation scenario may be combined into one module, or further split into multiple sub-modules.
The above serial numbers are for description only and do not represent the relative merits of the implementation scenarios. The above disclosure describes only a few specific implementation scenarios of the present application; the present application is not limited thereto, however, and any variation conceivable to those skilled in the art shall fall within the protection scope of the present application.

Claims (10)

1. A method for recognizing text in an image, the method comprising:
acquiring a text sample image in a pin-printing-style font that has undergone scenario processing;
respectively inputting the text sample images in the pin-printing-style font, as training data, into network models of different architectures for training, to obtain a text region detection model and a text recognition model;
when an image text detection request is received, inputting the image requested to be detected into the text region detection model, and determining position information of the text region corresponding to the image;
and inputting the position information of the text region corresponding to the image, together with the image requested to be detected, into the text recognition model, to obtain the text information in the image.
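For illustration only, the detection-then-recognition flow of claim 1 can be sketched as follows; the helper functions and call signatures are assumptions introduced for the example and are not APIs defined by the present application:

```python
# detection_model and recognition_model stand in for the trained text region
# detection model and text recognition model of claim 1.

def recognize_text_in_image(image, detection_model, recognition_model):
    # Step 1: the detection model returns position information (boxes) of the text regions.
    text_boxes = detection_model(image)
    # Step 2: each box, together with the corresponding image content, is fed
    # to the recognition model to obtain the text information.
    results = []
    for box in text_boxes:
        x1, y1, x2, y2 = box
        crop = image[y1:y2, x1:x2]   # region indicated by the position information (NumPy-style indexing assumed)
        results.append((box, recognition_model(crop)))
    return results
```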
2. The method according to claim 1, wherein the acquiring a text sample image in a pin-printing-style font that has undergone scenario processing specifically comprises:
acquiring a print sample image generated in a printing manner, and setting attribute values corresponding to the print sample image;
and performing scenario processing on the print sample image by changing the attribute values corresponding to the pixels in the print sample image, to obtain the text sample image in the pin-printing-style font.
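For illustration only, generating a print sample image "in a printing manner" can be sketched by rendering text onto a clean background, as below; the font, colors, and sizes are illustrative assumptions rather than values required by claim 2:

```python
from PIL import Image, ImageDraw, ImageFont

def make_print_sample(text, size=(320, 48)):
    """Render a print sample image and set its attribute values
    (background and foreground colors); values are illustrative."""
    img = Image.new("RGB", size, color=(255, 255, 255))   # attribute: background color
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()                        # stand-in for a pin-printing-style font
    draw.text((4, 12), text, fill=(0, 0, 0), font=font)    # attribute: foreground color
    return img

sample = make_print_sample("INVOICE NO 2020-0117")
```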
3. The method according to claim 2, wherein the performing scenario processing on the print sample image by changing the attribute values corresponding to the pixels in the print sample image, to obtain the text sample image in the pin-printing-style font, specifically comprises:
determining, by using the maximum inter-class variance method, an optimal threshold for dividing the color attribute values corresponding to the pixels in the print sample image;
performing binarization processing on the print sample image with the optimal threshold as the dividing basis, to obtain the background pixels and foreground pixels of the binarized print sample image;
dividing the background pixels of the binarized print sample image into a plurality of background parts according to a preset proportion;
and performing scenario processing on the print sample image according to the pixel values of the parameters corresponding to each background part, to obtain the text sample image in the pin-printing-style font.
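For illustration only, the binarization and background-partition steps of claim 3 might be sketched as follows, assuming a light background (so that pixels above the threshold count as background) and an illustrative column-wise split and preset proportion:

```python
import numpy as np

def split_background(gray, threshold, ratios=(0.5, 0.3, 0.2)):
    """Binarize with the optimal threshold and divide the background pixels
    into several parts by a preset proportion; split rule and ratios are illustrative."""
    binary = gray > threshold                        # True = background, False = foreground (light background assumed)
    bg_rows, bg_cols = np.nonzero(binary)
    order = np.argsort(bg_cols)                      # walk background pixels left to right
    bounds = np.cumsum(np.array(ratios) * len(order)).astype(int)
    coords = np.stack([bg_rows[order], bg_cols[order]], axis=1)
    parts = np.split(coords, bounds[:-1])            # each part: (row, col) indices of one background portion
    return binary, parts
```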
4. The method according to claim 3, wherein the determining, by using the maximum inter-class variance method, an optimal threshold for dividing the color attribute values corresponding to the pixels in the print sample image specifically comprises:
dividing the color attribute values corresponding to the pixels in the print sample image into two groups by using an assumed gray value and calculating the inter-class variance, wherein one group contains the color attribute values greater than the assumed gray value and the other group contains the color attribute values not greater than the assumed gray value;
and, by varying the assumed gray value, determining the assumed gray value at which the inter-class variance reaches its maximum as the optimal threshold for the color attribute values.
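For illustration only, the threshold search described in claim 4 (the maximum inter-class variance, i.e. Otsu, criterion) can be sketched in NumPy as follows; variable names are illustrative:

```python
import numpy as np

def otsu_threshold(gray):
    """For every assumed gray value, split the gray levels into a group greater
    than the value and a group not greater than it, compute the inter-class
    variance, and keep the value that maximizes it."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(256):                               # the assumed gray value
        w0, w1 = prob[:t + 1].sum(), prob[t + 1:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(0, t + 1) * prob[:t + 1]).sum() / w0
        mu1 = (np.arange(t + 1, 256) * prob[t + 1:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2       # inter-class variance
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```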
5. The method according to claim 3, wherein the performing scenario processing on the print sample image according to the pixel values of the parameters corresponding to each background part, to obtain the text sample image in the pin-printing-style font, specifically comprises:
adjusting the contrast-related pixel values of each background part to obtain a text sample image in the pin-printing-style font under a scenario with increased contrast, so that the text sample images cover scenarios with different contrasts;
and performing blur processing on the pixel values of the parameters corresponding to each background part to obtain a text sample image in the pin-printing-style font with an added blur effect, so that the text sample images cover scenarios with a blur effect.
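For illustration only, the contrast and blur scenario processing of claim 5 might be sketched as below; the use of OpenCV, the contrast factors, and the kernel size are illustrative assumptions:

```python
import cv2
import numpy as np

def scenario_augment(print_img, background_mask):
    """Vary the contrast of the background parts and add a blur effect so the
    generated samples cover scenarios with different contrasts and with blurring."""
    samples = []
    for alpha in (0.6, 1.0, 1.4):                          # contrast-related pixel adjustment
        adjusted = print_img.copy()
        rescaled = cv2.convertScaleAbs(print_img, alpha=alpha, beta=0)
        adjusted[background_mask] = rescaled[background_mask]   # apply only to the background part
        samples.append(adjusted)
    samples.append(cv2.GaussianBlur(print_img, (5, 5), 0))  # sample with an added blur effect
    return samples
```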
6. The method according to any one of claims 1 to 5, wherein the respectively inputting the text sample images in the pin-printing-style font, as training data, into network models of different architectures for training, to obtain a text region detection model and a text recognition model, specifically comprises:
labeling the position information of the text regions in the text sample image in the pin-printing-style font, and inputting the labeled position information into a first network model for training, to obtain the text region detection model;
and labeling the text information in the text regions of the text sample image in the pin-printing-style font, and inputting the labeled text information into a second network model for training, to obtain the text recognition model.
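For illustration only, the second network model of claim 6 (the text recognition model) could be sketched as a CRNN-style network trained with CTC; this architecture choice and all dimensions are assumptions, since the claim does not fix a particular second network model:

```python
import torch
import torch.nn as nn

class TextRecognizer(nn.Module):
    """Illustrative second network model: convolutional features followed by a
    recurrent layer and a per-time-step character classifier (CRNN-style)."""

    def __init__(self, num_classes):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        self.rnn = nn.LSTM(128 * 8, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):                      # x: (N, 1, 32, W) grayscale text-line image
        feat = self.conv(x)                    # (N, 128, 8, W/2)
        n, c, h, w = feat.shape
        seq = feat.permute(0, 3, 1, 2).reshape(n, w, c * h)
        seq, _ = self.rnn(seq)
        return self.fc(seq)                    # per-column character logits, e.g. for a CTC loss
```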
7. The method according to claim 6, wherein the first network model comprises a multi-layer structure, and the inputting the labeled position information of the text regions in the text sample image in the pin-printing-style font into the first network model for training, to obtain the text region detection model, specifically comprises:
extracting, through the convolution layer of the first network model, the image region features corresponding to the text sample image in the pin-printing-style font;
generating, through a decoding layer of the first network model, horizontal text sequence features from the image region features corresponding to the text sample image;
and determining, through a prediction layer of the first network model, the text regions in the text sample image according to the horizontal text sequence features, and processing the text regions to obtain candidate text lines.
8. An apparatus for recognizing text in an image, the apparatus comprising:
an acquiring unit, configured to acquire a text sample image in a pin-printing-style font that has undergone scenario processing;
a training unit, configured to respectively input the text sample images in the pin-printing-style font, as training data, into network models of different architectures for training, to obtain a text region detection model and a text recognition model;
a determining unit, configured to, when an image text detection request is received, input the image requested to be detected into the text region detection model, and determine position information of the text region corresponding to the image;
and a recognition unit, configured to input the position information of the text region corresponding to the image, together with the image requested to be detected, into the text recognition model, to obtain the text information in the image.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer storage medium on which a computer program is stored, characterized in that the computer program, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010051888.7A 2020-01-17 2020-01-17 Method and device for recognizing text in image, computer equipment and computer storage medium Pending CN111291629A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010051888.7A CN111291629A (en) 2020-01-17 2020-01-17 Method and device for recognizing text in image, computer equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010051888.7A CN111291629A (en) 2020-01-17 2020-01-17 Method and device for recognizing text in image, computer equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN111291629A true CN111291629A (en) 2020-06-16

Family

ID=71023142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010051888.7A Pending CN111291629A (en) 2020-01-17 2020-01-17 Method and device for recognizing text in image, computer equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN111291629A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111798542A (en) * 2020-09-10 2020-10-20 北京易真学思教育科技有限公司 Model training method, data processing device, model training apparatus, and storage medium
CN111950356A (en) * 2020-06-30 2020-11-17 深圳市雄帝科技股份有限公司 Seal text positioning method and device and electronic equipment
CN112163577A (en) * 2020-09-22 2021-01-01 广州博冠信息科技有限公司 Character recognition method and device in game picture, electronic equipment and storage medium
CN112232340A (en) * 2020-10-15 2021-01-15 马婧 Method and device for identifying printed information on surface of object
CN112287969A (en) * 2020-09-25 2021-01-29 浪潮金融信息技术有限公司 Character sample collecting and processing method, self-service terminal equipment and independent module
CN112418206A (en) * 2020-11-20 2021-02-26 平安普惠企业管理有限公司 Picture classification method based on position detection model and related equipment thereof
CN112541443A (en) * 2020-12-16 2021-03-23 平安科技(深圳)有限公司 Invoice information extraction method and device, computer equipment and storage medium
CN112580495A (en) * 2020-12-16 2021-03-30 上海眼控科技股份有限公司 Text recognition method and device, computer equipment and storage medium
CN112966841A (en) * 2021-03-18 2021-06-15 深圳闪回科技有限公司 Offline automatic order examining system
CN112989786A (en) * 2021-01-18 2021-06-18 平安国际智慧城市科技股份有限公司 Document analysis method, system, device and storage medium based on image recognition
CN112990212A (en) * 2021-02-05 2021-06-18 开放智能机器(上海)有限公司 Reading method and device of thermal imaging temperature map, electronic equipment and storage medium
CN113298001A (en) * 2021-06-02 2021-08-24 上海大学 System and method for identifying and recommending shops along street based on vehicle-mounted camera shooting
CN113743438A (en) * 2020-08-20 2021-12-03 北京沃东天骏信息技术有限公司 Method, device and system for generating data set for text detection
CN114612915A (en) * 2022-05-12 2022-06-10 青岛美迪康数字工程有限公司 Method and device for extracting patient information of film image
CN115205164A (en) * 2022-09-15 2022-10-18 腾讯科技(深圳)有限公司 Training method of image processing model, video processing method, device and equipment

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950356A (en) * 2020-06-30 2020-11-17 深圳市雄帝科技股份有限公司 Seal text positioning method and device and electronic equipment
CN111950356B (en) * 2020-06-30 2024-04-19 深圳市雄帝科技股份有限公司 Seal text positioning method and device and electronic equipment
CN113743438A (en) * 2020-08-20 2021-12-03 北京沃东天骏信息技术有限公司 Method, device and system for generating data set for text detection
CN111798542B (en) * 2020-09-10 2020-12-22 北京易真学思教育科技有限公司 Model training method, data processing device, model training apparatus, and storage medium
CN111798542A (en) * 2020-09-10 2020-10-20 北京易真学思教育科技有限公司 Model training method, data processing device, model training apparatus, and storage medium
CN112163577A (en) * 2020-09-22 2021-01-01 广州博冠信息科技有限公司 Character recognition method and device in game picture, electronic equipment and storage medium
CN112163577B (en) * 2020-09-22 2022-10-11 广州博冠信息科技有限公司 Character recognition method and device in game picture, electronic equipment and storage medium
CN112287969A (en) * 2020-09-25 2021-01-29 浪潮金融信息技术有限公司 Character sample collecting and processing method, self-service terminal equipment and independent module
CN112232340A (en) * 2020-10-15 2021-01-15 马婧 Method and device for identifying printed information on surface of object
CN112418206B (en) * 2020-11-20 2024-02-27 上海昇晔网络科技有限公司 Picture classification method based on position detection model and related equipment thereof
CN112418206A (en) * 2020-11-20 2021-02-26 平安普惠企业管理有限公司 Picture classification method based on position detection model and related equipment thereof
CN112541443B (en) * 2020-12-16 2024-05-10 平安科技(深圳)有限公司 Invoice information extraction method, invoice information extraction device, computer equipment and storage medium
CN112580495A (en) * 2020-12-16 2021-03-30 上海眼控科技股份有限公司 Text recognition method and device, computer equipment and storage medium
CN112541443A (en) * 2020-12-16 2021-03-23 平安科技(深圳)有限公司 Invoice information extraction method and device, computer equipment and storage medium
CN112989786B (en) * 2021-01-18 2023-08-18 平安国际智慧城市科技股份有限公司 Document analysis method, system, device and storage medium based on image recognition
CN112989786A (en) * 2021-01-18 2021-06-18 平安国际智慧城市科技股份有限公司 Document analysis method, system, device and storage medium based on image recognition
CN112990212A (en) * 2021-02-05 2021-06-18 开放智能机器(上海)有限公司 Reading method and device of thermal imaging temperature map, electronic equipment and storage medium
CN112966841A (en) * 2021-03-18 2021-06-15 深圳闪回科技有限公司 Offline automatic order examining system
CN113298001A (en) * 2021-06-02 2021-08-24 上海大学 System and method for identifying and recommending shops along street based on vehicle-mounted camera shooting
CN114612915A (en) * 2022-05-12 2022-06-10 青岛美迪康数字工程有限公司 Method and device for extracting patient information of film image
CN115205164A (en) * 2022-09-15 2022-10-18 腾讯科技(深圳)有限公司 Training method of image processing model, video processing method, device and equipment

Similar Documents

Publication Publication Date Title
CN111291629A (en) Method and device for recognizing text in image, computer equipment and computer storage medium
CN110232311B (en) Method and device for segmenting hand image and computer equipment
US20190180154A1 (en) Text recognition using artificial intelligence
Zhang et al. Ensnet: Ensconce text in the wild
CN110647829A (en) Bill text recognition method and system
CN107403130A (en) A kind of character identifying method and character recognition device
RU2721187C1 (en) Teaching language models using text corpuses containing realistic errors of optical character recognition (ocr)
CN111414906A (en) Data synthesis and text recognition method for paper bill picture
US20230206487A1 (en) Detection and identification of objects in images
He et al. Historical manuscript dating based on temporal pattern codebook
CN107368827A (en) Character identifying method and device, user equipment, server
CN110443235B (en) Intelligent paper test paper total score identification method and system
CN113158977B (en) Image character editing method for improving FANnet generation network
CN109598185A (en) Image recognition interpretation method, device, equipment and readable storage medium storing program for executing
CN112348028A (en) Scene text detection method, correction method, device, electronic equipment and medium
CN113436222A (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN113537189A (en) Handwritten character recognition method, device, equipment and storage medium
CN116030453A (en) Digital ammeter identification method, device and equipment
CN104915641B (en) The method that facial image light source orientation is obtained based on Android platform
CN114882204A (en) Automatic ship name recognition method
CN110796145A (en) Multi-certificate segmentation association method based on intelligent decision and related equipment
CN111311602A (en) Lip image segmentation device and method for traditional Chinese medicine facial diagnosis
CN113780116A (en) Invoice classification method and device, computer equipment and storage medium
De Nardin et al. Few-shot pixel-precise document layout segmentation via dynamic instance generation and local thresholding
CN110633666A (en) Gesture track recognition method based on finger color patches

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20220531
Address after: 518000 China Aviation Center 2901, No. 1018, Huafu Road, Huahang community, Huaqiang North Street, Futian District, Shenzhen, Guangdong Province
Applicant after: Shenzhen Ping An medical and Health Technology Service Co.,Ltd.
Address before: Room 12G, Area H, 666 Beijing East Road, Huangpu District, Shanghai 200001
Applicant before: PING AN MEDICAL AND HEALTHCARE MANAGEMENT Co.,Ltd.