WO2021218706A1 - Text identification method and apparatus, device, and storage medium


Info

Publication number
WO2021218706A1
WO2021218706A1 PCT/CN2021/088389 CN2021088389W WO2021218706A1 WO 2021218706 A1 WO2021218706 A1 WO 2021218706A1 CN 2021088389 W CN2021088389 W CN 2021088389W WO 2021218706 A1 WO2021218706 A1 WO 2021218706A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
text
training
neural network
text image
Prior art date
Application number
PCT/CN2021/088389
Other languages
French (fr)
Chinese (zh)
Inventor
王文佳
刘学博
谢恩泽
Original Assignee
北京市商汤科技开发有限公司
Priority date
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司
Priority to JP2022520075A (publication JP2022550195A)
Publication of WO2021218706A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

Embodiments of the present disclosure provide a text recognition method and apparatus, a device, and a storage medium. A feature map of a first text image is acquired, and the first text image is processed according to at least one feature sequence included in the feature map to obtain a second text image whose resolution is greater than that of the first text image. Because the image blocks in the first text image are correlated with one another, this approach effectively exploits the correlations between pieces of text to restore the lower-resolution first text image into the higher-resolution second text image; text recognition is then performed on the second text image to recognize the text content of the first text image.

Description

Text recognition method, apparatus, device, and storage medium
Cross-reference to related applications
The present disclosure claims priority to Chinese patent application No. 202010362519X, filed on April 30, 2020 and entitled "Text recognition method, apparatus, device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates to the field of computer vision technology, and in particular to a text recognition method, apparatus, device, and storage medium.
Background
Low-resolution text images are very common in daily life. For example, text images captured by terminal devices equipped with image acquisition devices, such as mobile phones, may have low resolution. Such images lose detailed content information, which leads to low accuracy when recognizing the text they contain. Traditional text recognition approaches generally first reconstruct the texture of the image and then perform text recognition on the reconstructed image. However, the recognition accuracy of this approach is low.
Summary
The present disclosure provides a text recognition method, apparatus, device, and storage medium.
According to a first aspect of the embodiments of the present disclosure, a text recognition method is provided. The method includes: acquiring a feature map of a first text image, where the feature map includes at least one feature sequence, and the feature sequence is used to represent the correlation between at least two image blocks in the first text image; processing the first text image according to the at least one feature sequence to obtain a second text image, where the resolution of the second text image is greater than the resolution of the first text image; and performing text recognition on the second text image.
In some embodiments, acquiring the feature map of the first text image includes: acquiring a plurality of channel maps of the first text image and a binary image corresponding to the first text image; and performing feature extraction on the plurality of channel maps and the binary image to obtain the feature map of the first text image.
In some embodiments, acquiring the feature map of the first text image includes: inputting the first text image into a pre-trained neural network, and acquiring the feature map output by the neural network.
In some embodiments, the neural network acquires the feature map in the following manner: generating an intermediate image according to the first text image, where the number of channels of the intermediate image is greater than the number of channels of the first text image; and performing feature extraction on the intermediate image to obtain the feature map.
In some embodiments, the neural network includes at least one convolutional neural network and a bidirectional long short-term memory network, where the output of the at least one convolutional neural network is connected to the input of the bidirectional long short-term memory network. Acquiring the feature map of the first text image includes: inputting the first text image into the at least one convolutional neural network and acquiring an intermediate image output by the at least one convolutional neural network; and inputting the intermediate image into the bidirectional long short-term memory network and acquiring the feature map output by the bidirectional long short-term memory network.
In some embodiments, the neural network includes a plurality of sub-networks connected in sequence. Inputting the first text image into the pre-trained neural network and acquiring the feature map output by the neural network includes: inputting the i-th output image output by the i-th sub-network of the plurality of sub-networks into the (i+1)-th sub-network of the plurality of sub-networks, so that the (i+1)-th sub-network generates an (i+1)-th intermediate image and performs feature extraction on the (i+1)-th intermediate image to obtain an (i+1)-th output image, where the number of channels of the (i+1)-th intermediate image is greater than the number of channels of the i-th output image; and determining the N-th output image as the feature map. Here, i and N are positive integers, N is the total number of sub-networks, 1 ≤ i ≤ N-1, and N ≥ 2. The first output image is obtained as follows: the first sub-network generates a first intermediate image according to the first text image, and performs feature extraction on the first intermediate image to obtain the first output image.
In some embodiments, the method further includes: before processing the first text image according to the at least one feature sequence, processing the first text image so that the number of channels of the first text image is the same as the number of channels of the feature map.
In some embodiments, the method further includes: after the second text image is obtained, processing the second text image so that the number of channels of the second text image is the same as the number of channels of the first text image. Performing text recognition on the second text image includes: performing text recognition on the processed second text image.
In some embodiments, the method further includes: training the neural network based on at least one set of training images, where each set of training images includes a first training image and a second training image, and the first training image and the second training image include the same text. The resolution of the first training image is less than a first resolution threshold, the resolution of the second training image is greater than a second resolution threshold, and the first resolution threshold is less than or equal to the second resolution threshold.
In some embodiments, training the neural network based on the at least one set of training images includes: inputting the first training image into the neural network and acquiring an output image of the neural network; determining a loss function based on the second training image corresponding to the first training image and the output image; and performing supervised training on the neural network based on the loss function.
In some embodiments, the loss function includes at least one of a first loss function and a second loss function. The first loss function is determined based on the mean square error of each pair of corresponding pixels in the first training image and the second training image; and/or the second loss function is determined based on the difference between the gradient fields of each pair of corresponding pixels in the first training image and the second training image.
In some embodiments, the method further includes: aligning the first training image and the second training image before training the neural network based on the at least one set of training images.
In some embodiments, aligning the first training image and the second training image includes: processing the first training image through a pre-trained spatial transformation network so as to align the text in the first training image with the text in the second training image.
In some embodiments, the first training image is obtained by a first image acquisition device configured with a first focal length photographing a subject at a first position; the second training image is obtained by a second image acquisition device configured with a second focal length photographing the subject at the first position; and the first focal length is smaller than the second focal length.
According to a second aspect of the embodiments of the present disclosure, a text recognition apparatus is provided. The apparatus includes: an acquisition module configured to acquire a feature map of a first text image, where the feature map includes at least one feature sequence, and the feature sequence is used to represent the correlation between at least two image blocks in the first text image; a first processing module configured to process the first text image according to the at least one feature sequence to obtain a second text image, where the resolution of the second text image is greater than the resolution of the first text image; and a text recognition module configured to perform text recognition on the second text image.
In some embodiments, the acquisition module includes: a first acquisition unit configured to acquire a plurality of channel maps of the first text image and a binary image corresponding to the first text image; and a feature extraction unit configured to perform feature extraction on the plurality of channel maps and the binary image to obtain the feature map of the first text image.
In some embodiments, the acquisition module is configured to: input the first text image into a pre-trained neural network, and acquire the feature map output by the neural network.
In some embodiments, the neural network acquires the feature map in the following manner: generating an intermediate image according to the first text image, where the number of channels of the intermediate image is greater than the number of channels of the first text image; and performing feature extraction on the intermediate image to obtain the feature map.
In some embodiments, the neural network includes at least one convolutional neural network and a bidirectional long short-term memory network, where the output of the at least one convolutional neural network is connected to the input of the bidirectional long short-term memory network. The acquisition module includes: a second acquisition unit configured to input the first text image into the at least one convolutional neural network and acquire an intermediate image output by the at least one convolutional neural network; and a third acquisition unit configured to input the intermediate image into the bidirectional long short-term memory network and acquire the feature map output by the bidirectional long short-term memory network.
In some embodiments, the neural network includes a plurality of sub-networks connected in sequence. The acquisition module is configured to: input the i-th output image output by the i-th sub-network of the plurality of sub-networks into the (i+1)-th sub-network of the plurality of sub-networks, so that the (i+1)-th sub-network generates an (i+1)-th intermediate image and performs feature extraction on the (i+1)-th intermediate image to obtain an (i+1)-th output image, where the number of channels of the (i+1)-th intermediate image is greater than the number of channels of the i-th output image; and determine the N-th output image as the feature map. Here, i and N are positive integers, N is the total number of sub-networks, 1 ≤ i ≤ N-1, and N ≥ 2. The first output image is obtained as follows: the first sub-network generates a first intermediate image according to the first text image, and performs feature extraction on the first intermediate image to obtain the first output image.
In some embodiments, the apparatus further includes: a second processing module configured to, before the first text image is processed according to the at least one feature sequence, process the first text image so that the number of channels of the first text image is the same as the number of channels of the feature map.
In some embodiments, the apparatus further includes: a third processing module configured to, after the second text image is obtained, process the second text image so that the number of channels of the second text image is the same as the number of channels of the first text image. The text recognition module is configured to: perform text recognition on the processed second text image.
In some embodiments, the apparatus further includes: a training module configured to train the neural network based on at least one set of training images, where each set of training images includes a first training image and a second training image, and the first training image and the second training image include the same text. The resolution of the first training image is less than a first resolution threshold, the resolution of the second training image is greater than a second resolution threshold, and the first resolution threshold is less than or equal to the second resolution threshold.
In some embodiments, the training module includes: an input unit configured to input the first training image into the neural network and acquire an output image of the neural network; a determining unit configured to determine a loss function based on the second training image corresponding to the first training image and the output image; and a training unit configured to perform supervised training on the neural network based on the loss function.
In some embodiments, the loss function includes at least one of a first loss function and a second loss function. The first loss function is determined based on the mean square error of each pair of corresponding pixels in the first training image and the second training image; and/or the second loss function is determined based on the difference between the gradient fields of each pair of corresponding pixels in the first training image and the second training image.
In some embodiments, the apparatus further includes: an alignment module configured to align the first training image and the second training image before the neural network is trained based on the at least one set of training images.
In some embodiments, the alignment module is configured to: process the first training image through a pre-trained spatial transformation network so as to align the text in the first training image with the text in the second training image.
In some embodiments, the first training image is obtained by a first image acquisition device configured with a first focal length photographing a subject at a first position; the second training image is obtained by a second image acquisition device configured with a second focal length photographing the subject at the first position; and the first focal length is smaller than the second focal length.
According to a third aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, where the program, when executed by a processor, implements the method described in any of the embodiments.
According to a fourth aspect of the embodiments of the present disclosure, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the method described in any of the embodiments.
According to a fifth aspect of the embodiments of the present disclosure, a computer program is provided, where the computer program, when executed by a processor, implements the method described in any of the embodiments.
In the embodiments of the present disclosure, a feature map of a first text image is acquired, and the first text image is processed according to at least one feature sequence included in the feature map to obtain a second text image whose resolution is greater than that of the first text image. Because the image blocks in the first text image are correlated with one another, the above approach effectively exploits the correlations between pieces of text to restore the lower-resolution first text image into the higher-resolution second text image; text recognition is then performed on the second text image to recognize the text content in the first text image, which improves the accuracy of text recognition.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory and do not limit the present disclosure.
附图说明Description of the drawings
此处的附图被并入说明书中并构成本说明书的一部分,这些附图示出了符合本公开的实施例,并与说明书一起用于说明本公开的技术方案。The drawings herein are incorporated into the specification and constitute a part of the specification. These drawings illustrate embodiments that conform to the present disclosure, and are used together with the specification to explain the technical solutions of the present disclosure.
图1A是本公开实施例的文本图像的示意图一。Fig. 1A is a first schematic diagram of a text image according to an embodiment of the present disclosure.
图1B是本公开实施例的文本图像的示意图二。Fig. 1B is a second schematic diagram of a text image according to an embodiment of the present disclosure.
图1C是本公开实施例的文本图像的示意图三。Fig. 1C is a third schematic diagram of a text image according to an embodiment of the present disclosure.
图2是本公开实施例的文本识别方法的流程图。Fig. 2 is a flowchart of a text recognition method according to an embodiment of the present disclosure.
图3是本公开实施例的图像之间的不对齐现象的示意图。FIG. 3 is a schematic diagram of the misalignment phenomenon between images in an embodiment of the present disclosure.
图4是本公开实施例的文本识别方法的整体流程的示意图。FIG. 4 is a schematic diagram of the overall flow of a text recognition method according to an embodiment of the present disclosure.
图5是本公开实施例的文本识别装置的框图。Fig. 5 is a block diagram of a text recognition device according to an embodiment of the present disclosure.
图6是本公开实施例的计算机设备的结构示意图。Fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure.
Detailed description
Exemplary embodiments will be described in detail here, examples of which are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The terms used in the present disclosure are only for the purpose of describing specific embodiments and are not intended to limit the present disclosure. The singular forms "a", "said", and "the" used in the present disclosure and the appended claims are also intended to include plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more of the associated listed items. In addition, the term "at least one" herein means any one of a plurality of items, or any combination of at least two of a plurality of items.
It should be understood that although the present disclosure may use the terms first, second, third, and so on to describe various pieces of information, such information should not be limited to these terms. These terms are only used to distinguish pieces of information of the same type from one another. For example, without departing from the scope of the present disclosure, the first information may also be referred to as the second information, and similarly, the second information may also be referred to as the first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "while", or "in response to determining".
In order to enable those skilled in the art to better understand the embodiments of the present disclosure, and to make the above objectives, features, and advantages of the embodiments of the present disclosure more apparent and easier to understand, the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings.
In daily life, it is often necessary to recognize text information from text images, that is, to perform text recognition. Some text images (for example, text images captured by terminal devices equipped with image acquisition devices, such as mobile phones) have relatively low resolution. Such low-resolution images lose detailed content information, which leads to low accuracy when recognizing the text they contain. This problem is particularly serious for scene text images (Scene Text Image, STI). A scene text image is an image containing text information captured in a natural scene. The text information in a scene text image may include, but is not limited to, at least one of an ID card number, a ticket, a billboard, a license plate, and the like. Examples of text information are shown in Figs. 1A to 1C. Because the characteristics of the text in different scene text images differ greatly (for example, the text size, font, color, brightness, and/or degree of distortion may differ), text recognition on scene text images is much more difficult than recognizing text in scanned document images, so the recognition accuracy for scene text images is lower than that for printed text images.
Traditional text recognition approaches generally first exploit the color similarity of adjacent pixels in a text image, interpolating between the colors of adjacent pixels in a predefined way to reconstruct the texture of the text image, and then perform text recognition based on the reconstructed text image. This approach achieves relatively high recognition accuracy on clear text images, but its accuracy drops sharply on low-resolution text images. Based on this, embodiments of the present disclosure provide a text recognition method. As shown in Fig. 2, the method may include steps 201 to 203.
Step 201: acquire a feature map of a first text image, where the feature map includes at least one feature sequence, and the feature sequence is used to represent the correlation between at least two image blocks in the first text image.
Step 202: process the first text image according to the at least one feature sequence to obtain a second text image, where the resolution of the second text image is greater than the resolution of the first text image.
Step 203: perform text recognition on the second text image.
In step 201, the text in the first text image may include at least one of characters, symbols, and numbers. In some embodiments, the first text image may be an image captured in a natural scene, and the text in the first text image may be any of various types of text found in natural scenes. For example, the first text image may be an image of an ID card, and the text in the first text image is the ID number and name on the ID card. For another example, the first text image may be an image of a billboard, and the text in the first text image is a slogan on the billboard. In other embodiments, the first text image may also be an image containing printed text. In practical applications, the first text image may be a text image whose resolution is so low that the text recognition accuracy falls below a preset accuracy threshold.
The individual characters that make up a word or phrase, or the individual letters that make up a word, are not combined at random. For example, for the text "打*鼠", since "打地鼠" ("whack-a-mole") is a frequently occurring phrase, there is a high probability that "*" is "地". Inferring text content from context in this way exploits the correlation between pieces of text, and such correlation is often strong. Therefore, feature extraction may be performed on the first text image to obtain the feature map of the first text image. Specifically, feature extraction may be performed on the first text image in the horizontal direction and/or the vertical direction to obtain at least one feature sequence of the first text image. Each feature sequence may represent the correlation between at least two image blocks in the first text image.
In some embodiments, each pixel may be treated as an image block, and each element in the feature sequence may represent the correlation between adjacent pixels in the first text image. In other embodiments, several adjacent pixels may together be treated as one image block, and each element in the feature sequence may represent the correlation between adjacent image blocks in the first text image.
In many cases, the background of the first text image is a single color, and the background color generally differs from the text color. Therefore, the approximate position of the text in the first text image can be determined according to the binary image corresponding to the first text image. When the background color differs substantially from the text color, determining the text position through the binary image yields fairly accurate results. In addition, the color of the text in the first text image can be determined according to the channel maps of the first text image. Therefore, in some embodiments, a plurality of channel maps of the first text image and the binary image corresponding to the first text image may be acquired, and feature extraction may be performed on the plurality of channel maps and the binary image to obtain the feature map of the first text image.
The binary image may be obtained according to the average gray value of the first text image. Specifically, the average gray value over all pixels in the first text image may be computed; the gray value of pixels whose value is greater than the average gray value is set to a first gray value, and the gray value of pixels whose value is less than or equal to the average gray value is set to a second gray value, where the first gray value is greater than the second gray value. In some embodiments, the difference between the first gray value and the second gray value may be greater than a preset pixel value. For example, the first gray value may be 255 and the second gray value may be 0, so that each pixel in the binary image is either a black pixel or a white pixel. This increases the difference between the pixel values of background pixels and text pixels, making the localization of the text more accurate. The channel maps may be the channel maps of the R, G, and B channels of an RGB (Red Green Blue) image, or channel maps of other channels used to characterize the color of the image.
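As an illustration of this binarization step, the following is a minimal sketch and not the patent's reference implementation; it assumes the input is a grayscale array derived from the RGB image, and it uses the example gray values 255 and 0 mentioned above.

```python
import numpy as np

def binarize_by_mean(gray: np.ndarray, high: int = 255, low: int = 0) -> np.ndarray:
    """Threshold a grayscale image at its mean gray value.

    Pixels brighter than the mean become `high`, the rest become `low`,
    which roughly separates a single-color background from text pixels.
    """
    mean_gray = gray.mean()
    return np.where(gray > mean_gray, high, low).astype(np.uint8)

def to_four_channels(rgb: np.ndarray) -> np.ndarray:
    """Stack the R, G, B channel maps with the binary image (H x W x 4)."""
    gray = rgb.mean(axis=2)          # simple luminance proxy (assumption)
    mask = binarize_by_mean(gray)
    return np.dstack([rgb.astype(np.uint8), mask])
```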
In some embodiments, the first text image may be input into a pre-trained neural network, and the feature map output by the neural network may be acquired. The neural network may be a convolutional neural network (Convolutional Neural Network, CNN), a long short-term memory network (Long Short-Term Memory, LSTM), or another type of neural network, or a network composed of several kinds of neural networks. In some embodiments, a bidirectional long short-term memory network (Bidirectional Long Short-Term Memory, BLSTM) may be used to acquire the feature map, performing feature extraction on the first text image in the horizontal and vertical directions simultaneously, so as to improve the robustness of the reconstructed second text image.
The neural network may first generate an intermediate image according to the first text image, where the number of channels of the intermediate image is greater than the number of channels of the first text image, and then perform feature extraction on the intermediate image to obtain the feature map. Generating an intermediate image with more channels than the first text image enriches the features of the first text image, thereby improving the resolution of the reconstructed second text image. In practical applications, the neural network may include at least one convolutional neural network and one bidirectional long short-term memory network, where the convolutional neural networks are connected in sequence and the bidirectional long short-term memory network is connected to the last of the at least one convolutional neural network. The intermediate image may be generated by the at least one convolutional neural network, and feature extraction may be performed by the bidirectional long short-term memory network.
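The following PyTorch sketch illustrates one possible block of this kind: convolutions expand the channel count to produce the "intermediate image", and a BLSTM then scans the feature rows as sequences. The layer sizes and the row-wise scanning direction are assumptions made for illustration, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

class ConvBLSTMBlock(nn.Module):
    """Conv layers expand channels (the 'intermediate image'); a bidirectional
    LSTM then models correlations between positions along the image width."""

    def __init__(self, in_channels: int, hidden_channels: int):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, hidden_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(hidden_channels),
            nn.PReLU(),
            nn.Conv2d(hidden_channels, hidden_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(hidden_channels),
        )
        self.blstm = nn.LSTM(hidden_channels, hidden_channels // 2,
                             bidirectional=True, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.convs(x)                                  # B x C x H x W
        b, c, h, w = feat.shape
        seq = feat.permute(0, 2, 3, 1).reshape(b * h, w, c)   # one sequence per row
        seq, _ = self.blstm(seq)                              # feature sequences
        out = seq.reshape(b, h, w, c).permute(0, 3, 1, 2)
        return out + x if x.shape == out.shape else out       # residual when shapes match
```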
Further, the neural network may include a plurality of sub-networks connected in sequence, where the structure of each sub-network may be the same as that of the single neural network in the above embodiments, which will not be repeated here. Calling the sub-network in the i-th position from front to back the i-th sub-network, the i-th output image output by the i-th sub-network of the plurality of sub-networks may be input into the (i+1)-th sub-network of the plurality of sub-networks, so that the (i+1)-th sub-network generates an (i+1)-th intermediate image and performs feature extraction on the (i+1)-th intermediate image to obtain an (i+1)-th output image, where the number of channels of the (i+1)-th intermediate image is greater than the number of channels of the i-th output image; the N-th output image is determined as the feature map. Here, i and N are positive integers, N is the total number of sub-networks, 1 ≤ i ≤ N-1, and N ≥ 2. The first output image is obtained as follows: the first sub-network generates a first intermediate image according to the first text image, and performs feature extraction on the first intermediate image to obtain the first output image.
That is, the first sub-network generates a first intermediate image from the first text image, performs feature extraction on the first intermediate image to obtain a first output image, and inputs the first output image to the second sub-network, where the number of channels of the first intermediate image is greater than the number of channels of the first text image. The second sub-network generates a second intermediate image from the first output image, performs feature extraction on the second intermediate image to obtain a second output image, and inputs the second output image to the third sub-network, where the number of channels of the second intermediate image is greater than the number of channels of the first output image; and so on. Through multiple cascaded sub-networks, the features in the first text image can be fully extracted, further improving the resolution of the reconstructed second text image.
In step 202, based on the feature sequences, the first text image may be up-sampled using an up-sampling method such as pixel shuffle to obtain the second text image corresponding to the first text image. Further, if the number of channels of the feature map generated in step 201 is greater than the number of channels of the first text image, then in step 202, before the first text image is processed according to the at least one feature sequence, the first text image may also be processed so that its number of channels is the same as the number of channels of the feature map. The processed first text image is then processed according to the feature sequences in the feature map to obtain the second text image. In this step, the processing of the first text image to increase its number of channels may be implemented with a convolutional neural network; a sketch of this step is shown below.
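As a sketch of this step (shapes and fusion by addition are assumptions, not the patent's exact layers), a 1x1 convolution can first match the image's channel count to the feature map, and PixelShuffle can then trade channels for spatial resolution:

```python
import torch
import torch.nn as nn

class UpsampleBlock(nn.Module):
    """Fuse the feature map with the channel-expanded image and upscale by `scale`."""

    def __init__(self, image_channels: int, feature_channels: int, scale: int = 2):
        super().__init__()
        # match the image's channel count to the feature map before fusion
        self.expand = nn.Conv2d(image_channels, feature_channels, kernel_size=1)
        # produce scale^2 times the channels, then rearrange them into space
        self.conv = nn.Conv2d(feature_channels, feature_channels * scale ** 2,
                              kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, image: torch.Tensor, feature_map: torch.Tensor) -> torch.Tensor:
        fused = self.expand(image) + feature_map    # simple additive fusion (assumption)
        return self.shuffle(self.conv(fused))       # B x C x (scale*H) x (scale*W)
```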
On this basis, after the second text image is obtained, the second text image may also be processed so that its number of channels is the same as the number of channels of the first text image, that is, the second text image is restored to four channels. This process may also be implemented by a convolutional neural network.
In some embodiments, the neural network used in step 201 may be trained on multiple sets of training images, where each set of training images includes a first training image and a second training image containing the same text. The resolution of the first training image is less than a preset first resolution threshold, the resolution of the second training image is greater than a preset second resolution threshold, and the first resolution threshold is less than or equal to the second resolution threshold. The first training image may be called a low-resolution (Low Resolution, LR) image, and the second training image may be called a high-resolution (High Resolution, HR) image.
A text image data set may be established in advance. The text image data set may include a plurality of text image pairs, and each text image pair includes a low-resolution text image and a corresponding high-resolution text image. The text in the text image pairs may be text from various natural scenes, and the natural scenes may include, but are not limited to, at least one of streets, libraries, shops, vehicle interiors, and the like.
In other embodiments, the following networks may also be treated as one overall neural network and trained directly on the first training images and second training images: the neural network used for feature extraction to obtain the feature map, the convolutional neural network used before feature extraction to process the first text image and increase its number of channels, and the convolutional neural network used after the second text image is obtained to restore its number of channels.
Specifically, the first training image may be input into the neural network and the output image of the neural network acquired; a loss function is determined based on the second training image corresponding to the first training image and the output image; and supervised training is performed on the neural network based on the loss function.
The loss function may be any of various types of loss functions, or a combination of two or more loss functions. In some embodiments, the loss function includes at least one of a first loss function and a second loss function. The first loss function may be determined based on the mean square error of corresponding pixels in the first training image and the second training image; for example, it may be an L2 loss function. In other embodiments, the second loss function may be determined based on the difference between the gradient fields of corresponding pixels in the first training image and the second training image; for example, it may be a gradient profile loss (Gradient Profile Loss, GPL) function.
The gradient profile loss function $L_{GP}$ is defined as follows:
$$L_{GP} = \mathbb{E}_{x_0 \le x \le x_1}\left\lVert \nabla I_{HR}(x) - \nabla I_{SR}(x) \right\rVert_1$$
where $\nabla I_{HR}(x)$ denotes the gradient field of the HR image at pixel $x$, $\nabla I_{SR}(x)$ denotes the gradient field of the super-resolution image corresponding to the HR image (for example, the output image in Fig. 4) at pixel $x$, $x_0$ denotes the lower pixel bound, $x_1$ denotes the upper pixel bound, $\mathbb{E}$ denotes the computed energy, and the subscript 1 indicates that the L1 loss is computed.
The gradient field vividly reveals the text features and background features of a text image. In addition, an LR image always has a wider gradient field curve, while the gradient field curve of an HR image is narrower. After the gradient field of the HR image is obtained, the gradient field curve can be compressed to be narrower without complicated mathematical operations. Therefore, by adopting the gradient profile loss function, sharp boundaries between text features and background features can be reconstructed, which helps better distinguish text from background and produces clearer shapes, making the trained neural network more reliable.
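A minimal PyTorch sketch of a gradient profile loss of this kind is shown below; the finite-difference gradient operator is an assumption made for illustration, since the text above only specifies that the L1 distance between the two gradient fields is computed.

```python
import torch

def image_gradient(img: torch.Tensor) -> torch.Tensor:
    """Finite-difference gradient field (dx, dy) of a B x C x H x W image."""
    dx = img[..., :, 1:] - img[..., :, :-1]   # horizontal differences
    dy = img[..., 1:, :] - img[..., :-1, :]   # vertical differences
    # pad so both components keep the original spatial size
    dx = torch.nn.functional.pad(dx, (0, 1, 0, 0))
    dy = torch.nn.functional.pad(dy, (0, 0, 0, 1))
    return torch.cat([dx, dy], dim=1)

def gradient_profile_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """Mean L1 distance between the gradient fields of the SR output and the HR target."""
    return torch.mean(torch.abs(image_gradient(hr) - image_gradient(sr)))
```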
In traditional model training, low-resolution images are generally generated artificially by down-sampling high-resolution images (low-resolution images generated in this way are called synthetic low-resolution images), and the model is then trained on these synthetic low-resolution images. However, compared with such synthetic low-resolution images, real low-resolution images (low-resolution images caused by factors such as the shooting focal length) tend to have even lower resolution and to be more diverse. In addition, in many cases the text in text images has varied shapes, scattered lighting, and different backgrounds. Therefore, a model trained on synthetic low-resolution images cannot acquire the feature map of a text image well, resulting in low text recognition accuracy.
To solve the above problem, the first training images and second training images used in the embodiments of the present disclosure are both real images, that is, images captured at different focal lengths. The first training image is obtained by a first image acquisition device configured with a first focal length photographing a subject at a first position, and the second training image is obtained by a second image acquisition device configured with a second focal length photographing the subject at the same first position, where the first focal length is smaller than the second focal length. The first image acquisition device and the second image acquisition device may be the same image acquisition device or different image acquisition devices. In some embodiments, the first focal length may take a value between 24 mm and 120 mm, for example 70 mm. In other embodiments, the second focal length may take a value between 120 mm and 240 mm, for example 150 mm. Further, there may be multiple first focal lengths and multiple second focal lengths, and each of the multiple first focal lengths is smaller than the smallest of the multiple second focal lengths. For example, the first focal lengths may include 35 mm, 50 mm, and 70 mm, and the second focal lengths may include 150 mm, 170 mm, and 190 mm.
When the text image pairs in the text image data set are used for model training, the regions containing text are generally first cropped from the text images in each pair: the image region cropped from the low-resolution text image of the pair serves as the first training image, and the image region cropped from the high-resolution text image of the pair serves as the second training image. The cropped first training image and second training image have the same size.
Because the text in the two images of a text image pair is the same, to improve processing efficiency one image of the pair is generally used as a reference image: the position of the region to be cropped is obtained in the reference image, and the other image is then cropped according to that position. For example, the high-resolution image of the pair may be used as the reference image, and the low-resolution image may be cropped according to the position of the text in the high-resolution image. However, because the camera may move during shooting, the center point of each image may differ, so the positions of the text in the first training image and the second training image obtained by cropping in this way will differ. This phenomenon is called misalignment, as shown in Fig. 3. Misalignment causes the model to wrongly match the background part of one image with the text part of the other image, so it learns incorrect pixel correspondences and produces ghosting.
Therefore, to solve the above problem, in some embodiments, before the neural network is trained on the first training image and the second training image containing the same text, the first training image and the second training image may also be aligned. Specifically, the first training image may be processed by a pre-trained model so that the first training image is aligned with the second training image. The model may interpolate and translate the first training image to align it with the second training image. The pre-trained model may be a spatial transformation network (Spatial Transformation Networks, STN). Image alignment effectively reduces the ghosting problem and improves the accuracy of the trained neural network.
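For illustration only, the sketch below shows the usual shape of such a spatial transformation module in PyTorch: a small localization network predicts an affine (translation-capable) transform, and grid sampling interpolates the LR image accordingly. The localization architecture is an assumption, not the patent's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignSTN(nn.Module):
    """Predict an affine transform for the LR image so its text lines up with the HR image."""

    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.localization = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 6),
        )
        # initialize the predicted transform to the identity
        self.localization[-1].weight.data.zero_()
        self.localization[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, lr: torch.Tensor) -> torch.Tensor:
        theta = self.localization(lr).view(-1, 2, 3)
        grid = F.affine_grid(theta, lr.size(), align_corners=False)
        return F.grid_sample(lr, grid, align_corners=False)  # interpolated, shifted LR image
```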
The number of first training images and second training images in each set of training images is one. To recognize images better, all images may be rotated to the horizontal direction, and the neural network may then be trained on the rotated first training images and second training images.
At least one of the first training image and the second training image may also be scaled so that the sizes of the first training image and the second training image reach preset values. Specifically, a first training image whose pixel size is smaller than a first size may be up-sampled so that the first training image reaches the first size, and a second training image whose pixel size is smaller than a second size may be up-sampled so that the second training image reaches the second size, where the first size is smaller than the second size. In practice it has been found that, once the pixel height of a text image reaches 16, reconstructing the text image greatly improves the text recognition result, whereas if the pixel height is too small the recognition result is unsatisfactory even after reconstruction; therefore, a pixel height of 16 may be chosen for the first size. Further, the first size may be set to a pixel size of 64×16. On the other hand, once the pixel height exceeds 32, increasing the pixel size further does little to improve text recognition; therefore, a pixel height of 32 may be chosen for the second size. Further, the second size may be set to a pixel size of 128×32.
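As a small illustration of this preprocessing (the interpolation mode and the B x C x H x W tensor layout are assumptions):

```python
import torch.nn.functional as F

def resize_pair(lr, hr):
    """Bring an LR/HR training pair to the fixed sizes discussed above (W x H: 64x16 and 128x32)."""
    lr = F.interpolate(lr, size=(16, 64), mode="bicubic", align_corners=False)
    hr = F.interpolate(hr, size=(32, 128), mode="bicubic", align_corners=False)
    return lr, hr
```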
A portion of the image pairs may also be selected from the text image data set as a test set for evaluating the performance of the trained neural network. According to the resolution of the low-resolution image in each pair, the test set may be divided into three subsets: the resolution of the low-resolution images in the first subset is smaller than a preset third resolution threshold, the resolution of the low-resolution images in the second subset is greater than the third resolution threshold and smaller than a preset fourth resolution threshold, and the resolution of the low-resolution images in the third subset is greater than the fourth resolution threshold, the third resolution threshold being smaller than the fourth resolution threshold. In some embodiments, the third and fourth resolution thresholds may be set according to the resolution range of the low-resolution images in the test set. The performance of the neural network can be tested on each of the three subsets, and its overall performance determined from the three corresponding test results.
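A sketch of this three-way split is given below (assuming, for simplicity, that the resolution of a low-resolution image is compared through its pixel height, and that the two thresholds are chosen from the resolution range of the test set as suggested above; all names are illustrative):

```python
def split_test_set(image_pairs, third_threshold, fourth_threshold):
    """Split (lr_image, hr_image) pairs into three subsets according to the
    resolution (here: pixel height) of the low-resolution image."""
    assert third_threshold < fourth_threshold
    subsets = {"first": [], "second": [], "third": []}
    for lr_image, hr_image in image_pairs:
        height = lr_image.height          # PIL-style attribute
        if height < third_threshold:
            subsets["first"].append((lr_image, hr_image))
        elif height < fourth_threshold:
            subsets["second"].append((lr_image, hr_image))
        else:
            subsets["third"].append((lr_image, hr_image))
    return subsets
```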
Fig. 4 shows the overall flow of the text recognition method according to an embodiment of the present disclosure. First, the overall neural network is trained. The first training image is input into the neural network, where the neural network includes a neural network for feature extraction and a neural network for increasing and decreasing the number of channels of the first text image, for example a convolutional neural network, and may further include a neural network for aligning the training images, for example a spatial transformation network. Each neural network used here for feature extraction may be referred to as a Sequential Residual Block (SRB), and each SRB may include two convolutional neural networks and one bidirectional long short-term memory network (BLSTM). The first training image is first aligned with the second training image; the aligned first training image is then processed by a convolutional neural network so that its number of channels increases, and the result is fed into a plurality of cascaded sequential residual blocks for feature extraction, yielding the feature map of the first training image. The feature map is then up-sampled by an up-sampling module, and the number of channels of the up-sampled image is restored to the original number of channels by a convolutional neural network, yielding the output image corresponding to the first training image. An L2 loss function and a gradient profile loss function are computed from the output image and the second training image corresponding to the first training image, and these two loss functions supervise the training of the overall neural network to obtain its parameters. After the overall neural network has been trained, the first text image to be processed is input into the overall neural network, and the output image of the overall neural network is the second text image. Text recognition is then performed on the second text image to obtain the text recognition result.
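For illustration, the sequential residual block and the supervision described above might look as follows (a minimal PyTorch-style sketch; the channel width, the way feature-map rows are fed to the BLSTM, and the simple image-gradient term standing in for the gradient profile loss are assumptions of the sketch, not the exact design of this disclosure):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRB(nn.Module):
    """Sequential residual block: two convolutions followed by a bidirectional
    LSTM run along the width of each feature-map row, with a residual link."""

    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.blstm = nn.LSTM(channels, channels // 2,
                             bidirectional=True, batch_first=True)

    def forward(self, x):
        n, c, h, w = x.shape
        y = F.relu(self.conv2(F.relu(self.conv1(x))))
        seq = y.permute(0, 2, 3, 1).reshape(n * h, w, c)   # one sequence per row
        seq, _ = self.blstm(seq)
        y = seq.reshape(n, h, w, c).permute(0, 3, 1, 2)
        return x + y                                       # residual connection

def training_step(model, lr_image, hr_image):
    """One supervised step: L2 loss plus a simple image-gradient term used
    here as a stand-in for the gradient profile loss."""
    sr_image = model(lr_image)                  # output has the size of hr_image
    l2_loss = F.mse_loss(sr_image, hr_image)
    def grads(t):
        return t[..., :, 1:] - t[..., :, :-1], t[..., 1:, :] - t[..., :-1, :]
    (sx, sy), (hx, hy) = grads(sr_image), grads(hr_image)
    grad_loss = F.l1_loss(sx, hx) + F.l1_loss(sy, hy)
    return l2_loss + grad_loss
```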
Those skilled in the art can understand that, in the above methods of the specific implementations, the order in which the steps are written does not imply a strict execution order or impose any limitation on the implementation process; the specific execution order of the steps should be determined by their functions and possible internal logic.
As shown in Fig. 5, the present disclosure further provides an image processing apparatus, the apparatus including:
an acquisition module 501, configured to acquire a feature map of a first text image, the feature map including at least one feature sequence, the feature sequence being used to represent a correlation between at least two image blocks in the first text image;
a first processing module 502, configured to process the first text image according to the at least one feature sequence to obtain a second text image, a resolution of the second text image being greater than a resolution of the first text image; and
a text recognition module 503, configured to perform text recognition on the second text image.
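Read together, the three modules amount to the following inference pipeline (a schematic sketch only; the class and call signatures are illustrative and assume each module is available as a callable):

```python
class TextRecognizer:
    """Glue object mirroring the acquisition module 501, the first processing
    module 502 and the text recognition module 503 described above."""

    def __init__(self, acquire_features, super_resolve, recognize_text):
        self.acquire_features = acquire_features    # module 501
        self.super_resolve = super_resolve          # module 502
        self.recognize_text = recognize_text        # module 503

    def __call__(self, first_text_image):
        feature_map = self.acquire_features(first_text_image)
        second_text_image = self.super_resolve(first_text_image, feature_map)
        return self.recognize_text(second_text_image)
```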
In some embodiments, the acquisition module includes: a first acquisition unit, configured to acquire multiple channel images of the first text image and a binary image corresponding to the first text image; and a feature extraction unit, configured to perform feature extraction on the multiple channel images and the binary image to obtain the feature map of the first text image.
In some embodiments, the acquisition module is configured to: input the first text image into a pre-trained neural network, and acquire the feature map output by the neural network.
In some embodiments, the neural network obtains the feature map in the following manner: generating an intermediate image according to the first text image, the number of channels of the intermediate image being greater than the number of channels of the first text image; and performing feature extraction on the intermediate image to obtain the feature map.
In some embodiments, the neural network includes at least one convolutional neural network and a bidirectional long short-term memory network, an output end of the at least one convolutional neural network being connected to an input end of the bidirectional long short-term memory network; and the acquisition module includes: a second acquisition unit, configured to input the first text image into the at least one convolutional neural network and acquire an intermediate image output by the at least one convolutional neural network; and a third acquisition unit, configured to input the intermediate image into the bidirectional long short-term memory network and acquire the feature map output by the bidirectional long short-term memory network.
In some embodiments, the neural network includes a plurality of sub-networks connected in sequence, and the acquisition module is configured to: input an i-th output image output by an i-th sub-network of the plurality of sub-networks into an (i+1)-th sub-network of the plurality of sub-networks, so as to generate an (i+1)-th intermediate image through the (i+1)-th sub-network and perform feature extraction on the (i+1)-th intermediate image to obtain an (i+1)-th output image, the number of channels of the (i+1)-th intermediate image being greater than the number of channels of the i-th output image; and determine an N-th output image as the feature map, where i and N are positive integers, N is the total number of sub-networks, 1≤i≤N-1, and N≥2, and where the first output image is obtained in the following manner: the first sub-network generates a first intermediate image according to the first text image, and performs feature extraction on the first intermediate image to obtain the first output image.
In some embodiments, the apparatus further includes: a second processing module, configured to, before the first text image is processed according to the at least one feature sequence, process the first text image so that the number of channels of the first text image is the same as the number of channels of the feature map.
In some embodiments, the apparatus further includes: a third processing module, configured to, after the second text image is obtained, process the second text image so that the number of channels of the second text image is the same as the number of channels of the first text image; and the text recognition module is configured to perform text recognition on the processed second text image.
In some embodiments, the apparatus further includes: a training module, configured to train the neural network based on at least one group of training images, each group of training images including a first training image and a second training image, the first training image and the second training image including the same text, where the resolution of the first training image is smaller than a first resolution threshold, the resolution of the second training image is greater than a second resolution threshold, and the first resolution threshold is smaller than or equal to the second resolution threshold.
In some embodiments, the training module includes: an input unit, configured to input the first training image into the neural network and acquire an output image of the neural network; a determining unit, configured to determine a loss function based on the output image and the second training image corresponding to the first training image; and a training unit, configured to perform supervised training on the neural network based on the loss function.
In some embodiments, the loss function includes at least one of a first loss function and a second loss function; the first loss function is determined based on the mean square error of corresponding pixels in the first training image and the second training image; and/or the second loss function is determined based on the difference between the gradient fields of corresponding pixels in the first training image and the second training image.
In some embodiments, the apparatus further includes an alignment module, configured to align the first training image and the second training image before the neural network is trained based on the at least one group of training images.
In some embodiments, the alignment module is configured to process the first training image through a pre-trained spatial transformation network, so as to align the text in the first training image with the text in the second training image.
In some embodiments, the first training image is obtained by photographing a subject at a first position with a first image acquisition device provided with a first focal length; the second training image is obtained by photographing the subject at the first position with a second image acquisition device provided with a second focal length; and the first focal length is smaller than the second focal length.
In some embodiments, the functions of, or the modules included in, the apparatus provided by the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments; for specific implementations, reference may be made to the descriptions of the above method embodiments, which are not repeated here for brevity.
The embodiments of this specification further provide a computer device, which includes at least a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method described in any of the foregoing embodiments when executing the program.
The embodiments of the present disclosure further include a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method described in any of the embodiments when executing the program.
Fig. 6 shows a more specific schematic diagram of the hardware structure of a computing device provided by an embodiment of this specification. The device may include a processor 601, a memory 602, an input/output interface 603, a communication interface 604, and a bus 605, where the processor 601, the memory 602, the input/output interface 603, and the communication interface 604 are communicatively connected to one another within the device through the bus 605.
The processor 601 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs so as to implement the technical solutions provided by the embodiments of this specification.
The memory 602 may be implemented in the form of a ROM (Read-Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 602 may store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented by software or firmware, the relevant program code is stored in the memory 602 and invoked and executed by the processor 601.
The input/output interface 603 is used to connect an input/output module to realize information input and output. The input/output module may be configured in the device as a component (not shown in the figure), or may be externally connected to the device to provide corresponding functions. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, and the like, and the output devices may include a display, a speaker, a vibrator, an indicator light, and the like.
The communication interface 604 is used to connect a communication module (not shown in the figure) to realize communication and interaction between this device and other devices. The communication module may communicate by wired means (for example, USB or a network cable) or by wireless means (for example, a mobile network, WIFI, or Bluetooth).
The bus 605 includes a path for transmitting information between the components of the device (for example, the processor 601, the memory 602, the input/output interface 603, and the communication interface 604).
It should be noted that, although the above device only shows the processor 601, the memory 602, the input/output interface 603, the communication interface 604, and the bus 605, in a specific implementation process the device may further include other components necessary for normal operation. In addition, those skilled in the art can understand that the above device may also include only the components necessary to implement the solutions of the embodiments of this specification, and need not include all the components shown in the figure.
The embodiments of this specification further provide a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method described in any of the foregoing embodiments.
The embodiments of this specification further provide a computer program, wherein the computer program, when executed by a processor, implements the method described in any of the foregoing embodiments.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
From the description of the above implementations, those skilled in the art can clearly understand that the embodiments of this specification can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments of this specification, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the various embodiments of this specification or in certain parts of the embodiments.
The systems, apparatuses, modules, or units illustrated in the above embodiments may be specifically implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer, and the specific form of the computer may be a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an e-mail transceiver device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the apparatus embodiments are basically similar to the method embodiments, their description is relatively brief, and for relevant details reference may be made to the description of the method embodiments. The apparatus embodiments described above are merely illustrative, and the modules described as separate components may or may not be physically separated; when implementing the solutions of the embodiments of this specification, the functions of the modules may be implemented in one or more pieces of software and/or hardware. Some or all of the modules may also be selected according to actual needs to achieve the objectives of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement them without creative work.

Claims (18)

  1. A text recognition method, the method comprising:
    acquiring a feature map of a first text image, the feature map including at least one feature sequence, the feature sequence being used to represent a correlation between at least two image blocks in the first text image;
    processing the first text image according to the at least one feature sequence to obtain a second text image, a resolution of the second text image being greater than a resolution of the first text image; and
    performing text recognition on the second text image.
  2. The method according to claim 1, wherein the acquiring a feature map of a first text image comprises:
    acquiring multiple channel images of the first text image and a binary image corresponding to the first text image; and
    performing feature extraction on the multiple channel images and the binary image to obtain the feature map of the first text image.
  3. The method according to claim 1 or 2, wherein the acquiring a feature map of a first text image comprises:
    inputting the first text image into a pre-trained neural network, and acquiring the feature map output by the neural network.
  4. The method according to claim 3, wherein the neural network obtains the feature map in the following manner:
    generating an intermediate image according to the first text image, a number of channels of the intermediate image being greater than a number of channels of the first text image; and
    performing feature extraction on the intermediate image to obtain the feature map.
  5. The method according to claim 3 or 4, wherein the neural network comprises at least one convolutional neural network and a bidirectional long short-term memory network, an output end of the at least one convolutional neural network being connected to an input end of the bidirectional long short-term memory network; and
    the acquiring a feature map of the first text image comprises:
    inputting the first text image into the at least one convolutional neural network, and acquiring an intermediate image output by the at least one convolutional neural network; and
    inputting the intermediate image into the bidirectional long short-term memory network, and acquiring the feature map output by the bidirectional long short-term memory network.
  6. The method according to any one of claims 3 to 5, wherein the neural network comprises a plurality of sub-networks connected in sequence; and
    the inputting the first text image into a pre-trained neural network and acquiring the feature map output by the neural network comprises:
    inputting an i-th output image output by an i-th sub-network of the plurality of sub-networks into an (i+1)-th sub-network of the plurality of sub-networks, so as to generate an (i+1)-th intermediate image through the (i+1)-th sub-network and perform feature extraction on the (i+1)-th intermediate image to obtain an (i+1)-th output image, a number of channels of the (i+1)-th intermediate image being greater than a number of channels of the i-th output image; and
    determining an N-th output image as the feature map;
    wherein i and N are positive integers, N is the total number of sub-networks, 1≤i≤N-1, and N≥2; and
    wherein a first output image is obtained in the following manner: a first sub-network generates a first intermediate image according to the first text image, and performs feature extraction on the first intermediate image to obtain the first output image.
  7. The method according to any one of claims 1 to 6, further comprising:
    before processing the first text image according to the at least one feature sequence, processing the first text image so that a number of channels of the first text image is the same as a number of channels of the feature map.
  8. The method according to claim 7, further comprising:
    after obtaining the second text image, processing the second text image so that a number of channels of the second text image is the same as the number of channels of the first text image;
    wherein the performing text recognition on the second text image comprises:
    performing text recognition on the processed second text image.
  9. The method according to any one of claims 3 to 8, further comprising:
    training the neural network based on at least one group of training images, each group of training images including a first training image and a second training image, the first training image and the second training image including the same text;
    wherein a resolution of the first training image is smaller than a first resolution threshold, a resolution of the second training image is greater than a second resolution threshold, and the first resolution threshold is smaller than or equal to the second resolution threshold.
  10. The method according to claim 9, wherein the training the neural network based on at least one group of training images comprises:
    inputting the first training image into the neural network, and acquiring an output image of the neural network;
    determining a loss function based on the output image and the second training image corresponding to the first training image; and
    performing supervised training on the neural network based on the loss function.
  11. The method according to claim 10, wherein the loss function comprises at least one of a first loss function and a second loss function;
    the first loss function is determined based on a mean square error of corresponding pixels in the first training image and the second training image; and/or
    the second loss function is determined based on a difference between gradient fields of corresponding pixels in the first training image and the second training image.
  12. The method according to any one of claims 9 to 11, further comprising:
    before training the neural network based on the at least one group of training images, aligning the first training image and the second training image.
  13. The method according to claim 12, wherein the aligning the first training image and the second training image comprises:
    processing the first training image through a pre-trained spatial transformation network, so as to align text in the first training image with text in the second training image.
  14. The method according to any one of claims 9 to 13, wherein the first training image is obtained by photographing a subject at a first position with a first image acquisition device provided with a first focal length;
    the second training image is obtained by photographing the subject at the first position with a second image acquisition device provided with a second focal length; and
    the first focal length is smaller than the second focal length.
  15. A text recognition apparatus, the apparatus comprising:
    an acquisition module, configured to acquire a feature map of a first text image, the feature map including at least one feature sequence, the feature sequence being used to represent a correlation between at least two image blocks in the first text image;
    a first processing module, configured to process the first text image according to the at least one feature sequence to obtain a second text image, a resolution of the second text image being greater than a resolution of the first text image; and
    a text recognition module, configured to perform text recognition on the second text image.
  16. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 14.
  17. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 14 when executing the program.
  18. A computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 14.
PCT/CN2021/088389 2020-04-30 2021-04-20 Text identification method and apparatus, device, and storage medium WO2021218706A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2022520075A JP2022550195A (en) 2020-04-30 2021-04-20 Text recognition method, device, equipment, storage medium and computer program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010362519.XA CN111553290A (en) 2020-04-30 2020-04-30 Text recognition method, device, equipment and storage medium
CN202010362519.X 2020-04-30

Publications (1)

Publication Number Publication Date
WO2021218706A1 true WO2021218706A1 (en) 2021-11-04

Family

ID=72000292

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/088389 WO2021218706A1 (en) 2020-04-30 2021-04-20 Text identification method and apparatus, device, and storage medium

Country Status (3)

Country Link
JP (1) JP2022550195A (en)
CN (1) CN111553290A (en)
WO (1) WO2021218706A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111553290A (en) * 2020-04-30 2020-08-18 北京市商汤科技开发有限公司 Text recognition method, device, equipment and storage medium
CN112419159A (en) * 2020-12-07 2021-02-26 上海互联网软件集团有限公司 Character image super-resolution reconstruction system and method
CN112633429A (en) * 2020-12-21 2021-04-09 安徽七天教育科技有限公司 Method for recognizing handwriting choice questions of students
CN117037136B (en) * 2023-10-10 2024-02-23 中国科学技术大学 Scene text recognition method, system, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
CN110033000A (en) * 2019-03-21 2019-07-19 华中科技大学 A kind of text detection and recognition methods of bill images
CN110084172A (en) * 2019-04-23 2019-08-02 北京字节跳动网络技术有限公司 Character recognition method, device and electronic equipment
CN110168573A (en) * 2016-11-18 2019-08-23 易享信息技术有限公司 Spatial attention model for image labeling
CN111553290A (en) * 2020-04-30 2020-08-18 北京市商汤科技开发有限公司 Text recognition method, device, equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10043231B2 (en) * 2015-06-30 2018-08-07 Oath Inc. Methods and systems for detecting and recognizing text from images
CN107368831B (en) * 2017-07-19 2019-08-02 中国人民解放军国防科学技术大学 English words and digit recognition method in a kind of natural scene image
CN109800749A (en) * 2019-01-17 2019-05-24 湖南师范大学 A kind of character recognition method and device
CN110443239A (en) * 2019-06-28 2019-11-12 平安科技(深圳)有限公司 The recognition methods of character image and its device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110168573A (en) * 2016-11-18 2019-08-23 易享信息技术有限公司 Spatial attention model for image labeling
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
CN110033000A (en) * 2019-03-21 2019-07-19 华中科技大学 A kind of text detection and recognition methods of bill images
CN110084172A (en) * 2019-04-23 2019-08-02 北京字节跳动网络技术有限公司 Character recognition method, device and electronic equipment
CN111553290A (en) * 2020-04-30 2020-08-18 北京市商汤科技开发有限公司 Text recognition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111553290A (en) 2020-08-18
JP2022550195A (en) 2022-11-30

Similar Documents

Publication Publication Date Title
WO2021218706A1 (en) Text identification method and apparatus, device, and storage medium
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
US20200364478A1 (en) Method and apparatus for liveness detection, device, and storage medium
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
US10062195B2 (en) Method and device for processing a picture
Lu et al. Robust blur kernel estimation for license plate images from fast moving vehicles
WO2014014678A1 (en) Feature extraction and use with a probability density function and divergence|metric
US20210256657A1 (en) Method, system, and computer-readable medium for improving quality of low-light images
US20200279166A1 (en) Information processing device
JP2013541119A (en) System and method for improving feature generation in object recognition
CN108876716B (en) Super-resolution reconstruction method and device
US8873839B2 (en) Apparatus of learning recognition dictionary, and method of learning recognition dictionary
Li et al. CG-DIQA: No-reference document image quality assessment based on character gradient
CN103198299A (en) Face recognition method based on combination of multi-direction dimensions and Gabor phase projection characteristics
CN113436222A (en) Image processing method, image processing apparatus, electronic device, and storage medium
US11481919B2 (en) Information processing device
CN112348008A (en) Certificate information identification method and device, terminal equipment and storage medium
CN111753714A (en) Multidirectional natural scene text detection method based on character segmentation
CN116469172A (en) Bone behavior recognition video frame extraction method and system under multiple time scales
CN113112531B (en) Image matching method and device
WO2022034678A1 (en) Image augmentation apparatus, control method, and non-transitory computer-readable storage medium
CN111402281B (en) Book edge detection method and device
CN114170589A (en) Rock lithology identification method based on NAS, terminal equipment and storage medium
Tran et al. Super-resolution in music score images by instance normalization
CN112036342A (en) Document snapshot method, device and computer storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21795694

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022520075

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21795694

Country of ref document: EP

Kind code of ref document: A1