CN109271967B - Method and device for recognizing text in image, electronic equipment and storage medium - Google Patents


Info

Publication number
CN109271967B
CN109271967B
Authority
CN
China
Prior art keywords
text
layer
region
character
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811202558.2A
Other languages
Chinese (zh)
Other versions
CN109271967A (en)
Inventor
刘铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201811202558.2A
Publication of CN109271967A
Application granted
Publication of CN109271967B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a method and a device for recognizing text in an image, an electronic device and a computer-readable storage medium. The scheme performs end-to-end recognition of text in an image through a multi-layer superposed network model, and comprises the following steps: performing a spatially separable convolution operation on the image layer by layer in a multi-layer manner, and fusing the convolution features extracted by the spatially separable convolution operation to the lower layer mapped through layer-by-layer superposition, the lower layer being mapped to the higher layer that outputs the convolution features; obtaining global features from the bottom layer that performs the spatially separable convolution operation; performing candidate region detection and region screening parameter prediction for text in the image through the global features, and obtaining pooled features corresponding to the detected text regions; and propagating the pooled features backward to a recognition branch network layer that performs a character recognition operation, and outputting the character sequence marked by the text region through the recognition branch network layer. The scheme saves model training time and improves recognition accuracy.

Description

Method and device for recognizing text in image, electronic equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method and an apparatus for recognizing text in an image, an electronic device, and a computer-readable storage medium.
Background
In the field of computer image processing, text recognition refers to having a computer automatically determine which character in a pre-established character library each character in an image corresponds to; the character library is built in advance and usually contains the characters most commonly used in real life.
Recognition of text in an image is usually realized by building two models. One model is used to find the position of the text in a natural-scene image containing text, after which the text region is cut out of the image. The other model is used to recognize the specific character content of the text region. Specifically, a large number of sample images containing different characters are obtained as a training set, and the sample images are used to train a character classifier and a text locator separately. After training is finished, the text locator first locates a text region in the image under test, the text region is then cut out, and the character classifier then recognizes the character content of the text region.
In this scheme, the sample images must be used to train the character classifier and the text locator separately, so the workload of model training is large; moreover, the final character recognition accuracy is affected by the accuracy of both models, which limits the improvement of text recognition accuracy in images.
Disclosure of Invention
The invention provides a method for recognizing text in an image, which aims to solve the problems in the related art that the character classifier and the text locator must be trained separately, the workload of model training is large, and the recognition accuracy is low.
The invention provides a method for recognizing texts in images, which executes end-to-end recognition of texts in images through a multilayer superposed network model, and comprises the following steps:
performing spatial separable convolution operation of the image layer by layer in a multilayer mode, fusing convolution features extracted by the spatial separable convolution operation to a lower layer mapped by layer-by-layer superposition, and mapping the lower layer with a higher layer outputting the convolution features;
obtaining global features from a bottom layer performing a spatially separable convolution operation;
performing candidate region detection and region screening parameter prediction of a text in the image through the global features to obtain pooling features corresponding to the detected text regions;
and backward propagating the pooled features to a recognition branch network layer which executes character recognition operation, and outputting the character sequence of the text region mark through the recognition branch network layer.
In another aspect, the present invention provides an apparatus for recognizing a text in an image, the apparatus performing end-to-end recognition of the text in the image through a network model in which a plurality of layers are stacked, the apparatus comprising:
the spatial convolution operation module is used for carrying out spatial separable convolution operation on the image layer by layer in a multilayer mode, fusing convolution features extracted by the spatial separable convolution operation to a lower layer mapped by layer-by-layer superposition, and mapping the lower layer with a higher layer outputting the convolution features;
a global feature extraction module for obtaining global features from a bottom layer performing a spatially separable convolution operation;
the pooling feature obtaining module is used for detecting candidate regions of texts in the images and predicting region screening parameters through the global features to obtain pooling features corresponding to the detected text regions;
and the character sequence output module is used for backward propagating the pooled features to a recognition branch network layer for executing character recognition operation, and outputting the character sequence of the text region mark through the recognition branch network layer.
In another aspect, the present invention further provides an electronic device, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform a method for recognizing text in the image.
In addition, the invention also provides a computer readable storage medium, which stores a computer program, and the computer program can be executed by a processor to complete the method for recognizing the text in the image.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
according to the technical scheme provided by the invention, the end-to-end recognition of the text in the image is executed through the network models which are stacked in a multilayer manner, so that the recognition of the text in the image can be realized only by training one network model without separately training a text positioner and a character classifier, the workload of model training is reduced, the accuracy of final recognition is only influenced by the accuracy of one network model, the improvement of the recognition accuracy can be facilitated, and the situation that the improvement of the recognition accuracy is mutually limited by the two models is avoided.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic illustration of an implementation environment according to the present invention;
FIG. 2 is a block diagram illustrating an apparatus in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a method for recognition of text in an image according to an exemplary embodiment;
FIG. 4 is a schematic diagram of a network architecture of spatially separable convolutional network layers;
FIG. 5 is a schematic diagram of a network architecture for recognizing characters in an image according to the present invention;
FIG. 6 is a detailed flowchart of step 350 in a corresponding embodiment of FIG. 3;
FIG. 7 is a schematic diagram of the principle of the pooling layer extracting pixel-level region screening parameters from global features;
FIG. 8 is a flowchart showing details of step 353 in the corresponding embodiment of FIG. 6;
FIG. 9 is a detailed flowchart of step 370 in a corresponding embodiment of FIG. 3;
FIG. 10 is a block diagram illustrating the architecture of the identified branch network layer;
FIG. 11 is a schematic diagram of a network architecture of a method for recognizing text in an image according to the present invention;
FIG. 12 is a flowchart of a method for recognizing text in an image according to another embodiment based on the corresponding embodiment in FIG. 3;
FIG. 13 is a flowchart detailing step 1230 in a corresponding embodiment of FIG. 12;
FIG. 14 is a detailed flowchart of step 1231 in a corresponding embodiment of FIG. 13;
FIG. 15 is a schematic diagram of an effect of the present invention in practical application;
FIG. 16 is a block diagram illustrating an apparatus for recognition of text in an image according to an exemplary embodiment;
FIG. 17 is a block diagram of details of a pooled feature acquisition module in a corresponding embodiment of FIG. 16;
FIG. 18 is a detailed block diagram of the screening and rotating unit in the corresponding embodiment of FIG. 17.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
FIG. 1 is a schematic illustration of an implementation environment in accordance with the present invention. The implementation environment includes user equipment 110, which may perform recognition of text in an image by running an application. The user equipment may be a server, a desktop computer, a mobile terminal, an intelligent appliance, etc.
The user device 110 may include an image capturing device 111 such as a camera, and further perform text recognition on an image captured by the image capturing device 111 by using the method provided by the present invention.
Depending on requirements, the implementation environment may further include a server 130 in addition to the user equipment 110. The server 130 is connected to the user equipment 110 through a wired or wireless network and sends the image to be recognized to the user equipment 110, and the user equipment 110 recognizes the text in the image.
In practical application, the text content recognized from the image can be further subjected to text translation, text content editing, storage and the like. The method for recognizing the text in the image can be applied to a text recognition task in any scene, and realizes understanding of the content of the text in the image, such as recognition of characters in natural scene character pictures, advertisement pictures, videos, identity cards, driving licenses, business cards and license plates.
Fig. 2 is a block diagram illustrating an apparatus 200 according to an example embodiment. The apparatus 200 may be, for example, the user equipment 110 in the implementation environment shown in fig. 1.
Referring to fig. 2, the apparatus 200 may include one or more of the following components: a processing component 202, a memory 204, a power component 206, a multimedia component 208, an audio component 210, a sensor component 214, and a communication component 216.
The processing component 202 generally controls overall operation of the device 200, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations, among others. The processing components 202 may include one or more processors 218 to execute instructions to perform all or a portion of the steps of the methods described below. Further, the processing component 202 can include one or more modules that facilitate interaction between the processing component 202 and other components. For example, the processing component 202 can include a multimedia module to facilitate interaction between the multimedia component 208 and the processing component 202.
The memory 204 is configured to store various types of data to support operations at the apparatus 200. Examples of such data include instructions for any application or method operating on the apparatus 200. The Memory 204 may be implemented by any type of volatile or non-volatile Memory device or combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk or optical disk. Also stored in memory 204 are one or more modules configured to be executed by the one or more processors 218 to perform all or a portion of the steps of any of the methods of fig. 3, 6, 8, 9, 12-14, described below.
The power supply component 206 provides power to the various components of the device 200. The power components 206 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 200.
The multimedia component 208 includes a screen that provides an output interface between the device 200 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a touch panel. If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. The screen may further include an Organic Light Emitting Display (OLED for short).
The audio component 210 is configured to output and/or input audio signals. For example, the audio component 210 may include a Microphone (MIC) configured to receive external audio signals when the device 200 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 204 or transmitted via the communication component 216. In some embodiments, audio component 210 also includes a speaker for outputting audio signals.
The sensor component 214 includes one or more sensors for providing various aspects of status assessment for the device 200. For example, the sensor assembly 214 may detect the open/closed state of the device 200 and the relative positioning of components; it may also detect a change in position of the device 200 or a component of the device 200, and a change in temperature of the device 200. In some embodiments, the sensor assembly 214 may also include a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 216 is configured to facilitate wired or wireless communication between the apparatus 200 and other devices. The device 200 may access a wireless network based on a communication standard, such as WiFi (Wireless Fidelity). In an exemplary embodiment, the communication component 216 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 216 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth technology, and other technologies.
In an exemplary embodiment, the apparatus 200 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital signal processors, digital signal processing devices, programmable logic devices, field programmable gate arrays, controllers, microcontrollers, microprocessors or other electronic components for performing the methods described below.
FIG. 3 is a flow diagram illustrating a method for recognition of text in an image according to an exemplary embodiment. The method may be executed by a user device, which may be the user equipment 110 of the implementation environment shown in fig. 1. The method performs end-to-end recognition of text in an image through a multi-layer superposed network model. End-to-end recognition means that the input of the network model is the original image data and the output is the final character sequence. As shown in fig. 3, the method specifically includes the following steps.
In step 310, spatially separable convolution operations are performed on the image layer by layer in a multi-layer manner, and the convolution features extracted by the spatially separable convolution operations are fused to the lower layer mapped through layer-by-layer superposition, the lower layer being mapped to the higher layer that outputs the convolution features.
It should be noted that the network model of the multi-layer stack may include a spatially separable convolutional network layer, a regional regression network layer, a pooling layer, a temporal convolutional network layer, and a character classification layer. The spatial separable convolution network layer, the regional regression network layer and the pooling layer are used as detection branches for extracting pooling characteristics of text regions in the images according to original image data, and the temporal convolution network layer and the character classification layer are used as identification branches for outputting character sequences of the text regions according to the pooling characteristics of the text regions.
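As a structural illustration of how these layers are wired together, the following sketch, assuming PyTorch, composes the two branches described above; all module names and constructor arguments here are hypothetical placeholders rather than the patent's actual layers.

```python
import torch
import torch.nn as nn

class EndToEndTextSpotter(nn.Module):
    def __init__(self, backbone, region_regression, roi_pooling, temporal_conv, char_classifier):
        super().__init__()
        # Detection branch: spatially separable convolutions -> region regression -> pooling.
        self.backbone = backbone
        self.region_regression = region_regression
        self.roi_pooling = roi_pooling          # screens and rotates text regions
        # Recognition branch: temporal convolution -> character classification.
        self.temporal_conv = temporal_conv
        self.char_classifier = char_classifier

    def forward(self, images):
        global_features = self.backbone(images)                        # steps 310 and 330
        candidate_boxes = self.region_regression(global_features)      # step 351
        pooled_features = self.roi_pooling(global_features, candidate_boxes)  # step 353
        char_features = self.temporal_conv(pooled_features)            # step 371
        return self.char_classifier(char_features)                     # step 372
```

Each placeholder sub-module would be replaced by the concrete layers described in the following steps.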
Specifically, the spatially separable convolution operation refers to performing convolution calculation on the image to be recognized layer by layer in a multi-layer manner using spatially separable convolution (Effnet) layers. The spatially separable convolutional layers comprise mapped higher and lower layers; "higher" and "lower" are relative concepts: of two mapped layers, the one computed first is the higher layer and the one computed later is the lower layer. Fusing the convolution features extracted by the higher-layer convolution calculation to the lower layer mapped through layer-by-layer superposition means that the convolution result of the lower layer is combined with the convolution result of the higher layer. Because more detail is lost as the number of convolution layers increases, fusing the convolution features extracted at the higher layer into the lower layer retains more detail and avoids information loss.
In step 330, global features are obtained from the lowest layer where the spatially separable convolution operation is performed.
The bottom layer is the last output layer of the spatially separable convolutional layers. The spatially separable convolutional layers perform the spatially separable convolution operation on the original image to be recognized layer by layer in a multi-layer manner, and the finally output feature matrix is called the global feature. The global feature may be used to characterize the feature information of the original input image.
Fig. 4 is a schematic diagram of a network architecture of a spatially separable convolutional layer, and as shown in fig. 4, an original image to be recognized is used as an input of the spatially separable convolutional layer, and then convolutional calculation is performed layer by layer, features extracted from a high layer are fused to a lower layer of a mapping, and global features are output at the lowest layer of the spatially separable convolutional layer. Wherein each parallelogram represents the convolution feature extracted for each layer.
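The following minimal sketch, assuming PyTorch, illustrates the two operations described above: factoring a 3 × 3 convolution into a 1 × 3 convolution followed by a 3 × 1 convolution (the spatially separable convolution), and fusing the features of one layer into the layer it maps to by resizing and adding. The channel counts, activations and pooling step are illustrative assumptions, not the patent's exact Effnet configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableConvBlock(nn.Module):
    """One spatially separable convolution step: 1x3 followed by 3x1 instead of a full 3x3."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv_1x3 = nn.Conv2d(in_ch, out_ch, kernel_size=(1, 3), padding=(0, 1))
        self.conv_3x1 = nn.Conv2d(out_ch, out_ch, kernel_size=(3, 1), padding=(1, 0))
        self.pool = nn.MaxPool2d(2)   # stand-in for the layer-by-layer downsampling

    def forward(self, x):
        x = F.relu(self.conv_1x3(x))
        x = F.relu(self.conv_3x1(x))
        return self.pool(x)

def fuse(features_from_one_layer, features_of_mapped_layer):
    """Fuse convolution features into the layer they map to by resizing and adding."""
    resized = F.interpolate(features_from_one_layer,
                            size=features_of_mapped_layer.shape[-2:],
                            mode="bilinear", align_corners=False)
    # Assumes matching channel counts; a 1x1 convolution would align them otherwise.
    return features_of_mapped_layer + resized
```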
In step 350, candidate region detection and region screening parameter prediction of the text in the image are performed through the global features, and pooling features corresponding to the detected text regions are obtained.
The candidate area detection means detecting a candidate area where a text in an image is located according to the global features, and the number of the candidate areas may be multiple. The prediction of the region screening parameters refers to obtaining predicted values of the region screening parameters according to the global features, and the candidate regions can be screened according to the predicted values, so that the detection precision of the text regions in the image is improved. The pooling characteristic of the text region refers to characteristic data of the text region output by the pooling layer, and in one embodiment, the pooling characteristic of the text region may be image data after the text region is leveled, and the leveling refers to rotating the tilted text region to a horizontal position.
Specifically, the global features output by the spatial separable convolutional network layer may be input into the regional regression network layer and the pooling layer, respectively, and candidate regions of the text in the image are detected by the regional regression network layer, so as to output candidate regions of text borders, which are simply referred to as border candidate regions. The global features are subjected to convolution transformation through the pooling layer to realize prediction of region screening parameters, then the pooling layer screens frame candidate regions according to the region screening parameters, text regions in the image can be detected, and then the inclined text regions are rotated to obtain image data of the horizontal text regions to serve as the pooling features of the text regions.
In step 370, the pooled features are propagated back to a recognition branch network layer performing character recognition operations, through which the character sequence of the text region label is output.
The recognition branch network layer comprises the last few layers of the multi-layer superposed network model and is used to recognize the characters contained in the text region according to the pooled features of the text region. Specifically, the recognition branch network layer comprises the temporal convolutional network layer and the character classification layer of the network model: the pooling layer transmits the pooled features of the text region to the temporal convolutional network layer, convolution calculation is performed on the pooled features through the temporal convolutional network layer to extract character sequence features, the character sequence features are transmitted to the character classification layer, and the character classification layer outputs the probability that each character belongs to each character in the dictionary.
For example, assuming that the dictionary contains 7439 words, the character classification layer may output, for each character in the text region, the probability that it belongs to each word in the dictionary; the word with the highest probability is the recognition result of that character. By outputting the recognition result of each character in the text region in this way, the character sequence marked by the text region is obtained.
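A toy sketch of this per-character argmax decoding, using plain NumPy and a five-word stand-in for the 7439-word dictionary (the dictionary contents and probabilities here are invented for illustration):

```python
import numpy as np

dictionary = ["的", "一", "是", "在", "不"]          # stand-in for the 7439-word dictionary
# probs[i, j] = probability that character position i in the text region is dictionary word j
probs = np.array([[0.10, 0.70, 0.10, 0.05, 0.05],
                  [0.05, 0.10, 0.60, 0.15, 0.10]])
recognized = "".join(dictionary[j] for j in probs.argmax(axis=1))
print(recognized)  # the character sequence marked by the text region
```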
According to the technical scheme provided by this exemplary embodiment of the invention, end-to-end recognition of text in an image is performed through a multi-layer superposed network model, so that text in an image can be recognized by training only one network model, without separately training a text locator and a character classifier. This reduces the workload of model training; moreover, the final recognition accuracy is affected by the accuracy of only one network model, which facilitates improving the recognition accuracy and avoids the situation in which the two models mutually limit its improvement.
With respect to the technical solutions provided by the above exemplary embodiments of the present invention, fig. 5 is a flowchart of a text recognition scheme. As shown in fig. 5, this text recognition scheme divides text detection and recognition into two tasks, and the recognition task can be performed only after the detection task is completed. Specifically, during detection, the original image is first input into a feature extraction convolutional network, and the extracted features are then transmitted to a regional regression network, which outputs the detected frame candidate regions. These regions are rough, however, and further frame regression is needed to improve the accuracy of the frame so that it lies closer to the edge of the characters; secondary frame regression and classification give the coordinates of the character frames in the image and the corresponding confidence, i.e. the possibility of containing characters. The two prediction results are compared with the position of the character label in the image, the prediction loss is then calculated through a loss function, and the parameter update of the model is adjusted according to the loss.
When oblique characters are detected, a large blank area exists above the frame candidate area detected by the regional regression network, which reduces the precision of the detection frame. The frame candidate area output by the regional regression network and the global features extracted by the feature extraction convolutional network are therefore input together into the rotational region-of-interest pooling layer to obtain the detected oblique character area. As shown in fig. 5, the inclined text area is marked in the original image in the form of a text box, and the corresponding area is then cut out of the original image according to the coordinates of the text box, completing the positioning of the area where the text is located. It should be noted that at this stage there is already a positioning error in the region where the text is located.
The region image containing the cut-out characters is then input into a recognition network. The recognition network first extracts the convolution features of the input region image and provides the extracted convolution features to a character classification layer, which recognizes the character sequence represented by the input; when all character regions in the original image have been recognized, the character recognition task for the original image is complete. It should be noted that, at this stage, the difference between the character sequence output by the character classification layer and the actual character sequence needs to be calculated through another loss function, and the parameter update of the recognition network and the character classification layer is adjusted according to this difference. That is, since there is an error in character recognition at this stage, the final overall recognition error includes both the error in character region positioning and the error in character recognition.
It should be noted that if the character region positioning error is large, the improvement of the overall recognition accuracy is limited even if the accuracy of character recognition is improved. Training region detection and character recognition separately is not conducive to performance improvement: the error generated in the recognition stage cannot be propagated to the detection part to correct the parameters of the detection model, so a bottleneck in detection or recognition performance may arise on some training sets. Moreover, training the detection model and the recognition model separately increases the workload of model training. The feature extraction convolutional network also extracts features slowly, which limits the number of tasks the whole system can process per unit time and is not conducive to deploying the model on a mobile terminal.
The invention realizes end-to-end recognition of text in an image through a multi-layer superposed network model: the input is the original image and the output is the character sequence, so the accuracy of the final character recognition is determined by the error of only one model. Combining the two tasks in one model avoids the performance bottleneck caused by separate training and benefits the recognition accuracy. Because only one network model is trained to realize text recognition, the time for training the model is greatly reduced: at least half of the time is saved compared with training two models separately, and in practice the time for tuning parameters is reduced by a factor of 4 to 5, because the parameter settings of the two models differ. In addition, the invention uses the Effnet network architecture to perform the spatially separable convolution operation on the image and fuses the convolution features extracted by the spatially separable convolution operation to the lower layer mapped through layer-by-layer superposition, which not only accelerates the global feature extraction stage, but also overcomes the drawback of existing accelerated network structures that model precision must be sacrificed for speed, reduces the storage space required to run the model, and facilitates deployment and application on mobile terminals.
In an exemplary embodiment, as shown in fig. 6, the step 350 specifically includes:
in step 351, inputting the global features into a regional regression network layer for performing candidate region detection, and outputting a frame candidate region of the text in the image through the regional regression network layer;
it should be explained that the present invention performs end-to-end recognition of text in an image through a network model with multiple layers superimposed, and the local regression network layers are several layers of the network model and are used for detecting the regions where the text may be located, that is, performing candidate region detection.
Specifically, global features are extracted from the original image through the spatially separable convolutional network layer of the network model, the global features are input into the regional regression network layer, and the frame candidate regions of the text in the image are output through the regional regression network layer. A frame candidate region is a region that the edge of the text may enclose. In the training stage, the frame candidate regions of the text can be output through the regional regression network layer, secondary frame regression and classification are performed on the frame candidate regions to obtain the detected candidate frames and their confidence (the possibility of containing characters), the multi-task loss is calculated according to the position coordinates of the actual text frame, and the loss is minimized by adjusting the parameters of the regional regression network layer. The regional regression network layer may be Faster R-CNN (a fast object-detection convolutional neural network), whose main contribution is a network architecture for extracting candidate regions that replaces time-consuming selective search and greatly improves detection speed.
In step 352, the bounding box candidate region is input into a pooling layer that performs region screening and region rotation;
the pooling layer is connected with the space separable convolutional network layer and is used for performing region screening and region rotation on the frame candidate region according to the global features output by the space separable convolutional network layer. The region screening refers to screening out a region where an accurate text is located from a plurality of frame candidate regions, and the region rotation refers to rotating an inclined text region to a horizontal position. Therefore, the frame candidate region output by the regional regression network layer and the global feature output by the spatial separable convolution network layer are jointly input into the pooling layer.
In step 353, according to the pixel level region screening parameter obtained by predicting the region screening parameter of the global feature by the pooling layer, screening the text region from the frame candidate region and rotating the text region to a horizontal position, so as to obtain the pooling feature of the text region.
The pixel-level region screening parameters are parameters, predicted from the global features, for screening and rotating the frame candidate regions. The pixel-level region screening parameters may include the pixel-level classification confidence, the pixel-level rotation angle, and the pixel-level border distance. The text region is the region where text is located. The pooling layer can perform convolution transformations on the global features through multiple convolution kernels to obtain the pixel-level region screening parameters, then screen the text regions out of the multiple frame candidate regions according to the pixel-level region screening parameters, and further rotate inclined text regions to the horizontal position to obtain the pooled features of the text regions.
As shown in fig. 7, the global feature is transformed by a first convolution kernel, and a pixel-level classification confidence, i.e., a probability that each pixel in the original image belongs to text, is output. And the global features are transformed by a second convolution kernel, and the pixel-level frame distance, namely the predicted distance from each pixel point to the upper, lower, left and right sides of the text frame where the pixel point is located, is output. And the global features are transformed by a third convolution kernel, and a pixel-level rotation angle, namely the angle required to rotate when each pixel point rotates to the horizontal position, is output.
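A sketch, assuming PyTorch, of these three 1 × 1 convolution heads over the global features; the channel counts follow the text (1 confidence channel, 4 border distances, 1 rotation angle), while the activations and the module name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ScreeningParamHeads(nn.Module):
    def __init__(self, feat_ch):
        super().__init__()
        self.confidence = nn.Conv2d(feat_ch, 1, kernel_size=1)  # probability each pixel belongs to text
        self.distances = nn.Conv2d(feat_ch, 4, kernel_size=1)   # distances to top/bottom/left/right of the border
        self.angle = nn.Conv2d(feat_ch, 1, kernel_size=1)        # rotation needed to reach the horizontal position

    def forward(self, global_features):
        score = torch.sigmoid(self.confidence(global_features))
        dists = torch.relu(self.distances(global_features))      # distances are non-negative
        theta = self.angle(global_features)
        return score, dists, theta
```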
In an exemplary embodiment, as shown in fig. 8, the step 353 specifically includes:
in step 3531, a pixel-level classification confidence degree generated by the pooling layer performing convolution calculation on the global features is obtained, where the pixel-level classification confidence degree refers to a probability that each pixel in the image belongs to a text region;
specifically, the pooling layer may perform convolution calculation on the global feature (feature image) by a convolution kernel with a size of 1 × 1 and a step size of 1, and output a confidence prediction result that each pixel belongs to the text, thereby obtaining a pixel-level classification confidence. The high confidence pixel point indicates that the pixel point has a high probability of belonging to the text region, and similarly, the low confidence pixel point indicates that the pixel point has a low probability of belonging to the text region.
In step 3532, the text region is filtered out of the frame candidate region according to the pixel-level classification confidence and the intersection proportion of the frame candidate region;
the intersection proportion of the frame candidate regions refers to the overlapping proportion of different frame candidate regions. Because the noise frame exists in the frame candidate region, the method performs non-maximum suppression on the detection result of the frame candidate region according to the pixel level classification confidence coefficient and the intersection proportion of the frame candidate region, thereby screening the text region from the frame candidate region and improving the accuracy of text region detection.
Specifically, a non-maximum suppression algorithm is used: according to the pixel-level classification confidences, frame candidate regions with high confidence are retained, frame candidate regions that do not overlap are retained, and overlapping frame candidate regions with a low intersection ratio are retained, so that the text regions are obtained by screening from all the frame candidate regions.
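The following simplified sketch of non-maximum suppression keeps candidates from the highest confidence downward and drops any candidate whose intersection ratio with an already-kept candidate exceeds a threshold. For clarity it uses plain axis-aligned boxes in NumPy, whereas the patent's candidates are oriented text borders.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,). Returns indices of kept boxes."""
    order = scores.argsort()[::-1]          # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the kept box with the remaining candidates.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # keep only candidates with a low intersection ratio
    return keep
```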
In step 3533, the text region is rotated to a horizontal position by an interpolation algorithm according to a pixel-level rotation angle and a pixel-level border distance generated by convolution calculation of the global features by the pooling layer, so as to obtain the pooled features of the text region.
It should be noted that, when the pooling layer obtains the pixel-level classification confidence, it may simultaneously perform convolution calculation on the global features to obtain the pixel-level rotation angle and the pixel-level border distance. As explained above, the pixel-level rotation angle is the angle through which each pixel point needs to be rotated to reach the horizontal position, and the pixel-level border distance is the predicted distance from each pixel point to the upper, lower, left and right sides of the text border in which it is located. Specifically, the pooling layer may perform convolution calculation on the global features through a convolution kernel with a size of 1 × 1 and a stride of 4, and output the distances from each pixel point to the upper, lower, left and right sides of the text border in which it is located. The pooling layer may likewise perform convolution calculation on the global features through a convolution kernel with a size of 1 × 1 and a stride of 4, and output the angle through which each pixel point must be rotated to reach the horizontal position.
Therefore, the pooling layer can rotate the inclined text region to the horizontal direction according to the pixel point rotation angle and the pixel level frame distance, and the pooling feature of the text region can be image data of the text region after the text region is rotated to the horizontal direction.
Specifically, rotating the detected text region to the horizontal position requires interpolation through the pooling layer, which converts the text region at its original angle to the horizontal position so that it can be used by the recognition branch. The interpolation determines the correspondence between original points and target points through a transformation matrix T, whose parameters are calculated as follows:
v_ratio = roi_h / (t + b)
where v_ratio represents the ratio of the height roi_h of the transformed text region map to the sum of the distances from the current point to the upper and lower boundaries of the predicted text region; roi_h is a preset known quantity.
roi_w = v_ratio × (l + r)
where roi_w represents the width of the transformed text region map.
d_x = l × cos(π_i) − t × sin(π_i) − x
d_y = l × cos(π_i) + t × sin(π_i) − y
where r, l, t, b are the distances, predicted by the detection branch, from the current pixel point to the right, left, upper and lower boundaries of the text border (i.e. the pixel-level border distances), π_i represents the tilt angle of the current pixel predicted by the detection branch (i.e. the pixel-level rotation angle), and (x, y) is the coordinate position of the current pixel point in the original image. Assuming that the point before transformation is Psrc(x_s, y_s) and the point after transformation is Pdst(x_d, y_d), the feature mapping position before transformation is left-multiplied by the transformation matrix T to obtain the transformed feature mapping position, which completes the coordinate interpolation and realizes the horizontal rotation of the text region.
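As an illustration of the same leveling idea outside the network, the sketch below uses OpenCV to rotate a tilted region to the horizontal position and crop it out. In the patent this is done inside the pooling layer by interpolation with the matrix T, so the function below is only an analogy; the centre, angle and output size would come from the pixel-level predictions.

```python
import cv2

def rotate_region_to_horizontal(image, center, angle_deg, out_w, out_h):
    """Rotate `image` about `center` so the tilted region becomes horizontal, then crop it.

    `angle_deg` is the region's predicted tilt (sign depends on the chosen angle convention)."""
    M = cv2.getRotationMatrix2D(center, angle_deg, 1.0)               # 2x3 affine matrix
    rotated = cv2.warpAffine(image, M, (image.shape[1], image.shape[0]))
    x = max(int(center[0] - out_w / 2), 0)
    y = max(int(center[1] - out_h / 2), 0)
    return rotated[y:y + out_h, x:x + out_w]
```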
It should be emphasized that, unlike existing character recognition methods, in which character recognition is completed by passing the detection result output by a detection model to a recognition model, in the present invention detection serves as one learning branch of the model and is responsible for optimizing the feature map (i.e. the pooled features) that is finally fed to the recognition branch. By means of numerical sampling, the detection result (i.e. the detected text region) is converted, within the same model, into a feature map that the recognition branch can use directly, so that the detection and recognition tasks are learned and trained simultaneously.
In an exemplary embodiment, the identification branch network layer in the step 370 includes a time convolution network layer and a character classification layer, as shown in fig. 9, the step 370 specifically includes:
in step 371, backward propagating the pooled features to the temporal convolution network layer for extracting character features;
the backward propagation refers to transmitting the pooled features output by the pooled layer to a time convolution network layer, and performing convolution transformation on the pooled features through the time convolution network layer to extract character sequence features. Unlike the existing CTC (connection temporal classification based on neural networks) or Attention network structure, the present invention uses TCN (time convolutional network) as a part of identifying branch network layer, which has the following advantages: the training and testing time of the network is greatly shortened because large-scale parallel can be carried out in the TCN; the TCN can flexibly adjust the magnitude of the receptive field by determining how many convolution layers are stacked, so that the long-term and short-term memory length of the model can be better controlled in an explicit mode, and the CTC or Attention recognition model cannot control the long-term and short-term memory length due to the fact that the internal cycle times of the model cannot be estimated in the CTC or Attention recognition model; the propagation direction of the TCN is different from the time direction of the input sequence, so that the problem of gradient explosion or disappearance frequently caused by RNN model training is solved; the TCN consumes lower memory, is more obviously represented on a long input sequence, and reduces the deployment and application expenses of the model.
In step 372, the extracted character features are input into the character classification layer, and the character sequence of the text region mark is output through the character classification layer.
The character features are character sequence features, the extracted character sequence features are input into the character classification layer, the probability that each character in the text region belongs to each character in the dictionary can be output, the character with the maximum probability in the dictionary is found out, the character is the recognition result of the character in the text region, and therefore the character sequence marked in the text region is obtained.
FIG. 10 is a schematic diagram of the structure of the recognition branch network layer. As shown in fig. 10, the pooled features output by the pooling layer undergo 4 temporal convolution operations, and the input of each convolutional layer is subjected to dilated causal convolution, weight normalization, activation-function transformation and random dropout to obtain the output of the current convolutional layer. The filter size k of the first convolution operation is 3 and the dilation factor d of the convolution kernel is 1; the filter size k of the second convolution operation is 3 and the dilation factor d is 1; the filter size k of the third convolution operation is 3 and the dilation factor d is 2; the filter size k of the fourth convolution operation is 1 and the dilation factor d is 4. Thereafter, character sequence features, i.e. the features of each character, are extracted through a bidirectional LSTM (long short-term memory network). A bidirectional LSTM is preferred over a unidirectional LSTM because it can use information from both past and future time steps, making the final prediction more accurate. The output of the bidirectional LSTM may be a 512-dimensional feature vector, and the output features are then classified by a CTC decoder of the character classification layer into 7439 classes. The 7439 classes indicate that there are 7439 characters in the dictionary, so that the output features can be classified into one of those 7439 characters.
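A sketch of this recognition-branch structure, assuming PyTorch: stacked dilated causal 1-D convolutions with weight normalization, activation and dropout using the (k, d) values from the text, followed by a bidirectional LSTM with a 512-dimensional output and a linear layer over the 7439-class dictionary. The channel width, dropout rate and use of ReLU are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvBlock(nn.Module):
    def __init__(self, ch, k, d, p_drop=0.1):
        super().__init__()
        self.pad = (k - 1) * d                      # left padding keeps the convolution causal
        self.conv = nn.utils.weight_norm(nn.Conv1d(ch, ch, k, dilation=d))
        self.act = nn.ReLU()
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):                           # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))
        return self.drop(self.act(self.conv(x)))

class RecognitionBranch(nn.Module):
    def __init__(self, ch=256, num_classes=7439):
        super().__init__()
        # (k, d) pairs taken from the description: (3, 1), (3, 1), (3, 2), (1, 4)
        self.tcn = nn.Sequential(*[CausalConvBlock(ch, k, d) for k, d in [(3, 1), (3, 1), (3, 2), (1, 4)]])
        self.lstm = nn.LSTM(ch, 256, bidirectional=True, batch_first=True)  # 2 x 256 = 512-dim output
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, pooled):                      # pooled: (batch, channels, time)
        seq = self.tcn(pooled).transpose(1, 2)      # -> (batch, time, channels)
        seq, _ = self.lstm(seq)
        return self.classifier(seq)                 # per-step scores over the dictionary
```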
Fig. 11 is a schematic diagram of a network model architecture for text recognition in an image according to the present invention, and as shown in fig. 11, an original image is first input into a spatial separable convolutional network layer, global features are extracted from the original image through the spatial separable convolutional network layer, and then the global features are respectively input into a regional regression network layer and a pooling layer, and the regional regression network layer detects a frame candidate region according to the global features. In the training stage, the detected candidate frame and the confidence coefficient of the candidate frame can be obtained through secondary frame regression and frame classification, the multitask loss is calculated according to the position of the text frame, and the parameter of the regional regression network layer is adjusted to minimize the multitask loss. The frame candidate region output by the regional regression network layer is input into the pooling layer, and the pooling layer can perform screening and leveling on the frame candidate region according to the global feature input by the spatial separable convolutional network layer and the frame candidate region input by the regional regression network layer to obtain a leveled text region feature, namely a pooling feature. And inputting the horizontal text region characteristics into the time convolution network layer, extracting character sequence characteristics, inputting the character sequence characteristics into a character classifier, and outputting a character recognition result of the text in the image.
In an exemplary embodiment, as shown in fig. 12, the method provided by the present invention further includes:
in step 1210, a sample image set in which text information is recorded on an image is obtained, wherein the content of the text information is known;
the sample image set comprises a large number of image samples, the image samples are marked with text information, and the specific content of the text information is known. The sample image set may be stored in a local storage medium of the user device 110 or may be stored in the server 130.
In step 1230, the set of sample images is used to train the network model, and the parameters of the network model are adjusted to minimize the difference between the character sequence of each sample image output by the network model and the corresponding text information.
Specifically, the sample image set can be used as a training set to train a network model required for text recognition in the image. Specifically, the sample image set may be used as an input of the network model, and parameters of the network model may be adjusted according to an output of the network model, so as to minimize a difference between a character sequence recognition result of the sample image set output by the network model and known text information. For example, the similarity may be maximized by calculating the similarity between the character sequence recognition result and the known text information.
In an exemplary embodiment, as shown in fig. 13, the step 1230 specifically includes:
in step 1231, obtaining a text recognition error of the network model according to an error generated by text region detection performed by the network model and an error generated by executing a character recognition operation;
the network model is divided into two tasks of text region detection and character recognition operation. The text recognition error of the network model refers to the recognition error of the whole framework of the network model. The error may be the sum of an error generated by text region detection and an error generated by character recognition. The error generated by the text region detection may be an error existing in the detected text region before the output of the pooled feature, and the error generated by the character recognition operation may be an error generated by performing classification recognition on the characters in the text region after the output of the pooled feature.
In step 1232, according to the text recognition error, the network layer parameters for the network model to perform the text region detection and the network layer parameters for performing the character recognition operation are adjusted by back propagation, so that the text recognition error is minimized.
Back propagation here means adjusting the parameters of the earlier network layers according to the later recognition result. Specifically, according to the recognition error of the whole network model framework, i.e. the error of the finally output character sequence, the network layer parameters of the preceding text region detection task and the network layer parameters for performing the character recognition operation are adjusted, so that the error between the finally output character sequence and the real character sequence is minimized. In this way, the error generated in the recognition stage can be transmitted to the detection part to correct the parameters of the detection stage.
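A minimal training-step sketch, assuming PyTorch, of this joint back-propagation: one combined text-recognition error, formed from the detection error and the recognition error, is back-propagated once, so gradients from the recognition stage also update the detection layers. The model, the two loss functions and the weighting value are hypothetical placeholders.

```python
import torch

def train_step(model, optimizer, images, region_labels, text_labels,
               detection_loss_fn, recognition_loss_fn, eps_recognition=1.0):
    optimizer.zero_grad()
    # model is assumed here to return both branches' outputs.
    detections, char_scores = model(images)
    loss_det = detection_loss_fn(detections, region_labels)      # error from text region detection
    loss_rec = recognition_loss_fn(char_scores, text_labels)     # error from character recognition
    total = loss_det + eps_recognition * loss_rec                # L_total = L_Detection + eps * L_recognition
    total.backward()                                             # gradients reach both branches
    optimizer.step()
    return total.item()
```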
In an exemplary embodiment, as shown in fig. 14, the step 1231 specifically includes:
in step 1401, determining an error generated by the network model for text region detection according to an error generated by the network model for pixel-level classification prediction, an error generated by pixel-level border distance prediction, and an error generated by pixel-level rotation angle prediction;
the error generated by the pixel-level classification prediction refers to an error between the pixel-level classification confidence and a classification result of an actual pixel point belonging to a text region. The error generated by the pixel-level frame early warning prediction refers to the error between the actual distance and the prediction distance between the upper part, the lower part, the left part and the right part of the text frame where each pixel point is located, and the pixel-level rotation angle prediction refers to the error between the actual rotation angle and the prediction rotation angle when the pixel points rotate to the horizontal position.
Specifically, the error generated by the text region detection performed by the network model is represented as L_Detection:

L_Detection = L_cls + α·L_geo_reg

where L_Detection is the total loss function of the detection branch (text region detection); L_cls is the loss function of the pixel-level classification confidence in the detection branch, i.e. the error generated by the pixel-level classification prediction; L_geo_reg is the loss function of the pixel-level geometry, covering the distances from each pixel point to the upper, lower, left and right borders of the text box and the rotation angle, i.e. the error between the predicted and actual border distances and rotation angles; and α is the proportion of L_geo_reg in the total loss of the detection branch.

In L_cls, N is the number of positive-valued elements in the confidence-map prediction matrix, u_i* marks whether the current pixel is a character (taking the value 0 or 1), and u_i is the predicted value of whether the current pixel is a character (taking the value 0 or 1).

In L_geo_reg, N is likewise the number of positive-valued elements in the confidence-map prediction matrix, θ_i represents the predicted pixel-level rotation angle, θ_i* represents the labelled pixel-level rotation angle, and β represents the proportion of the angle loss in L_geo_reg. The border term is the IOU loss between the four predicted geometric quantities B_i (the distances to the upper, lower, left and right boundaries of the text box) and the four labelled geometric quantities B_i* (the distances to the upper, lower, left and right boundaries of the text box); the IOU loss is defined in terms of B_i ∩ B_i*, the intersection of the two text boxes, and B_i ∪ B_i*, their union. (The detailed expressions of L_cls, the angle loss and the IOU loss appear as formula drawings in the original specification.)
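Because the exact loss expressions are only given as drawings in the original filing, the following Python sketch is one plausible EAST-style instantiation rather than the patented formulas: a classification term over the confidence map, an IoU term computed from the four border distances, and a cosine angle term, combined with the weights α and β. All function and tensor names are assumptions.

    import torch

    def detection_loss(cls_pred, cls_gt, geo_pred, geo_gt, theta_pred, theta_gt,
                       pos_mask, alpha=1.0, beta=10.0):
        """One possible form of L_Detection = L_cls + alpha * L_geo_reg (illustrative only)."""
        n = pos_mask.sum().clamp(min=1.0)

        # L_cls: binary cross-entropy over the confidence map, normalized by the
        # number of positive-valued elements (assumed form; cls_pred in [0, 1]).
        l_cls = torch.nn.functional.binary_cross_entropy(cls_pred, cls_gt, reduction="sum") / n

        # IoU loss from the four border distances (top, bottom, left, right) per pixel.
        t_p, b_p, l_p, r_p = geo_pred.unbind(dim=1)
        t_g, b_g, l_g, r_g = geo_gt.unbind(dim=1)
        area_pred = (t_p + b_p) * (l_p + r_p)
        area_gt = (t_g + b_g) * (l_g + r_g)
        h_inter = torch.min(t_p, t_g) + torch.min(b_p, b_g)
        w_inter = torch.min(l_p, l_g) + torch.min(r_p, r_g)
        inter = h_inter * w_inter                       # |B_i ∩ B_i*|
        union = area_pred + area_gt - inter             # |B_i ∪ B_i*|
        l_iou = -torch.log((inter + 1.0) / (union + 1.0))

        # Angle loss between predicted and labelled rotation angles (assumed cosine form).
        l_angle = 1.0 - torch.cos(theta_pred - theta_gt)

        l_geo_reg = ((l_iou + beta * l_angle) * pos_mask).sum() / n
        return l_cls + alpha * l_geo_reg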
In step 1402, the error generated by text region detection performed by the network model and the error generated by performing character recognition operation are added in a weighted manner to obtain the text recognition error of the network model.
Specifically, the loss function of the entire network model, i.e., the text recognition error of the network model, is expressed as follows:

L_total = L_Detection + ε_recognition·L_recognition

where L_Detection is the loss generated by the detection branch, L_recognition is the loss generated by the recognition branch, i.e. the error generated by performing the character recognition operation, and ε_recognition is the proportion of the recognition-branch loss in the total loss of the model, which controls the contribution of the recognition branch to the optimization of the whole model. The loss generated by the detection branch has already been calculated in step 1401. The loss generated by the recognition branch is accumulated over the R regions to be recognized, based on the recognition label of each region and the input of the currently recognized region ρ. Here c* is the character-level label sequence, c* = {c_0, ..., c_{L-1}}, where L is the length of the label sequence and is at most 7439, 7439 being the number of characters in the dictionary; only characters in the dictionary can be recognized. (The detailed expressions of L_recognition and of the per-region sequence probability appear as formula drawings in the original specification.)
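The recognition-branch loss is likewise only shown as formula drawings; the sketch below uses a CTC sequence loss as one common way to score a predicted character sequence against the character-level label sequence c* over a fixed dictionary. The dictionary size handling, the blank index and the choice of CTC itself are assumptions for illustration.

    import torch

    DICT_SIZE = 7439                      # number of characters in the dictionary (per the description)
    ctc = torch.nn.CTCLoss(blank=DICT_SIZE, zero_infinity=True)  # extra class used as CTC blank (assumption)

    def recognition_loss(log_probs_per_region, labels_per_region):
        """Average sequence loss over the R regions to be recognized (illustrative only).

        log_probs_per_region: list of tensors of shape (T, DICT_SIZE + 1) with log-softmax scores
        labels_per_region:    list of 1-D tensors of character indices c* = {c_0, ..., c_{L-1}}
        """
        losses = []
        for log_probs, labels in zip(log_probs_per_region, labels_per_region):
            t = log_probs.size(0)
            losses.append(ctc(log_probs.unsqueeze(1),              # (T, 1, C) as expected by CTCLoss
                              labels.unsqueeze(0),                 # (1, L)
                              input_lengths=torch.tensor([t]),
                              target_lengths=torch.tensor([labels.numel()])))
        return torch.stack(losses).mean()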
It should be noted that, in the detection task, the loss function of the bounding box regression adopts an IOU (Intersection over Union) loss function, which has the following advantages over an L2 loss: the four coordinates of the bounding box are learned and optimized as a whole, which reduces the training difficulty of the model, improves the detection accuracy and learning speed of the model, and at the same time enhances adaptability to diverse samples.
The solution provided by the invention can support web API (application programming interface) service calls and mobile terminal deployment, and as shown in fig. 15, by adopting the technical solution provided by the invention, the specific character content can be recognized directly from the original image and output.
The following is an embodiment of an apparatus of the present invention, which may be used to execute an embodiment of a method for recognizing a text in an image executed by the user equipment 110 according to the present invention. For details that are not disclosed in the embodiments of the apparatus of the present invention, please refer to the embodiments of the method for recognizing text in an image of the present invention.
Fig. 16 is a block diagram illustrating an apparatus for recognizing text in an image according to an exemplary embodiment, which may be used in the user equipment 110 in the implementation environment shown in fig. 1 to perform all or part of the steps of the method for recognizing text in an image shown in any one of fig. 3, 6, 8, 9, 12-14. The device performs end-to-end recognition of text in an image through a network model with multiple layers of superposition, as shown in fig. 16, and the device includes but is not limited to: a spatial convolution operation module 1610, a global feature extraction module 1630, a pooled feature obtaining module 1650 and a character sequence output module 1670.
A spatial convolution operation module 1610, configured to perform spatial separable convolution operation on an image layer by layer in a multi-layer manner, and fuse convolution features extracted by the spatial separable convolution operation to a lower layer mapped by layer-by-layer superposition, where the lower layer is mapped to a higher layer outputting the convolution features;
a global feature extraction module 1630 configured to obtain global features from the lowest layer that performs the spatially separable convolution operation;
a pooling feature obtaining module 1650, configured to perform candidate region detection and region screening parameter prediction on a text in the image according to the global feature, so as to obtain a pooling feature corresponding to the detected text region;
a character sequence output module 1670, configured to propagate the pooled features backward to a recognition branch network layer that performs a character recognition operation, and output the character sequence of the text region tag through the recognition branch network layer.
The implementation processes of the functions and actions of the modules in the device are specifically described in the implementation processes of the corresponding steps in the method for recognizing the text in the image, and are not described herein again.
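As a rough orientation only, the following PyTorch-style sketch mirrors how the four modules could be chained in code; the class names, shapes and layer choices are illustrative assumptions rather than the patented structure.

    import torch

    class TextSpotter(torch.nn.Module):
        """Illustrative end-to-end pipeline: backbone -> global features -> pooled regions -> characters."""

        def __init__(self, backbone, region_head, pooling, recognition_branch):
            super().__init__()
            self.backbone = backbone                      # spatial convolution operation module
            self.region_head = region_head                # candidate region detection / screening parameters
            self.pooling = pooling                        # region screening + rotation to horizontal
            self.recognition_branch = recognition_branch  # temporal convolution + character classification

        def forward(self, image):
            global_features = self.backbone(image)        # global feature extraction module
            candidates, screen_params = self.region_head(global_features)
            pooled = self.pooling(global_features, candidates, screen_params)  # pooled feature obtaining module
            return self.recognition_branch(pooled)        # character sequence output module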
The spatial convolution operation module 1610 may be, for example, one of the physical structure processors 218 in fig. 2.
The global feature extraction module 1630, the pooled feature obtaining module 1650 and the character sequence output module 1670 may also be functional modules, and are configured to execute corresponding steps in the method for recognizing text in an image. It is understood that these modules may be implemented in hardware, software, or a combination of both. When implemented in hardware, these modules may be implemented as one or more hardware modules, such as one or more application specific integrated circuits. When implemented in software, the modules may be implemented as one or more computer programs executing on one or more processors, such as the programs stored in memory 204 and executed by processor 218 of FIG. 2.
Optionally, as shown in fig. 17, the pooled feature obtaining module 1650 includes, but is not limited to:
a candidate region output unit 1651 configured to input the global feature to a regional regression network layer that performs candidate region detection, and output a bounding box candidate region of a text in the image through the regional regression network layer;
a pooling input unit 1652 for inputting the bounding box candidate region into a pooling layer where region filtering and region rotation are performed;
a screening rotation unit 1653, configured to screen out the text region from the frame candidate region according to the pixel-level region screening parameter obtained by the pooling layer performing region screening parameter prediction on the global feature, and rotate the text region to a horizontal position, so as to obtain the pooled feature of the text region.
Optionally, as shown in fig. 18, the screening rotation unit 1653 includes, but is not limited to:
a confidence obtaining subunit 1801, configured to obtain a pixel-level classification confidence that is generated by performing convolution calculation on the global feature by the pooling layer, where the pixel-level classification confidence is a probability that each pixel in the image belongs to a text region;
a candidate region screening subunit 1802, configured to screen out the text region from the frame candidate region according to the pixel-level classification confidence and the intersection proportion of the frame candidate region;
a text region rotation subunit 1803, configured to rotate the text region to a horizontal position through an interpolation algorithm according to a pixel-level rotation angle and a pixel-level border distance generated by performing convolution calculation on the global feature by the pooling layer, so as to obtain a pooling feature of the text region.
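To make the screening and rotation step concrete, the following OpenCV/NumPy sketch keeps only candidate boxes whose average pixel-level classification confidence is high enough, and then rotates each kept region to the horizontal position before cropping it; the confidence threshold, the box format and the use of cv2.warpAffine for the interpolation are assumptions for illustration, not the patented implementation.

    import cv2
    import numpy as np

    def screen_and_rotate(feature_map, candidates, confidence_map, angle_map, threshold=0.7):
        """candidates: list of (x, y, w, h) boxes on the feature map (illustrative format)."""
        pooled_regions = []
        for (x, y, w, h) in candidates:
            region_conf = confidence_map[y:y + h, x:x + w].mean()
            if region_conf < threshold:          # screen out low-confidence candidates
                continue
            angle = float(np.degrees(angle_map[y:y + h, x:x + w].mean()))
            center = (x + w / 2.0, y + h / 2.0)
            rot = cv2.getRotationMatrix2D(center, angle, 1.0)
            # Rotate the map about the region centre so the region becomes horizontal (bilinear interpolation),
            # then crop the now-horizontal region as its pooled feature.
            rotated = cv2.warpAffine(feature_map, rot,
                                     (feature_map.shape[1], feature_map.shape[0]),
                                     flags=cv2.INTER_LINEAR)
            pooled_regions.append(rotated[y:y + h, x:x + w])
        return pooled_regions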
Optionally, the recognition branch network layer includes a time convolution network layer and a character classification layer, and the character sequence output module 1670 includes but is not limited to:
the character feature extraction unit is used for backward propagating the pooled features to the time convolution network layer to extract character features;
and the character classification unit is used for inputting the extracted character features into the character classification layer and outputting the character sequence of the text region mark through the character classification layer.
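The sketch below gives a minimal recognition branch in PyTorch, assuming the pooled region feature is treated as a sequence along its width, passed through 1-D (temporal) convolutions for character feature extraction and then through a linear character classification layer; the kernel sizes, channel counts and dictionary size are assumptions for illustration.

    import torch

    class RecognitionBranch(torch.nn.Module):
        def __init__(self, in_channels=256, hidden=256, num_classes=7440):  # 7439 characters + 1 blank (assumption)
            super().__init__()
            # Time convolution network layer: 1-D convolutions over the width (time) axis.
            self.temporal_conv = torch.nn.Sequential(
                torch.nn.Conv1d(in_channels, hidden, kernel_size=3, padding=1),
                torch.nn.ReLU(),
                torch.nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
                torch.nn.ReLU(),
            )
            # Character classification layer: per-step scores over the character dictionary.
            self.classifier = torch.nn.Linear(hidden, num_classes)

        def forward(self, pooled):                 # pooled: (N, C, H, W) feature of a horizontal text region
            seq = pooled.mean(dim=2)               # collapse the height axis -> (N, C, W)
            seq = self.temporal_conv(seq)          # character features along the width axis
            seq = seq.permute(0, 2, 1)             # (N, W, hidden)
            return self.classifier(seq).log_softmax(dim=-1)  # (N, W, num_classes) character sequence scores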
Optionally, the apparatus further includes but is not limited to:
the system comprises a sample set acquisition module, a sample image acquisition module and a text information acquisition module, wherein the sample image set is used for acquiring a sample image set recorded with text information on an image, and the content of the text information is known;
and the model training module is used for training the network model by utilizing the sample image set and enabling the difference between the character sequence of each sample image output by the network model and the corresponding text information to be minimum by adjusting the parameters of the network model.
Optionally, the model training module includes but is not limited to:
a model error obtaining unit, configured to obtain a text recognition error of the network model according to an error generated by performing text region detection on the network model and an error generated by performing a character recognition operation;
and the model parameter adjusting unit is used for adjusting the network layer parameters of the network model for text region detection and the network layer parameters for executing character recognition operation through back propagation according to the text recognition error so as to minimize the text recognition error.
Optionally, the model error obtaining unit includes but is not limited to:
the detection error determining subunit is used for determining an error generated by the network model for text region detection according to an error generated by pixel-level classification prediction of the network model, an error generated by pixel-level frame distance prediction and an error generated by pixel-level rotation angle prediction;
and the error fusion subunit is used for carrying out weighted addition on the error generated by text region detection of the network model and the error generated by executing character recognition operation to obtain the text recognition error of the network model.
Optionally, the present invention further provides an electronic device, which may be used in the user equipment 110 in the implementation environment shown in fig. 1 to execute all or part of the steps of the method for recognizing text in an image shown in any one of fig. 3, fig. 6, fig. 8, fig. 9, fig. 12 to fig. 14. The electronic device includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the method of recognizing text in an image according to the above exemplary embodiment.
The specific manner in which the processor of the electronic device performs the operations in this embodiment has been described in detail in the embodiment related to the method for recognizing text in the image, and will not be elaborated upon here.
In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium, for example a transitory or non-transitory computer-readable storage medium including instructions. The storage medium stores a computer program executable by the processor 218 of the apparatus 200 to perform the above-described method of recognizing text in an image.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (15)

1. A method for recognizing text in an image, the method performing end-to-end recognition of text in an image through a network model with multiple layers superimposed, the method comprising:
performing spatial separable convolution operation of the image layer by layer in a multilayer mode, fusing convolution features extracted by the spatial separable convolution operation to a lower layer mapped by layer-by-layer superposition, and mapping the lower layer with a higher layer outputting the convolution features;
obtaining global features from a bottom layer performing a spatially separable convolution operation;
performing candidate region detection and region screening parameter prediction of a text in the image through the global features to obtain pooling features corresponding to the detected text regions;
and backward propagating the pooled features to a recognition branch network layer which executes character recognition operation, and outputting the character sequence of the text region mark through the recognition branch network layer.
2. The method according to claim 1, wherein the candidate region detection and region screening parameter prediction of the text in the image are performed through the global feature, and obtaining the pooled feature corresponding to the detected text region comprises:
inputting the global features into a regional regression network layer for executing candidate region detection, and outputting frame candidate regions of texts in the images through the regional regression network layer;
inputting the frame candidate area into a pooling layer for performing area screening and area rotation;
and according to pixel level region screening parameters obtained by predicting region screening parameters of the global features by the pooling layer, screening the text region from the frame candidate region and rotating the text region to a horizontal position to obtain the pooling features of the text region.
3. The method according to claim 2, wherein the obtaining the pooled feature of the text region by filtering out the text region from the frame candidate region and rotating the text region to a horizontal position according to a pixel-level region filtering parameter obtained by predicting a region filtering parameter of the global feature by the pooling layer comprises:
obtaining a pixel-level classification confidence coefficient generated by the pooling layer through convolution calculation of the global features, wherein the pixel-level classification confidence coefficient is the probability of each pixel in the image belonging to a text region;
screening out the text region from the frame candidate region according to the pixel-level classification confidence coefficient and the intersection proportion of the frame candidate region;
and according to the pixel level rotation angle and the pixel level frame distance generated by performing convolution calculation on the global feature by the pooling layer, rotating the text region to a horizontal position by an interpolation algorithm to obtain the pooling feature of the text region.
4. The method of claim 1, wherein the recognition branching network layer comprises a temporal convolution network layer and a character classification layer, wherein the back-propagating the pooled features to the recognition branching network layer performing a character recognition operation, wherein outputting the sequence of characters of the text region label through the recognition branching network layer comprises:
backward propagating the pooled features to the time convolution network layer to extract character features;
inputting the extracted character features into the character classification layer, and outputting the character sequence of the text region marks through the character classification layer.
5. The method of claim 1, further comprising:
acquiring a sample image set recorded with text information on an image, wherein the content of the text information is known;
and training the network model by using the sample image set, and minimizing the difference between the character sequence of each sample image output by the network model and the corresponding text information by adjusting the parameters of the network model.
6. The method of claim 5, wherein the training of the network model using the sample image set to minimize the difference between the character sequence of each sample image output by the network model and the corresponding text information by adjusting parameters of the network model comprises:
acquiring a text recognition error of the network model according to an error generated by text region detection of the network model and an error generated by executing character recognition operation;
and adjusting the network layer parameters of the network model for text region detection and the network layer parameters for executing character recognition operation through back propagation according to the text recognition error, so that the text recognition error is minimized.
7. The method of claim 6, wherein the obtaining the text recognition error of the network model from the error generated by text region detection according to the network model and the error generated by performing a character recognition operation comprises:
determining an error generated by the network model for text region detection according to an error generated by pixel-level classification prediction of the network model, an error generated by pixel-level frame distance prediction and an error generated by pixel-level rotation angle prediction;
and carrying out weighted addition on an error generated by text region detection of the network model and an error generated by executing character recognition operation to obtain a text recognition error of the network model.
8. An apparatus for recognizing a text in an image, the apparatus performing an end-to-end recognition of the text in the image through a network model in which a plurality of layers are stacked, the apparatus comprising:
the spatial convolution operation module is used for carrying out spatial separable convolution operation on the image layer by layer in a multilayer mode, fusing convolution features extracted by the spatial separable convolution operation to a lower layer mapped by layer-by-layer superposition, and mapping the lower layer with a higher layer outputting the convolution features;
a global feature extraction module for obtaining global features from a lowest layer performing a spatially separable convolution operation;
the pooling feature obtaining module is used for detecting candidate regions of texts in the images and predicting region screening parameters through the global features to obtain pooling features corresponding to the detected text regions;
and the character sequence output module is used for backward propagating the pooled features to a recognition branch network layer for executing character recognition operation, and outputting the character sequence of the text region mark through the recognition branch network layer.
9. The apparatus of claim 8, wherein the pooled feature acquisition module comprises:
the candidate region output unit is used for inputting the global features into a regional regression network layer for executing candidate region detection and outputting frame candidate regions of texts in the images through the regional regression network layer;
a pooling input unit for inputting the bounding box candidate region into a pooling layer for performing region screening and region rotation;
and the screening rotation unit is used for screening the text region from the frame candidate region and rotating the text region to a horizontal position according to a pixel level region screening parameter obtained by predicting a region screening parameter of the global feature by the pooling layer, so as to obtain the pooling feature of the text region.
10. The apparatus of claim 9, wherein the screening rotation unit comprises:
a confidence obtaining subunit, configured to obtain a pixel-level classification confidence generated by performing convolution calculation on the global features by the pooling layer, where the pixel-level classification confidence is a probability that each pixel in the image belongs to a text region;
a candidate region screening subunit, configured to screen out the text region from the frame candidate region according to the pixel-level classification confidence and the intersection proportion of the frame candidate region;
and the text region rotation subunit is configured to rotate the text region to a horizontal position through an interpolation algorithm according to a pixel-level rotation angle and a pixel-level border distance, which are generated by performing convolution calculation on the global feature by the pooling layer, so as to obtain the pooled feature of the text region.
11. The apparatus of claim 8, wherein the recognition branching network layer comprises a time convolution network layer and a character classification layer, and wherein the character sequence output module comprises:
the character feature extraction unit is used for backward propagating the pooled features to the time convolution network layer to extract character features;
and the character classification unit is used for inputting the extracted character features into the character classification layer and outputting the character sequence of the text region mark through the character classification layer.
12. The apparatus of claim 8, further comprising:
a sample set acquisition module, used for acquiring a sample image set recorded with text information on an image, wherein the content of the text information is known;
and the model training module is used for training the network model by utilizing the sample image set and enabling the difference between the character sequence of each sample image output by the network model and the corresponding text information to be minimum by adjusting the parameters of the network model.
13. The apparatus of claim 12, wherein the model training module comprises:
a model error obtaining unit, configured to obtain a text recognition error of the network model according to an error generated by text region detection performed by the network model and an error generated by performing a character recognition operation;
and the model parameter adjusting unit is used for adjusting the network layer parameters of the network model for text region detection and the network layer parameters for executing character recognition operation through back propagation according to the text recognition error so as to minimize the text recognition error.
14. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of recognizing text in an image according to any one of claims 1 to 7.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, which is executable by a processor to perform the method for recognizing text in an image according to any one of claims 1 to 7.
CN201811202558.2A 2018-10-16 2018-10-16 Method and device for recognizing text in image, electronic equipment and storage medium Active CN109271967B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811202558.2A CN109271967B (en) 2018-10-16 2018-10-16 Method and device for recognizing text in image, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811202558.2A CN109271967B (en) 2018-10-16 2018-10-16 Method and device for recognizing text in image, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109271967A CN109271967A (en) 2019-01-25
CN109271967B true CN109271967B (en) 2022-08-26

Family

ID=65196737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811202558.2A Active CN109271967B (en) 2018-10-16 2018-10-16 Method and device for recognizing text in image, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109271967B (en)

Families Citing this family (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109919014B (en) * 2019-01-28 2023-11-03 平安科技(深圳)有限公司 OCR (optical character recognition) method and electronic equipment thereof
CN109948469B (en) * 2019-03-01 2022-11-29 吉林大学 Automatic inspection robot instrument detection and identification method based on deep learning
CN111723627A (en) * 2019-03-22 2020-09-29 北京搜狗科技发展有限公司 Image processing method and device and electronic equipment
CN110119681B (en) * 2019-04-04 2023-11-24 平安科技(深圳)有限公司 Text line extraction method and device and electronic equipment
CN110059188B (en) * 2019-04-11 2022-06-21 四川黑马数码科技有限公司 Chinese emotion analysis method based on bidirectional time convolution network
CN110210581B (en) * 2019-04-28 2023-11-24 平安科技(深圳)有限公司 Handwriting text recognition method and device and electronic equipment
CN110135411B (en) * 2019-04-30 2021-09-10 北京邮电大学 Business card recognition method and device
CN110110652B (en) * 2019-05-05 2021-10-22 达闼科技(北京)有限公司 Target detection method, electronic device and storage medium
CN110175610B (en) * 2019-05-23 2023-09-05 上海交通大学 Bill image text recognition method supporting privacy protection
CN110135424B (en) * 2019-05-23 2021-06-11 阳光保险集团股份有限公司 Inclined text detection model training method and ticket image text detection method
CN110276345B (en) * 2019-06-05 2021-09-17 北京字节跳动网络技术有限公司 Convolutional neural network model training method and device and computer readable storage medium
CN110232713B (en) * 2019-06-13 2022-09-20 腾讯数码(天津)有限公司 Image target positioning correction method and related equipment
CN110414520A (en) * 2019-06-28 2019-11-05 平安科技(深圳)有限公司 Universal character recognition methods, device, computer equipment and storage medium
CN110458011A (en) * 2019-07-05 2019-11-15 北京百度网讯科技有限公司 Character recognition method and device, computer equipment and readable medium end to end
CN110442860A (en) * 2019-07-05 2019-11-12 大连大学 Name entity recognition method based on time convolutional network
CN110610175A (en) * 2019-08-06 2019-12-24 深圳市华付信息技术有限公司 OCR data mislabeling cleaning method
CN112258259A (en) * 2019-08-14 2021-01-22 北京京东尚科信息技术有限公司 Data processing method, device and computer readable storage medium
CN110533041B (en) * 2019-09-05 2022-07-01 重庆邮电大学 Regression-based multi-scale scene text detection method
CN110705547B (en) * 2019-09-06 2023-08-18 中国平安财产保险股份有限公司 Method and device for recognizing text in image and computer readable storage medium
CN110738203B (en) * 2019-09-06 2024-04-05 中国平安财产保险股份有限公司 Field structured output method, device and computer readable storage medium
CN110610166B (en) * 2019-09-18 2022-06-07 北京猎户星空科技有限公司 Text region detection model training method and device, electronic equipment and storage medium
CN110751146B (en) * 2019-10-23 2023-06-20 北京印刷学院 Text region detection method, device, electronic terminal and computer readable storage medium
CN110807459B (en) * 2019-10-31 2022-06-17 深圳市捷顺科技实业股份有限公司 License plate correction method and device and readable storage medium
CN111104941B (en) * 2019-11-14 2023-06-13 腾讯科技(深圳)有限公司 Image direction correction method and device and electronic equipment
CN111091123A (en) * 2019-12-02 2020-05-01 上海眼控科技股份有限公司 Text region detection method and equipment
CN111104934A (en) * 2019-12-22 2020-05-05 上海眼控科技股份有限公司 Engine label detection method, electronic device and computer readable storage medium
CN113128306A (en) * 2020-01-10 2021-07-16 北京字节跳动网络技术有限公司 Vertical text line recognition method, device, equipment and computer readable storage medium
CN111259773A (en) * 2020-01-13 2020-06-09 中国科学院重庆绿色智能技术研究院 Irregular text line identification method and system based on bidirectional decoding
CN111462095B (en) * 2020-04-03 2024-04-09 上海帆声图像科技有限公司 Automatic parameter adjusting method for industrial flaw image detection
CN111488883A (en) * 2020-04-14 2020-08-04 上海眼控科技股份有限公司 Vehicle frame number identification method and device, computer equipment and storage medium
CN111598087B (en) * 2020-05-15 2023-05-23 华润数字科技有限公司 Irregular character recognition method, device, computer equipment and storage medium
CN113762259A (en) * 2020-09-02 2021-12-07 北京沃东天骏信息技术有限公司 Text positioning method, text positioning device, computer system and readable storage medium
CN112798949A (en) * 2020-10-22 2021-05-14 国家电网有限公司 Pumped storage unit generator temperature early warning method and system
CN112101360B (en) * 2020-11-17 2021-04-27 浙江大华技术股份有限公司 Target detection method and device and computer readable storage medium
CN112508015A (en) * 2020-12-15 2021-03-16 山东大学 Nameplate identification method, computer equipment and storage medium
CN112580637B (en) * 2020-12-31 2023-05-12 苏宁金融科技(南京)有限公司 Text information identification method, text information extraction method, text information identification device, text information extraction device and text information extraction system
CN113076815B (en) * 2021-03-16 2022-09-27 西南交通大学 Automatic driving direction prediction method based on lightweight neural network
CN113537189A (en) * 2021-06-03 2021-10-22 深圳市雄帝科技股份有限公司 Handwritten character recognition method, device, equipment and storage medium
CN113591864B (en) * 2021-07-28 2023-04-07 北京百度网讯科技有限公司 Training method, device and system for text recognition model framework
CN114842464A (en) * 2022-05-13 2022-08-02 北京百度网讯科技有限公司 Image direction recognition method, device, equipment, storage medium and program product
CN115205861B (en) * 2022-08-17 2023-03-31 北京睿企信息科技有限公司 Method for acquiring abnormal character recognition area, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107305630A (en) * 2016-04-25 2017-10-31 腾讯科技(深圳)有限公司 Text sequence recognition methods and device
CN108345850A (en) * 2018-01-23 2018-07-31 哈尔滨工业大学 The scene text detection method of the territorial classification of stroke feature transformation and deep learning based on super-pixel

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016171901A1 (en) * 2015-04-20 2016-10-27 3M Innovative Properties Company Dual embedded optical character recognition (ocr) engines

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107305630A (en) * 2016-04-25 2017-10-31 腾讯科技(深圳)有限公司 Text sequence recognition methods and device
CN108345850A (en) * 2018-01-23 2018-07-31 哈尔滨工业大学 The scene text detection method of the territorial classification of stroke feature transformation and deep learning based on super-pixel

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
《EffNet: AN EFFICIENT STRUCTURE FOR CONVOLUTIONAL NEURAL NETWORKS》;Ido Freeman,et al;《Computer Vision and Pattern Recognition》;20180605;1-7 *
《Towards end-to-end text spotting with convolutional recurrent neural networks》;Hui Li,et al;《2017 IEEE International Conference on Computer Vision (ICCV)》;20171022;5238-5246 *

Also Published As

Publication number Publication date
CN109271967A (en) 2019-01-25

Similar Documents

Publication Publication Date Title
CN109271967B (en) Method and device for recognizing text in image, electronic equipment and storage medium
CN112052787B (en) Target detection method and device based on artificial intelligence and electronic equipment
US10007867B2 (en) Systems and methods for identifying entities directly from imagery
WO2022213879A1 (en) Target object detection method and apparatus, and computer device and storage medium
US10902056B2 (en) Method and apparatus for processing image
US20190171903A1 (en) Optimizations for Dynamic Object Instance Detection, Segmentation, and Structure Mapping
WO2020151167A1 (en) Target tracking method and device, computer device and readable storage medium
WO2018224873A1 (en) Method and system for close loop perception in autonomous driving vehicles
CN109584276A (en) Critical point detection method, apparatus, equipment and readable medium
US10943151B2 (en) Systems and methods for training and validating a computer vision model for geospatial imagery
CN109189879B (en) Electronic book display method and device
US20190340746A1 (en) Stationary object detecting method, apparatus and electronic device
WO2018224877A1 (en) Method and system for integrated global and distributed learning in autonomous driving vehicles
US20200357131A1 (en) Methods and Systems for Detecting and Assigning Attributes to Objects of Interest in Geospatial Imagery
US20200143238A1 (en) Detecting Augmented-Reality Targets
US20230186517A1 (en) Method, apparatus, and computer program product for displaying virtual graphical data based on digital signatures
US20170039450A1 (en) Identifying Entities to be Investigated Using Storefront Recognition
CN112232311B (en) Face tracking method and device and electronic equipment
US20230035366A1 (en) Image classification model training method and apparatus, computer device, and storage medium
CN111832561B (en) Character sequence recognition method, device, equipment and medium based on computer vision
CN114385662A (en) Road network updating method and device, storage medium and electronic equipment
CN112036517B (en) Image defect classification method and device and electronic equipment
CN110263779A (en) Text filed detection method and device, Method for text detection, computer-readable medium
CN110781809A (en) Identification method and device based on registration feature update and electronic equipment
CN113269730B (en) Image processing method, image processing device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant