CN113033531A - Method and device for recognizing text in image and electronic equipment

Method and device for recognizing text in image and electronic equipment

Info

Publication number
CN113033531A
Authority
CN
China
Prior art keywords
text
image
fusion
layer
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911374226.7A
Other languages
Chinese (zh)
Other versions
CN113033531B (en)
Inventor
崔淼 (Cui Miao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xiaoi Robot Technology Co Ltd
Original Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xiaoi Robot Technology Co Ltd filed Critical Shanghai Xiaoi Robot Technology Co Ltd
Priority to CN201911374226.7A priority Critical patent/CN113033531B/en
Publication of CN113033531A publication Critical patent/CN113033531A/en
Application granted granted Critical
Publication of CN113033531B publication Critical patent/CN113033531B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/22 - Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method, an apparatus and an electronic device for recognizing text in an image, wherein the method comprises the following steps: acquiring an image to be processed that contains text; locating a text region in the image; marking a plurality of reference points of the text region and acquiring the coordinates of the reference points, wherein the reference points comprise a plurality of boundary points and a plurality of central axis points of the text region; correcting the text region according to the coordinates of the reference points to obtain a corrected region; extracting text features of the corrected region and performing pyramid feature fusion on the text features to obtain fusion features; and recognizing the characters contained in the text region according to the fusion features. With these steps, the method for recognizing text in an image can correct the image more accurately, improves the accuracy of character recognition, reduces the number of parameters during processing, and improves processing efficiency.

Description

Method and device for recognizing text in image and electronic equipment
Technical Field
The present invention relates to the field of image processing, and in particular, to a method and an apparatus for recognizing text in an image, and an electronic device.
Background
As research on computer vision technology deepens, character recognition in image scenes has received increasing attention. Text recognition in image scenes aims at converting the text regions of an image into machine-readable symbols and automatically decoding them into characters. Complex natural scenes, irregular shapes, similar-looking characters and uneven illumination all make this difficult; in particular, in some scenes the characters are often distorted or tilted, which further increases the difficulty of character recognition.
To handle such cases, conventional methods generally correct the image with approaches such as a spatial transformer network (STN), but when the distortion is severe the correction itself easily introduces further deformation. Moreover, some existing recognition methods suffer from slow training and low recognition accuracy, and their recognition models are usually very large, making it difficult to meet the requirements of commercial deployment.
Disclosure of Invention
In order to solve these problems, the invention provides a method for recognizing text in images, which can effectively improve the accuracy of character recognition.
A method of text recognition in an image, comprising the steps of:
acquiring an image to be processed containing a text;
locating a text region in the image;
marking a plurality of reference points of the text area, and acquiring coordinates of the reference points, wherein the reference points comprise a plurality of boundary points and a plurality of central axis points of the text area;
correcting the text area according to the coordinates of the reference point to obtain a corrected area;
extracting text features of the correction area, and performing pyramid feature fusion on the text features to obtain fusion features;
and identifying characters contained in the text area according to the fusion characteristics.
Optionally, the correcting the text region according to the coordinates of the reference point, and obtaining a corrected region includes:
inputting the coordinates of the reference points into a trained space transformation network, and acquiring the coordinates of correction points corresponding to the reference points;
and correcting the text region according to the coordinates of the reference point and the coordinates of the correction point to obtain a corrected region.
Optionally, the extracting text features of the correction region and performing pyramid feature fusion on the text features to obtain fusion features includes:
inputting the correction area into a trained lightweight detection network, and performing down-sampling on the correction area through a Stem layer;
passing the down-sampled data sequentially through each transition layer of the lightweight detection network;
and carrying out pyramid feature fusion according to the output results of the Stem layer and the transition layers to obtain the fusion features.
Optionally, the performing pyramid feature fusion according to the output results of the Stem layer and the transition layers to obtain the fusion features includes:
selecting a plurality of convolution layers from the Stem layer, selecting a plurality of transition layers from each transition layer, and performing pyramid feature fusion according to output results of the plurality of convolution layers and the plurality of transition layers to obtain the fusion feature.
Optionally, the selecting a plurality of convolutional layers from the Stem layer, selecting a plurality of transition layers from each transition layer, and performing pyramid feature fusion according to the output results of the plurality of convolutional layers and the plurality of transition layers to obtain the fusion feature includes:
selecting output results of a first convolution layer and three transition layers of the lightweight detection network to perform pyramid feature fusion, wherein the first convolution layer is selected from the first two convolution layers of the Stem layer;
sequentially passing the output result of the first convolution layer through the three-layer transition layer, wherein the scales of the input data of the three-layer transition layer are 1/2, 1/4 and 1/8 of the correction area respectively;
taking the output result of the third transition layer as a fourth depth feature, performing up-sampling, and adding the up-sampling and the output result of the second transition layer to obtain a third depth feature;
the third depth feature is up-sampled and added with the output result of the first transition layer to obtain a second depth feature;
up-sampling the second depth feature, and adding the up-sampled second depth feature and the output result of the first convolution layer to obtain a first depth feature;
fusing each of the depth features into the fused feature.
Optionally, the identifying the text included in the text region according to the fusion feature includes:
adopting a trained bidirectional long-short term memory network to carry out sequence prediction on the fusion characteristics to obtain an identification result sequence;
and decoding to obtain characters contained in the text area according to the identification result sequence.
Optionally, after the sequence prediction is performed on the fusion features by using the trained bidirectional long and short term memory network to obtain a recognition result sequence, and before the characters contained in the text region are decoded according to the recognition result sequence, the method further includes the steps of:
and optimizing the recognition result sequence by using a loss function, and removing repeated characters and interval characters in the recognition result sequence.
The invention also provides a device for recognizing the text in the image, which comprises:
the data acquisition module is used for acquiring an image to be processed containing a text;
the region positioning module is used for positioning a text region in the image;
a reference point obtaining module, configured to mark multiple reference points of the text region and obtain coordinates of the reference points, where the reference points include multiple boundary points and multiple central axis points of the text region;
the image correction module is used for correcting the text region according to the coordinates of the reference point to obtain a corrected region;
the feature processing module is used for extracting text features of the correction area, and performing pyramid feature fusion on the text features to obtain fusion features;
and the character recognition module is used for recognizing characters contained in the text area according to the fusion characteristics.
The invention also provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements any one of the above methods for recognizing text in an image.
The invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements any one of the above methods for recognizing text in an image.
According to the method for recognizing text in an image provided by the invention, central axis points are introduced as reference points when correcting the text region, and the distorted and skewed text region is constrained jointly by the boundary points and the central axis points. This yields a better constraining effect and avoids further stretching and deformation during correction. In addition, the invention can use a lightweight feature extraction network, which reduces the number of parameters and increases processing speed. Combining the recognition and correction networks effectively improves the recognition of skewed and curved text and the robustness of text recognition, so that character recognition can be applied in richer scenes.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a schematic flow chart illustrating a method for recognizing text in an image according to an embodiment of the present invention;
FIG. 2(a) is a schematic diagram of boundary point selection according to an embodiment of the present invention;
FIG. 2(b) is a schematic diagram of selecting boundary points and central points according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for recognizing text in an image according to a second embodiment of the present invention;
FIG. 4 is a schematic diagram of a feature fusion method according to a second embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an apparatus for recognizing text in an image according to a fourth embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
In an embodiment of the present invention, as shown in fig. 1, a method for recognizing a text in an image is provided, which specifically includes the following steps:
step S110: acquiring an image to be processed containing a text;
step S120: locating a text region in the image;
step S130: marking a plurality of reference points of the text area, and acquiring coordinates of the reference points, wherein the reference points comprise a plurality of boundary points and a plurality of central axis points of the text area;
step S140: correcting the text area according to the coordinates of the reference point to obtain a corrected area;
step S150: extracting text features of the correction area, and performing pyramid feature fusion on the text features to obtain fusion features;
step S160: and identifying characters contained in the text area according to the fusion characteristics.
In this embodiment, a text region of an image exhibiting distortions such as warping, deformation and angular tilt is corrected; by simultaneously constraining a plurality of boundary points and a plurality of central axis points of the text region, the accuracy of the correction result can be improved considerably.
The processing target of the present embodiment is image data containing text, and the image may be a general RGB image including color information. The image to be processed may correspond to different image formats, different storage formats, and different compression modes, which are all within the protection scope of the present invention.
After the image to be processed is obtained, the region where the text is located is positioned in step S120. The purpose of positioning is to better delimit the edge information and center information of the selected region during correction; the method used to locate the text region is not limited in this embodiment.
In the next step S130, as shown in fig. 2, a plurality of boundary points are marked on the edge of the text region, a plurality of central axis points are selected along the central axis of the text region as reference points for image correction, and position coordinates of the reference points in the original image are obtained. Conventional spatial transformation methods tend to rely only on edge information, such as the upper and lower boundaries of text, but this tends to blur the glyph after image rectification, as shown in fig. 2 (a). Thus, in the present embodiment, the rectification transformation is constrained by the symmetric text boundary points along with the central axis point, as shown in fig. 2 (b).
When obtaining the reference points, preferably 5-10 reference points are selected on each of the upper boundary, the lower boundary and the central axis of the text region, according to the number and length of the characters, which improves correction accuracy while keeping processing efficiency in mind.
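As a concrete illustration of this sampling step, the sketch below draws k evenly spaced reference points along the upper and lower boundary polylines and takes the central axis points as their midpoints. The function name, the arc-length resampling and the midpoint convention are assumptions made for illustration, not the patented procedure itself:

```python
import numpy as np

def sample_reference_points(top, bottom, k=7):
    """Sample k reference points each on the upper boundary, lower boundary and
    central axis of a located text region.

    `top` and `bottom` are (N, 2) polylines traced along the two text edges;
    taking the central-axis points as midpoints of the two boundaries is an
    assumption for this sketch. k in the 5-10 range follows the text above.
    """
    def resample(poly, k):
        # arc-length parameterization, then k evenly spaced samples
        seg = np.linalg.norm(np.diff(poly, axis=0), axis=1)
        t = np.concatenate([[0.0], np.cumsum(seg)]) / max(seg.sum(), 1e-9)
        ts = np.linspace(0.0, 1.0, k)
        return np.stack([np.interp(ts, t, poly[:, 0]),
                         np.interp(ts, t, poly[:, 1])], axis=1)

    top_pts = resample(np.asarray(top, float), k)
    bot_pts = resample(np.asarray(bottom, float), k)
    axis_pts = (top_pts + bot_pts) / 2.0          # central-axis reference points
    return top_pts, bot_pts, axis_pts

top = [(0, 10), (50, 0), (100, 8)]                # a slightly curved upper edge
bottom = [(0, 40), (50, 30), (100, 38)]
t_pts, b_pts, a_pts = sample_reference_points(top, bottom, k=7)
```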
In step S140, the text region is corrected according to the coordinates of each reference point. By performing a coordinate transformation on each reference point, the originally distorted and tilted boundary is mapped to a straight boundary region, with the central symmetry axis acting as an additional constraint; this yields a mapping from reference points to correction points, and the remaining pixels in the region are then processed according to this mapping to complete the correction transformation.
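To make the reference-point-to-correction-point mapping concrete, the sketch below pairs the marked reference points with a target layout in a straight rectangle: upper-boundary points go to the top edge, lower-boundary points to the bottom edge, and central-axis points to the horizontal midline. The target rectangle size and the even spacing are assumptions, and the actual per-pixel warp (for example a thin-plate-spline fit to these pairs) is left out:

```python
import numpy as np

def target_points(num_per_line: int, width: float, height: float) -> np.ndarray:
    """Build the corrected-point layout for a straight text rectangle.

    The distorted upper-boundary points map to the top edge, the
    lower-boundary points to the bottom edge, and the central-axis
    points to the horizontal midline (the extra constraint introduced
    by this method).
    """
    xs = np.linspace(0.0, width, num_per_line)
    top = np.stack([xs, np.zeros_like(xs)], axis=1)                 # upper boundary -> y = 0
    bottom = np.stack([xs, np.full_like(xs, height)], axis=1)       # lower boundary -> y = height
    axis = np.stack([xs, np.full_like(xs, height / 2.0)], axis=1)   # central axis -> midline
    return np.concatenate([top, bottom, axis], axis=0)

# Example: 7 reference points per line, target rectangle of 100 x 32 pixels.
# `ref_pts` would hold the coordinates marked on the distorted text region in
# the same (top, bottom, axis) order; here it is random placeholder data.
ref_pts = np.random.rand(21, 2) * 100.0
tgt_pts = target_points(7, width=100.0, height=32.0)
# A thin-plate-spline or similar warp fitted to (ref_pts -> tgt_pts) would
# then be applied to every pixel of the region to finish the correction.
print(ref_pts.shape, tgt_pts.shape)  # (21, 2) (21, 2)
```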
Alternatively, in step S140, when the text region is corrected, the correction may be performed by:
step S141: inputting the coordinates of the reference points into a trained space transformation network, and acquiring the coordinates of correction points corresponding to the reference points;
step S142: and correcting the text region according to the coordinates of the reference point and the coordinates of the correction point to obtain a corrected region.
In step S141, the trained spatial transformation network is used to obtain the coordinates of the correction points; for example, the correction-point coordinates corresponding to the reference points may be predicted with a residual network. When training the spatial transformation network, the training samples consist of the reference-point coordinates of images to be corrected and the correction-point coordinates of the corresponding normal images, and through repeated training the network learns the mapping used for image correction.
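A minimal sketch of such a correction-point regressor is given below, assuming a small fully connected network with one residual connection trained with mean-squared error; the layer widths, the flattened (x, y) input layout and the loss are illustrative choices, not the exact network of this embodiment:

```python
import torch
import torch.nn as nn

class CorrectionPointRegressor(nn.Module):
    """Predicts correction-point coordinates from reference-point coordinates.

    A sketch only: layer widths, the single residual connection and the
    flattened (x, y) input layout are assumptions, not the patent's exact
    network.
    """
    def __init__(self, num_points: int = 21, hidden: int = 128):
        super().__init__()
        dim = num_points * 2                      # flattened (x, y) pairs
        self.inp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.res = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.out = nn.Linear(hidden, dim)

    def forward(self, ref_coords: torch.Tensor) -> torch.Tensor:
        h = self.inp(ref_coords)
        h = h + self.res(h)                       # residual-style refinement
        return self.out(h)                        # predicted correction coords

# Training pairs the reference coordinates of distorted samples with the
# correction coordinates of their straight counterparts (MSE regression).
model = CorrectionPointRegressor()
ref = torch.rand(4, 42)                           # batch of 4, 21 points each
pred = model(ref)
loss = nn.functional.mse_loss(pred, torch.rand(4, 42))
loss.backward()
```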
After image correction is completed, a corrected text region is obtained. In step S150, text features are extracted from the corrected region, and a fusion feature is obtained by pyramid feature fusion. The fusion feature better reflects the difference between characters and background in the image, so in step S160 character recognition can be performed more reliably on the basis of the fusion feature.
Optionally, in this embodiment, step S160 may be specifically implemented by the following steps:
step S161: adopting a trained bidirectional long-short term memory network to carry out sequence prediction on the fusion characteristics to obtain an identification result sequence;
step S162: and decoding to obtain characters contained in the text area according to the identification result sequence.
The bidirectional long short-term memory network can process sequence information of arbitrary length using context in both directions, converting the input image features into labels; the labels are handled in vectorized form as a sparse matrix, so that the distribution of preceding and following label information of the feature sequence is obtained. The characters contained in the text region are then decoded from the labels in the recognition result sequence.
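The sketch below shows one way such a sequence-prediction head could be wired, assuming the fusion feature map is pooled over its height and its width is treated as the time axis; the hidden size, number of layers and class count are assumptions:

```python
import torch
import torch.nn as nn

class SequencePredictor(nn.Module):
    """BiLSTM head that turns the fusion feature map into per-step label scores.

    A sketch under assumptions: the height pooling, hidden size and number of
    classes are illustrative choices, not values from the patent.
    """
    def __init__(self, in_channels: int = 128, hidden: int = 256, num_classes: int = 37):
        super().__init__()
        self.rnn = nn.LSTM(in_channels, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)    # 2x for both directions

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (B, C, H, W) -> pool the height, use the width as time steps
        seq = fused.mean(dim=2).permute(0, 2, 1)        # (B, W, C)
        out, _ = self.rnn(seq)                          # (B, W, 2*hidden)
        return self.fc(out)                             # (B, W, num_classes)

head = SequencePredictor()
scores = head(torch.rand(2, 128, 8, 32))                # -> (2, 32, 37)
```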
Optionally, between the step S161 and the step S162, the method may further include:
step S163: and optimizing the recognition result sequence by using a loss function, and removing repeated characters and interval characters in the recognition result sequence.
Because of factors such as varying character spacing or slight deformation, the same character may produce different feature expressions during image processing, so adjacent labels in the recognition result sequence output by the bidirectional long short-term memory network may correspond to the same target character, causing spurious repetitions. A loss function is introduced to solve this problem: the trained loss model removes the separator (blank) labels and the repeated characters from the recognition result, yielding a more accurate result.
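A sketch of this post-processing is shown below: consecutive repeats of a label are merged and the separator (blank) label is dropped. Treating this as CTC-style decoding (and, correspondingly, training with a CTC-type loss such as torch.nn.CTCLoss) is an assumption on our part; the embodiment only specifies a loss function with this effect:

```python
def greedy_decode(label_ids, blank_id=0, id_to_char=None):
    """Collapse a per-step label sequence into text.

    Mirrors the post-processing described above: consecutive repeats of the
    same label are merged and the separator/blank label is dropped.
    """
    chars, prev = [], None
    for label in label_ids:
        if label != prev and label != blank_id:
            chars.append(id_to_char[label] if id_to_char else label)
        prev = label
    return "".join(str(c) for c in chars)

# "hheel-ll-loo" style raw output (0 is the blank/separator label)
raw = [8, 8, 5, 0, 12, 12, 0, 12, 15, 15]
mapping = {5: "e", 8: "h", 12: "l", 15: "o"}
print(greedy_decode(raw, blank_id=0, id_to_char=mapping))  # -> "hello"
```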
In this embodiment, the constraint used in image correction is strengthened mainly by adding central axis points, so that a more regular correction region is obtained, laying a good foundation for accurate character recognition in the subsequent steps.
Example two
In this embodiment, a method for recognizing a text in an image is provided, as shown in fig. 3, the specific steps include:
step S210: acquiring an image to be processed containing a text;
step S220: locating a text region in the image;
step S230: marking a plurality of reference points of the text area, and acquiring coordinates of the reference points, wherein the reference points comprise a plurality of boundary points and a plurality of central axis points of the text area;
step S240: correcting the text area according to the coordinates of the reference point to obtain a corrected area;
step S250: extracting text features of the correction area, and performing pyramid feature fusion on the text features to obtain fusion features;
step S260: and identifying characters contained in the text area according to the fusion characteristics.
Wherein, step S250 includes:
step S251: inputting the correction area into a trained lightweight detection network, and performing down-sampling on the correction area through a Stem layer;
step S252: passing the down-sampled data sequentially through each transition layer of the lightweight detection network;
step S253: and carrying out pyramid feature fusion according to the output results of the Stem layer and the transition layers to obtain the fusion features.
In this embodiment, the text features of the correction area are extracted by a lightweight detection network, and hierarchical pyramid feature fusion is performed. In this way, the amount of parameters in the feature processing is reduced, and the processing speed is increased. Other steps in this embodiment can be found in embodiment one.
Generally, when convolutional neural networks are used, performance is improved by making the network deeper or wider. A lightweight detection network instead improves processing capacity from the feature side, through feature reuse and bypass (skip) connections; it greatly reduces the number of network parameters and avoids the vanishing-gradient problem.
In this embodiment, after the correction region is input into the lightweight detection network, it first passes through the Stem layer. The Stem layer performs the first downsampling of the spatial dimensions of the input image and increases the number of channels, ensuring strong feature expression capability without adding much computation.
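A minimal sketch of such a Stem stage is shown below, assuming one stride-2 convolution for the spatial downsampling followed by a channel-widening convolution; the exact layer count, kernel sizes and channel widths of the patented Stem layer are assumptions here:

```python
import torch
import torch.nn as nn

class Stem(nn.Module):
    """First stage of the lightweight backbone: downsample once, widen channels."""
    def __init__(self, in_ch: int = 3, out_ch: int = 64):
        super().__init__()
        self.conv1 = nn.Sequential(              # stride 2: halves H and W
            nn.Conv2d(in_ch, out_ch // 2, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch // 2), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(              # widen channels at 1/2 scale
            nn.Conv2d(out_ch // 2, out_ch, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv2(self.conv1(x))

stem = Stem()
feat = stem(torch.rand(1, 3, 32, 256))           # -> (1, 64, 16, 128), 1/2 scale
```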
The down-sampled data then passes once through each transition layer of the lightweight detection network, and the image features of the correction region are extracted according to what the network has learned during training. Finally, pyramid feature fusion is performed on the output results of the Stem layer and the transition layers to obtain the fusion features.
In an optional implementation of this embodiment, a plurality of convolution layers may be selected from the Stem layer, a plurality of transition layers may be selected from among the transition layers, and pyramid feature fusion may be performed on the output results of the selected convolution layers and transition layers to obtain the fusion feature. That is, the output of every level of the network is not required; specific layers are selected, which reduces the number of operations and unnecessary resource consumption while preserving the feature fusion effect.
Further, in a preferred implementation of this embodiment, as shown in fig. 4, the output results of a first convolutional layer and three transition layers of the lightweight detection network are selected for pyramid feature fusion, where the first convolutional layer is selected from the first two convolutional layers of the Stem layer.
The Stem layer is composed of a plurality of convolution layers, and preferably, one of the first two convolution layers can be arbitrarily selected as a first convolution layer for subsequent feature fusion calculation.
The output of the first convolution layer is then passed sequentially through the three transition layers, whose input data have scales of 1/2, 1/4 and 1/8 of the correction region respectively. That is, the output of the first convolution layer, at 1/2 of the original correction region scale, is the input to the first transition layer; the output of the first transition layer, at 1/4 of the original scale, is the input to the second transition layer; and the output of the second transition layer, at 1/8 of the original scale, is the input to the third transition layer.
As shown in fig. 4, after the outputs of these layers are obtained in sequence, pyramid feature fusion is performed. The output of the third transition layer is taken as the fourth depth feature, up-sampled, and added to the output of the second transition layer to obtain the third depth feature; the third depth feature is up-sampled and added to the output of the first transition layer to obtain the second depth feature; the second depth feature is up-sampled and added to the output of the first convolution layer to obtain the first depth feature. The depth features are then fused into the fusion feature, for example by concatenating them.
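The sketch below wires up this scale chain and the top-down upsample-and-add fusion. The channel widths, the 1x1 lateral convolutions inserted so the additions are dimensionally valid, and the final concatenation at the finest scale are assumptions made to keep the example runnable; the embodiment itself only specifies upsample-and-add followed by fusion:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def transition(in_ch, out_ch):
    """Stride-2 block standing in for one transition layer of the backbone."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class PyramidFusion(nn.Module):
    """Top-down fusion of the first-convolution output and three transition layers."""
    def __init__(self, c0=64, chans=(128, 256, 512), fused_ch=128):
        super().__init__()
        self.t1 = transition(c0, chans[0])        # input at 1/2 scale
        self.t2 = transition(chans[0], chans[1])  # input at 1/4 scale
        self.t3 = transition(chans[1], chans[2])  # input at 1/8 scale
        # 1x1 lateral convs align channels before addition (an assumption).
        self.lat = nn.ModuleList([nn.Conv2d(c, fused_ch, 1)
                                  for c in (c0, chans[0], chans[1], chans[2])])

    def forward(self, conv1_out: torch.Tensor) -> torch.Tensor:
        o1 = self.t1(conv1_out)                   # 1/4 scale
        o2 = self.t2(o1)                          # 1/8 scale
        o3 = self.t3(o2)                          # 1/16 scale
        d4 = self.lat[3](o3)                                   # fourth depth feature
        d3 = self.up(d4, o2) + self.lat[2](o2)                 # third depth feature
        d2 = self.up(d3, o1) + self.lat[1](o1)                 # second depth feature
        d1 = self.up(d2, conv1_out) + self.lat[0](conv1_out)   # first depth feature
        # Fuse the depth features, e.g. resize to the finest scale and concatenate.
        feats = [d1] + [self.up(d, conv1_out) for d in (d2, d3, d4)]
        return torch.cat(feats, dim=1)

    @staticmethod
    def up(x, ref):
        return F.interpolate(x, size=ref.shape[-2:], mode="bilinear", align_corners=False)

fusion = PyramidFusion()
fused = fusion(torch.rand(1, 64, 16, 128))        # conv1 output at 1/2 scale
```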
During pyramid feature fusion, conventional sampling methods such as deconvolution and feature-map interpolation tend to shrink the convolution receptive field and lose the feature information of small text targets. Preferably, in this embodiment, dilated (atrous, "hole") convolution is used when up-sampling each depth feature, which enlarges the receptive field of the convolution so that more global information about long text strings can be extracted.
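One way to realize this, shown as a sketch under assumptions, is to pair bilinear upsampling with a dilation-2 convolution so the enlarged feature map is refined with a wide receptive field; the specific dilation rate and normalization are illustrative choices, not values from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedUpsample(nn.Module):
    """Upsample a depth feature, then refine it with a dilated (atrous) convolution."""
    def __init__(self, channels: int = 128, dilation: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor, scale: int = 2) -> torch.Tensor:
        x = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)
        return F.relu(self.bn(self.conv(x)))      # dilation widens the receptive field

up = DilatedUpsample()
out = up(torch.rand(1, 128, 4, 32))               # -> (1, 128, 8, 64)
```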
In this embodiment, combining the lightweight network with feature pyramid fusion overcomes the drawback that a traditional network needs a large amount of training data to reach a given level of generalization; it greatly reduces the number of parameters and speeds up recognition, making the method better suited to commercial requirements.
EXAMPLE III
In the embodiment, a method for recognizing a text in an image is provided, which includes the following steps:
step S1: acquiring an image to be processed containing a text;
step S2: locating a text region in the image;
step S3: marking a plurality of reference points of the text area, and acquiring coordinates of the reference points, wherein the reference points comprise a plurality of boundary points and a plurality of central axis points of the text area;
step S4: inputting the coordinates of the reference points into a trained space transformation network, and acquiring the coordinates of correction points corresponding to the reference points;
step S5: correcting the text region according to the coordinates of the reference point and the coordinates of the correction point to obtain a correction region;
step S6: inputting the correction region into a trained lightweight detection network, performing down-sampling on the correction region through a Stem layer, and selecting a first convolutional layer from the first two convolutional layers of the Stem layer;
step S7: sequentially passing the output result of the first convolution layer through the three-layer transition layer, wherein the scales of the input data of the three-layer transition layer are 1/2, 1/4 and 1/8 of the correction area respectively;
step S8: taking the output result of the third transition layer as a fourth depth feature, performing up-sampling, and adding the up-sampling and the output result of the second transition layer to obtain a third depth feature;
step S9: the third depth feature is up-sampled and added with the output result of the first transition layer to obtain a second depth feature;
step S10: up-sampling the second depth feature, and adding the up-sampled second depth feature and the output result of the first convolution layer to obtain a first depth feature;
step S11: fusing each of the depth features into the fused feature.
Step S12: adopting a trained bidirectional long-short term memory network to carry out sequence prediction on the fusion characteristics to obtain an identification result sequence;
step S13: optimizing the recognition result sequence by using a loss function, and removing repeated characters and interval characters in the recognition result sequence;
step S14: and decoding to obtain characters contained in the text area according to the identification result sequence.
In this embodiment, through the above steps, when correcting character images with distortions such as tilt and bending, better correction results are obtained by jointly constraining the boundary points and the central axis points of the character region, avoiding problems such as character deformation and stretching during the process; by adopting the lightweight network and feature pyramid fusion, the number of parameters is greatly reduced and recognition is faster. The method and device not only improve the recognition of text in distorted images but also improve the robustness of text recognition, so recognition can be applied in richer scenes. Experimental results show that, compared with a classical convolutional recurrent neural network, the recognition rate of the character recognition scheme in this embodiment can be improved by about 3%.
Example four
The present embodiment provides an apparatus for recognizing a text in an image, as shown in fig. 5, including:
the data acquisition module 10 is used for acquiring an image to be processed containing a text;
a region locating module 20, configured to locate a text region in the image;
a reference point obtaining module 30, configured to mark multiple reference points of the text region, and obtain coordinates of the reference points, where the reference points include multiple boundary points and multiple central axis points of the text region;
the image correction module 40 is configured to correct the text region according to the coordinates of the reference point to obtain a corrected region;
the feature processing module 50 is configured to extract text features of the correction region, perform pyramid feature fusion on the text features, and obtain fusion features;
and a text recognition module 60, configured to recognize the text included in the text region according to the fusion feature.
The device for recognizing the text in the image in the embodiment can effectively improve the accuracy of image correction through the common constraint of the boundary point and the central axis point.
Preferably, in this embodiment, the image rectification module 40 includes a trained spatial transform network, and can obtain coordinates of a corresponding rectification point according to the coordinates of the reference point, and the image rectification module 40 rectifies the text region according to the coordinates of the reference point and the coordinates of the rectification point, so as to obtain a rectification region.
Preferably, in this embodiment, the feature processing module 50 includes a trained lightweight detection network comprising at least a Stem layer and a plurality of transition layers. The correction region is input into the trained lightweight detection network and down-sampled through the Stem layer; the down-sampled data then passes sequentially through each transition layer of the lightweight detection network to produce a number of feature outputs. The feature processing module 50 further includes a feature fusion unit 51, configured to perform pyramid feature fusion on the output results of the Stem layer and each transition layer to obtain the fusion feature.
Preferably, the feature fusion unit 51 may select a plurality of convolution layers from the Stem layer, select a plurality of transition layers from each transition layer, and perform pyramid feature fusion according to output results of the plurality of convolution layers and the plurality of transition layers to obtain the fusion feature. Through certain screening, a better level is selected, and the processing efficiency is improved.
Further, the feature fusion unit 51 selects output results of a first convolution layer and three transition layers of the lightweight detection network to perform pyramid feature fusion, where the first convolution layer is selected from the first two convolution layers of the Stem layer;
sequentially passing the output result of the first convolution layer through the three-layer transition layer, wherein the scales of the input data of the three-layer transition layer are 1/2, 1/4 and 1/8 of the correction area respectively;
taking the output result of the third transition layer as a fourth depth feature, performing up-sampling, and adding the up-sampling and the output result of the second transition layer to obtain a third depth feature; the third depth feature is up-sampled and added with the output result of the first transition layer to obtain a second depth feature; up-sampling the second depth feature, and adding the up-sampled second depth feature and the output result of the first convolution layer to obtain a first depth feature; finally, each of the depth features is fused into the fused feature.
Optionally, in this embodiment, the text recognition module 60 includes a trained bidirectional long-short term memory network, and is configured to perform sequence prediction on the fusion features to obtain a recognition result sequence. In addition, the text recognition module 60 further includes a decoding unit 61, configured to decode the text included in the text region according to the recognition result sequence.
Further, the character recognition module 60 further includes an optimization unit 62, where the optimization unit 62 optimizes the recognition result sequence by using a loss function to remove repeated characters and space characters in the recognition result sequence.
The embodiment further adopts the lightweight network and the feature pyramid fusion, so that the parameter quantity of the operation is greatly reduced, the recognition speed is higher, the robustness of text recognition is improved, and the recognition scene is richer.
EXAMPLE five
It should be noted that the method for recognizing text in an image according to the embodiment of the present application may be integrated into the electronic device 90 as a software module and/or a hardware module, in other words, the electronic device 90 may integrate the method for recognizing text in an image according to the embodiment. For example, the text-in-image recognition method may be applied to a software module in the operating system of the electronic device 90, or may be applied to an application program developed therefor; of course, the text-in-image recognition method may also be incorporated into a device that is one of the hardware modules of the electronic device 90.
In another embodiment of the present application, the carrier integrated with the text recognition method in the image and the electronic device 90 may also be separate devices (e.g., a server), and the carrier integrated with the text recognition method in the image may be connected to the electronic device 90 through a wired and/or wireless network and transmit the interactive information according to an agreed data format.
Fig. 6 is a schematic structural diagram of an electronic device 90 according to an embodiment of the present application. As shown in fig. 6, the electronic apparatus 90 includes: one or more processors 91 and memory 92; and computer program instructions stored in the memory 92 which, when executed by the processor 91, cause the processor 91 to perform a method of text recognition in an image as in any of the embodiments described above.
The processor 91 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 90 to perform desired functions.
Memory 92 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 91 to implement the steps of the text-in-image recognition methods of the various embodiments of the present application described above and/or other desired functions. Information such as light intensity, compensation light intensity, position of the filter, etc. may also be stored in the computer readable storage medium.
In one example, the electronic device 90 may further include: an input device 93 and an output device 94, which are interconnected by a bus system and/or other form of connection mechanism (not shown in fig. 6).
The output device 94 may output various information to the outside, and may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 90 relevant to the present application are shown in fig. 6, and components such as buses, input devices/output interfaces, and the like are omitted. In addition, the electronic device 90 may include any other suitable components, depending on the particular application.
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps of the method of text recognition in images as in any of the above-described embodiments.
The computer program product may include program code for carrying out operations of embodiments of the present application, written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps in the method of text in images recognition according to various embodiments of the present application described in the above-mentioned text in images recognition method section of the present specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It should be noted that in the apparatus and devices of the present application, the components may be disassembled and/or reassembled. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A method for recognizing text in an image is characterized by comprising the following steps:
acquiring an image to be processed containing a text;
locating a text region in the image;
marking a plurality of reference points of the text area, and acquiring coordinates of the reference points, wherein the reference points comprise a plurality of boundary points and a plurality of central axis points of the text area;
correcting the text area according to the coordinates of the reference point to obtain a corrected area;
extracting text features of the correction area, and performing pyramid feature fusion on the text features to obtain fusion features;
and identifying characters contained in the text area according to the fusion characteristics.
2. The method for recognizing text in an image according to claim 1, wherein said correcting the text region according to the coordinates of the reference point to obtain a corrected region comprises:
inputting the coordinates of the reference points into a trained space transformation network, and acquiring the coordinates of correction points corresponding to the reference points;
and correcting the text region according to the coordinates of the reference point and the coordinates of the correction point to obtain a corrected region.
3. The method for recognizing text in an image according to claim 1, wherein the extracting text features of the correction region and performing pyramid feature fusion on the text features to obtain fusion features comprises:
inputting the correction area into a trained lightweight detection network, and performing down-sampling on the correction area through a Stem layer;
passing the down-sampled data sequentially through each transition layer of the lightweight detection network;
and carrying out pyramid feature fusion according to the output results of the Stem layer and the transition layers to obtain the fusion features.
4. The method for recognizing text in an image according to claim 3, wherein said performing pyramid feature fusion according to the output results of said Stem layer and each transition layer to obtain said fusion features comprises:
selecting a plurality of convolution layers from the Stem layer, selecting a plurality of transition layers from each transition layer, and performing pyramid feature fusion according to output results of the plurality of convolution layers and the plurality of transition layers to obtain the fusion feature.
5. The method for recognizing text in an image according to claim 4, wherein said selecting a plurality of convolutional layers in said Stem layer, selecting a plurality of transition layers in each of said transition layers, and performing pyramid feature fusion based on output results of said plurality of convolutional layers and said plurality of transition layers to obtain said fusion feature comprises:
selecting output results of a first convolution layer and three transition layers of the lightweight detection network to perform pyramid feature fusion, wherein the first convolution layer is selected from the first two convolution layers of the Stem layer;
sequentially passing the output result of the first convolution layer through the three-layer transition layer, wherein the scales of the input data of the three-layer transition layer are 1/2, 1/4 and 1/8 of the correction area respectively;
taking the output result of the third transition layer as a fourth depth feature, performing up-sampling, and adding the up-sampling and the output result of the second transition layer to obtain a third depth feature;
the third depth feature is up-sampled and added with the output result of the first transition layer to obtain a second depth feature;
up-sampling the second depth feature, and adding the up-sampled second depth feature and the output result of the first convolution layer to obtain a first depth feature;
fusing each of the depth features into the fused feature.
6. The method for recognizing text in an image according to claim 1, wherein the recognizing the text included in the text region according to the fusion feature comprises:
adopting a trained bidirectional long-short term memory network to carry out sequence prediction on the fusion characteristics to obtain an identification result sequence;
and decoding to obtain characters contained in the text area according to the identification result sequence.
7. The method for recognizing text in images according to claim 6, wherein after the sequence prediction of the fused features is performed by using the trained bidirectional long short-term memory network to obtain a recognition result sequence, and before the characters contained in the text region are decoded according to the recognition result sequence, the method further comprises the steps of:
and optimizing the recognition result sequence by using a loss function, and removing repeated characters and interval characters in the recognition result sequence.
8. An apparatus for recognizing text in an image, comprising:
the data acquisition module is used for acquiring an image to be processed containing a text;
the region positioning module is used for positioning a text region in the image;
a reference point obtaining module, configured to mark multiple reference points of the text region and obtain coordinates of the reference points, where the reference points include multiple boundary points and multiple central axis points of the text region;
the image correction module is used for correcting the text region according to the coordinates of the reference point to obtain a corrected region;
the feature processing module is used for extracting text features of the correction area, and performing pyramid feature fusion on the text features to obtain fusion features;
and the character recognition module is used for recognizing characters contained in the text area according to the fusion characteristics.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a method of text recognition in an image according to any one of claims 1 to 7 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a method for recognizing text in an image according to any one of claims 1 to 7.
CN201911374226.7A 2019-12-24 2019-12-24 Method and device for identifying text in image and electronic equipment Active CN113033531B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911374226.7A CN113033531B (en) 2019-12-24 2019-12-24 Method and device for identifying text in image and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911374226.7A CN113033531B (en) 2019-12-24 2019-12-24 Method and device for identifying text in image and electronic equipment

Publications (2)

Publication Number Publication Date
CN113033531A true CN113033531A (en) 2021-06-25
CN113033531B CN113033531B (en) 2023-10-27

Family

ID=76458870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911374226.7A Active CN113033531B (en) 2019-12-24 2019-12-24 Method and device for identifying text in image and electronic equipment

Country Status (1)

Country Link
CN (1) CN113033531B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805131A (en) * 2018-05-22 2018-11-13 北京旷视科技有限公司 Text line detection method, apparatus and system
CN109829437A (en) * 2019-02-01 2019-05-31 北京旷视科技有限公司 Image processing method, text recognition method, device and electronic system
CN110287960A (en) * 2019-07-02 2019-09-27 中国科学院信息工程研究所 The detection recognition method of curve text in natural scene image
CN110427938A (en) * 2019-07-26 2019-11-08 中科视语(北京)科技有限公司 A kind of irregular character recognition device and method based on deep learning
CN110598673A (en) * 2019-09-24 2019-12-20 电子科技大学 Remote sensing image road extraction method based on residual error network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XI LIU 等: "Scene Text Detection with Feature Pyramid Network and Linking Segments", 《IEEE》 *
XUN GAO 等: "An End-to-End Neural Network for Road Extraction From Remote Sensing Imagery by Multiple Feature Pyramid Network", 《IEEE》 *

Also Published As

Publication number Publication date
CN113033531B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
CN112348783B (en) Image-based person identification method and device and computer-readable storage medium
US20200065601A1 (en) Method and system for transforming handwritten text to digital ink
CN108520247A (en) To the recognition methods of the Object node in image, device, terminal and readable medium
CN110287952B (en) Method and system for recognizing characters of dimension picture
CN110689012A (en) End-to-end natural scene text recognition method and system
CN111144411B (en) Irregular text correction and identification method and system based on saliency map
CN112686134B (en) Handwriting recognition method, handwriting recognition device, electronic equipment and storage medium
CN110879972B (en) Face detection method and device
CN112800955A (en) Remote sensing image rotating target detection method and system based on weighted bidirectional feature pyramid
CN116030454B (en) Text recognition method and system based on capsule network and multi-language model
US20210056429A1 (en) Apparatus and methods for converting lineless tables into lined tables using generative adversarial networks
CN109598185A (en) Image recognition interpretation method, device, equipment and readable storage medium storing program for executing
CN116645592A (en) Crack detection method based on image processing and storage medium
CN113435436A (en) Scene character recognition method based on linear constraint correction network
CN113537187A (en) Text recognition method and device, electronic equipment and readable storage medium
CN117152768A (en) Off-line identification method and system for scanning pen
CN111985471A (en) License plate positioning method and device and storage medium
CN112308057A (en) OCR (optical character recognition) optimization method and system based on character position information
CN113033531A (en) Method and device for recognizing text in image and electronic equipment
CN113989269B (en) Traditional Chinese medicine tongue image tooth trace automatic detection method based on convolutional neural network multi-scale feature fusion
CN115205624A (en) Cross-dimension attention-convergence cloud and snow identification method and equipment and storage medium
CN113221718A (en) Formula identification method and device, storage medium and electronic equipment
CN116311275B (en) Text recognition method and system based on seq2seq language model
US20240143897A1 (en) Generating a multi-modal vector representing a source font and identifying a recommended font utilizing a multi-modal font machine-learning model
CN112825141B (en) Method and device for recognizing text, recognition equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant