CN113033531B - Method and device for identifying text in image and electronic equipment - Google Patents

Method and device for identifying text in image and electronic equipment Download PDF

Info

Publication number
CN113033531B
CN113033531B
Authority
CN
China
Prior art keywords
text
fusion
image
feature
coordinates
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911374226.7A
Other languages
Chinese (zh)
Other versions
CN113033531A (en)
Inventor
崔淼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xiaoi Robot Technology Co Ltd
Original Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xiaoi Robot Technology Co Ltd filed Critical Shanghai Xiaoi Robot Technology Co Ltd
Priority to CN201911374226.7A priority Critical patent/CN113033531B/en
Publication of CN113033531A publication Critical patent/CN113033531A/en
Application granted granted Critical
Publication of CN113033531B publication Critical patent/CN113033531B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a method and apparatus for recognizing text in an image, and an electronic device. The method comprises the following steps: acquiring an image to be processed that contains text; locating a text region in the image; marking a plurality of reference points of the text region and acquiring their coordinates, wherein the reference points include a plurality of boundary points and a plurality of central-axis points of the text region; correcting the text region according to the coordinates of the reference points to obtain a corrected region; extracting text features from the corrected region and performing pyramid feature fusion on them to obtain fusion features; and recognizing the characters contained in the text region according to the fusion features. With these steps, the method can rectify the image accurately, improve the accuracy of character recognition, reduce the number of parameters involved in processing, and improve processing efficiency.

Description

Method and device for identifying text in image and electronic equipment
Technical Field
The present application relates to the field of image processing, and in particular, to a method and apparatus for identifying text in an image, and an electronic device.
Background
With the continued deepening of research into computer vision, character recognition in image scenes is receiving more and more attention. Chinese character recognition in an image scene aims to convert the text region of an image into machine-readable symbols and automatically decode them into characters. Complex natural scenes, irregular shapes, similar-looking characters, and uneven illumination, and in particular scenes in which the characters are distorted or tilted, further increase the difficulty of character recognition.
In such cases, existing methods generally correct the image with a spatial transformation network (STN) or a similar mechanism, but severe distortion easily causes deformation during correction. In addition, existing recognition methods suffer from slow training and low recognition accuracy, and the recognition models are usually so large that they are difficult to productize.
Disclosure of Invention
To solve these problems, the application provides a method for recognizing text in an image that can effectively improve the accuracy of text recognition.
A method of text recognition in an image, comprising the steps of:
acquiring an image to be processed containing text;
locating text regions in the image;
marking a plurality of reference points of the text region, and acquiring coordinates of the reference points, wherein the reference points comprise a plurality of boundary points and a plurality of central axis points of the text region;
correcting the text region according to the coordinates of the reference points to obtain a corrected region;
extracting text features of the correction area, and carrying out pyramid feature fusion on the text features to obtain fusion features;
and identifying the characters contained in the text region according to the fusion characteristics.
Optionally, the correcting the text region according to the coordinates of the reference point, and obtaining the corrected region includes:
inputting the coordinates of the reference points into a trained space transformation network, and obtaining the coordinates of the correction points corresponding to the reference points;
and correcting the text region according to the coordinates of the reference point and the coordinates of the correction point to obtain a correction region.
Optionally, extracting the text feature of the correction area, performing pyramid feature fusion on the text feature, and obtaining the fusion feature includes:
inputting the correction area into a trained lightweight detection network, and downsampling the correction area through a Stem layer;
sequentially passing the downsampled data through each transition layer of the lightweight detection network;
and carrying out pyramid feature fusion according to the output results of the Stem layer and each transition layer to obtain the fusion features.
Optionally, the performing pyramid feature fusion according to the output results of the Stem layer and each transition layer, where obtaining the fusion feature includes:
and selecting a plurality of convolution layers from the Stem layers, selecting a plurality of transition layers from the transition layers, and performing pyramid feature fusion according to output results of the plurality of convolution layers and the plurality of transition layers to obtain the fusion features.
Optionally, selecting a plurality of convolution layers from the Stem layers, selecting a plurality of transition layers from the transition layers, and performing pyramid feature fusion according to output results of the plurality of convolution layers and the plurality of transition layers, where obtaining the fusion feature includes:
the output results of a first convolution layer and three transition layers of the lightweight detection network are selected to carry out pyramid feature fusion, wherein the first convolution layer is selected from the first two convolution layers of the Stem layer;
sequentially passing the output result of the first convolution layer through the three transition layers, wherein the scales of input data of the three transition layers are respectively 1/2, 1/4 and 1/8 of the correction area;
taking the output result of the third transition layer as a fourth depth characteristic, up-sampling, and adding the up-sampling result with the output result of the second transition layer to obtain a third depth characteristic;
upsampling the third depth feature and adding the upsampled third depth feature to the output result of the first transition layer to obtain a second depth feature;
upsampling the second depth feature and adding the upsampled second depth feature to the output result of the first convolutional layer to obtain a first depth feature;
and fusing each depth feature into the fusion feature.
Optionally, the identifying, according to the fusion feature, the text included in the text region includes:
performing sequence prediction on the fusion characteristics by adopting a trained two-way long-short-term memory network to obtain an identification result sequence;
and decoding to obtain the characters contained in the text region according to the recognition result sequence.
Optionally, after the sequence prediction is performed on the fusion feature by using the trained bidirectional long-short-term memory network to obtain an identification result sequence, before the text contained in the text region is decoded according to the identification result sequence, the method further comprises the steps of:
optimizing the recognition result sequence by using a loss function, and removing repeated characters and interval characters in the recognition result sequence.
The application also provides a device for identifying the text in the image, which comprises the following steps:
the data acquisition module is used for acquiring an image to be processed containing text;
the region positioning module is used for positioning a text region in the image;
a reference point obtaining module, configured to mark a plurality of reference points of the text region, and obtain coordinates of the reference points, where the reference points include a plurality of boundary points and a plurality of central axis points of the text region;
the image correction module is used for correcting the text region according to the coordinates of the reference point to obtain a correction region;
the feature processing module is used for extracting text features of the correction area, and carrying out pyramid feature fusion on the text features to obtain fusion features;
and the text recognition module is used for recognizing the text contained in the text region according to the fusion characteristics.
The application also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the method for identifying the text in the image when executing the program. The present application also provides a computer readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements any of the above-mentioned methods for identifying text in an image.
According to the method for recognizing text in an image provided by the application, central-axis points are introduced as reference points during correction of the text region, and the distorted and skewed text region is constrained jointly by the boundary points and the central-axis points. This yields a better constraint and avoids further stretching and deformation during correction. In addition, a lightweight feature-extraction network can be selected, which reduces the number of parameters and increases processing speed. By combining the recognition and correction networks, the application effectively improves the recognition of slanted and curved text and the robustness of text recognition, so that text recognition becomes applicable to a richer variety of scenes.
Drawings
The above and other objects, features and advantages of the present application will become more apparent from the following detailed description of embodiments of the present application with reference to the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this specification, are provided for a further understanding of the embodiments and illustrate the application together with its embodiments; they do not limit the application. In the drawings, like reference numerals generally refer to like parts or steps.
FIG. 1 is a flowchart of a method for recognizing text in an image according to an embodiment of the present application;
FIG. 2 (a) is a schematic diagram illustrating boundary point selection according to an embodiment of the application;
FIG. 2 (b) is a schematic view showing boundary point and center axis point selection in accordance with an embodiment of the present application;
FIG. 3 is a flowchart of a method for identifying text in an image according to a second embodiment of the present application;
FIG. 4 is a schematic diagram of a feature fusion method according to a second embodiment of the present application;
FIG. 5 is a schematic diagram of a device for recognizing text in an image according to a fourth embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without creative effort fall within the protection scope of the application.
Example 1
In one embodiment of the present application, as shown in fig. 1, there is provided a method for recognizing text in an image, which specifically includes the following steps:
step S110: acquiring an image to be processed containing text;
step S120: locating text regions in the image;
step S130: marking a plurality of reference points of the text region, and acquiring coordinates of the reference points, wherein the reference points comprise a plurality of boundary points and a plurality of central axis points of the text region;
step S140: correcting the text region according to the coordinates of the reference points to obtain a corrected region;
step S150: extracting text features of the correction area, and carrying out pyramid feature fusion on the text features to obtain fusion features;
step S160: and identifying the characters contained in the text region according to the fusion characteristics.
In this embodiment, text regions affected by warping, deformation, or angular tilt are corrected; by constraining multiple boundary points and multiple central-axis points of the text region at the same time, the accuracy of the correction result can be improved considerably.
The object processed in this embodiment is image data containing text; the image may be an ordinary RGB image containing color information. The image to be processed may be in any image format, storage format, or compression scheme, all of which fall within the protection scope of the application.
After the image to be processed is acquired, the area in which the text is located is determined in step S120. The purpose of locating it is to better define the edge and center information of the selected area during correction; this embodiment does not limit how the text region is located.
In the following step S130, as shown in fig. 2, a plurality of boundary points are marked on the edges of the text region and a plurality of central-axis points are selected along its central axis; these serve as reference points for image correction, and their position coordinates in the original image are obtained. Conventional spatial transformation methods often rely only on edge information such as the upper and lower boundaries of the text, which easily blurs the corrected glyphs, as shown in fig. 2 (a). In this embodiment the corrective transformation is therefore constrained jointly by the symmetric text boundary points and the central-axis points, as shown in fig. 2 (b).
When the reference points are acquired, preferably 5-10 reference points are selected on each of the upper boundary, the lower boundary, and the central axis of the text region according to the number and length of the characters, which improves correction accuracy while keeping processing efficient.
In step S140, the text region is corrected according to the coordinates of each reference point. Coordinate transformation of the reference points maps the originally distorted and tilted boundaries onto straight boundaries constrained by a central symmetry axis; this yields the mapping from reference points to correction points, and the remaining pixels in the region are then warped according to that mapping, completing the correction.
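The patent does not prescribe how the located text region is represented. As an illustration only, the sketch below assumes it is given as two polylines tracing the upper and lower edges and derives the central-axis points as midpoints of corresponding boundary points; all names and parameters are hypothetical.

```python
# A minimal sketch, assuming the text region is given as an upper-edge and a
# lower-edge polyline; central-axis points are taken as midpoints of the two.
import numpy as np

def sample_reference_points(top_edge, bottom_edge, num_points=7):
    """Return (upper, lower, axis) reference points, each of shape (num_points, 2)."""
    top_edge = np.asarray(top_edge, dtype=np.float32)
    bottom_edge = np.asarray(bottom_edge, dtype=np.float32)

    def resample(poly, n):
        # Evenly resample a polyline by arc length so points spread along the text.
        seg = np.linalg.norm(np.diff(poly, axis=0), axis=1)
        dist = np.concatenate([[0.0], np.cumsum(seg)])
        targets = np.linspace(0.0, dist[-1], n)
        return np.stack([np.interp(targets, dist, poly[:, k]) for k in (0, 1)], axis=1)

    upper = resample(top_edge, num_points)
    lower = resample(bottom_edge, num_points)
    axis = (upper + lower) / 2.0  # central-axis points between the two boundaries
    return upper, lower, axis

# Example: a slightly curved text region.
top = [(0, 10), (50, 0), (100, 10)]
bottom = [(0, 40), (50, 30), (100, 40)]
upper, lower, axis = sample_reference_points(top, bottom, num_points=7)
```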
Alternatively, in step S140, when the text region is corrected, the following steps may be implemented:
step S141: inputting the coordinates of the reference points into a trained space transformation network, and obtaining the coordinates of the correction points corresponding to the reference points;
step S142: and correcting the text region according to the coordinates of the reference point and the coordinates of the correction point to obtain a correction region.
In step S141, a trained spatial transformation network is used to obtain the coordinates of the correction points; for example, a residual network may be used to predict the correction-point coordinates corresponding to the reference points. When the spatial transformation network is trained, the training samples are the reference-point coordinates of images to be corrected together with the correction-point coordinates of the corresponding normal images; through repeated training the network learns the correspondence used for image correction.
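The patent only states that a residual network may be used to predict the correction points, so the PyTorch sketch below is one plausible realisation with illustrative layer sizes, not the patent's exact architecture; it regresses correction-point coordinates from the reference-point coordinates.

```python
# A hedged sketch: a small residual regressor from reference points to correction points.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1, self.fc2, self.act = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.ReLU()

    def forward(self, x):
        return self.act(x + self.fc2(self.act(self.fc1(x))))  # skip connection

class PointRectifier(nn.Module):
    """Maps 2*K reference-point coordinates to 2*K correction-point coordinates."""
    def __init__(self, num_points=21, hidden=128, num_blocks=2):
        super().__init__()
        self.inp = nn.Linear(num_points * 2, hidden)
        self.blocks = nn.Sequential(*[ResidualBlock(hidden) for _ in range(num_blocks)])
        self.out = nn.Linear(hidden, num_points * 2)

    def forward(self, ref_points):              # ref_points: (B, K, 2), normalized to [0, 1]
        b, k, _ = ref_points.shape
        x = self.blocks(self.inp(ref_points.reshape(b, -1)))
        return self.out(x).reshape(b, k, 2)     # predicted correction points

# Usage: 7 points each on the upper boundary, lower boundary and central axis = 21 points.
corrected = PointRectifier(num_points=21)(torch.rand(1, 21, 2))   # -> (1, 21, 2)
```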
After image correction is completed, a corrected text region is obtained. In step S150, text features are extracted from the corrected region and combined by pyramid feature fusion. The fusion features better separate the text from the background in the image, so text recognition can be performed more effectively on their basis in step S160.
Alternatively, in the present embodiment, step S160 may be specifically implemented by the following steps:
step S161: performing sequence prediction on the fusion characteristics by adopting a trained two-way long-short-term memory network to obtain an identification result sequence;
step S162: and decoding to obtain the characters contained in the text region according to the recognition result sequence.
The bidirectional long short-term memory network can process context of arbitrary length along the sequence and convert the input image features into labels, which are vectorized as a sparse matrix; each position in the resulting feature sequence thus carries both its preceding and its following context. The characters contained in the text region are then obtained by decoding the labels in the recognition result sequence.
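For illustration, a minimal PyTorch sketch of such a bidirectional LSTM head is shown below; the feature width, hidden size and label-set size are assumptions rather than values given in the patent.

```python
# A minimal sketch of the bidirectional LSTM sequence-prediction head.
import torch
import torch.nn as nn

class BiLSTMHead(nn.Module):
    def __init__(self, feat_dim=256, hidden=256, num_labels=5000):
        super().__init__()
        # Two stacked bidirectional LSTM layers read the fused feature sequence in
        # both directions, so every time step sees both left and right context.
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.proj = nn.Linear(hidden * 2, num_labels)   # per-step label logits

    def forward(self, feats):          # feats: (B, T, feat_dim), T = sequence length
        out, _ = self.rnn(feats)
        return self.proj(out)          # (B, T, num_labels): the recognition result sequence

# Usage: a width-25 fused feature sequence taken from the corrected text region.
logits = BiLSTMHead()(torch.rand(2, 25, 256))   # -> (2, 25, 5000)
```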
Optionally, between the step S161 and the step S162, the method may further include:
step S163: optimizing the recognition result sequence by using a loss function, and removing repeated characters and interval characters in the recognition result sequence.
Because characters may be spaced unevenly or slightly deformed, the same character can have different feature representations during image processing. As a result, adjacent labels in the recognition result sequence output by the bidirectional long short-term memory network may correspond to the same target character, producing repeated characters, and interval characters are a further problem. The loss function is introduced to solve this: the trained loss network removes interval characters and repeated characters from the recognition result, yielding a more accurate result.
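The patent does not name the loss function, but the behaviour it describes (dropping interval characters and collapsing repeats) matches CTC-style training and greedy decoding. Under that assumption, the post-processing could look like the following sketch; the blank index is hypothetical.

```python
# A hedged sketch assuming a CTC blank label plays the role of the interval character.
import torch

BLANK = 0  # assumed index of the interval (blank) character

def greedy_ctc_decode(logits):
    """logits: (T, num_labels) for one sample -> list of decoded label ids."""
    best = logits.argmax(dim=-1).tolist()        # most likely label at each time step
    decoded, prev = [], None
    for label in best:
        if label != BLANK and label != prev:     # drop blanks and collapse repeats
            decoded.append(label)
        prev = label
    return decoded

# During training, the matching alignment-free objective would be, for example:
# loss = torch.nn.CTCLoss(blank=BLANK)(log_probs, targets, input_lengths, target_lengths)
```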
In this embodiment, the constraints used during image correction are strengthened mainly by adding the central-axis points, so a more standard corrected region can be obtained, laying a good foundation for accurately recognizing the characters in the image in subsequent steps.
Example 2
In this embodiment, a method for identifying text in an image is provided, as shown in fig. 3, and specific steps include:
step S210: acquiring an image to be processed containing text;
step S220: locating text regions in the image;
step S230: marking a plurality of reference points of the text region, and acquiring coordinates of the reference points, wherein the reference points comprise a plurality of boundary points and a plurality of central axis points of the text region;
step S240: correcting the text region according to the coordinates of the reference points to obtain a corrected region;
step S250: extracting text features of the correction area, and carrying out pyramid feature fusion on the text features to obtain fusion features;
step S260: and identifying the characters contained in the text region according to the fusion characteristics.
Wherein, step S250 includes:
step S251: inputting the correction area into a trained lightweight detection network, and downsampling the correction area through a Stem layer;
step S252: sequentially passing the downsampled data through each transition layer of the lightweight detection network;
step S253: and carrying out pyramid feature fusion according to the output results of the Stem layer and each transition layer to obtain the fusion features.
In this embodiment, text features of the corrected region are extracted by a lightweight detection network and fused hierarchically with pyramid feature fusion. This reduces the number of parameters involved in feature processing and increases processing speed. For the other steps of this embodiment, refer to Example 1.
In general, the performance of a convolutional neural network is improved by making the network deeper or wider. A lightweight detection network instead improves processing capability, from the feature point of view, through feature reuse and bypass connections, which greatly reduces the number of network parameters and avoids the vanishing-gradient problem.
In this embodiment, after the corrected region is input into the lightweight detection network, it first passes through the Stem layer, which performs the first downsampling of the spatial dimensions of the input image and increases the number of channels, providing stronger feature expression capability without adding much computation.
The downsampled data then passes once through each transition layer of the lightweight detection network, each of which extracts image features of the corrected region according to what the network learned in training. Finally, pyramid feature fusion is performed on the output results of the Stem layer and the transition layers to obtain the fusion features.
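For illustration, a minimal PyTorch sketch of a Stem block that halves the spatial size while raising the channel count, together with a simple transition layer, is given below; the kernel sizes and channel counts are assumptions, not the patent's exact design.

```python
# A hedged sketch of a Stem block and transition layers for the lightweight network.
import torch
import torch.nn as nn

class Stem(nn.Module):
    """Downsamples the corrected region once and raises the channel count."""
    def __init__(self, in_ch=3, out_ch=32):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                                   nn.BatchNorm2d(out_ch), nn.ReLU())
        self.conv2 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
                                   nn.BatchNorm2d(out_ch), nn.ReLU())

    def forward(self, x):
        return self.conv2(self.conv1(x))   # spatial size halved, channels increased

class Transition(nn.Module):
    """A transition layer: 1x1 conv to mix features, then 2x downsampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1),
                                  nn.BatchNorm2d(out_ch), nn.ReLU())
        self.pool = nn.AvgPool2d(2)

    def forward(self, x):
        return self.pool(self.conv(x))

stem = Stem()
t1, t2, t3 = Transition(32, 64), Transition(64, 128), Transition(128, 256)
x = torch.rand(1, 3, 32, 256)          # corrected text region
c1 = stem(x)                           # 1/2 scale: input of the first transition layer
f1, f2, f3 = t1(c1), t2(f1), t3(f2)    # outputs at 1/4, 1/8, 1/16; the transition
                                       # inputs were at 1/2, 1/4, 1/8 scale respectively
```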
In an optional implementation of this embodiment, several convolution layers may be selected from the Stem layer and several transition layers from the transition layers, and pyramid feature fusion may be performed on their output results to obtain the fusion features. In other words, the output results of all levels of the network need not be used; specific layers are selected, which reduces the amount of computation and avoids unnecessary resource consumption while preserving the effect of feature fusion.
Further, in a preferred implementation manner of this embodiment, as shown in fig. 4, output results of a first convolution layer and three transition layers of the lightweight detection network are selected to perform pyramid feature fusion, where the first convolution layer is selected from the first two convolution layers of the Stem layer.
The Stem layer is composed of a plurality of convolution layers; preferably, either of its first two convolution layers may be selected as the first convolution layer used in the subsequent feature-fusion calculation.
The output result of the first convolution layer is then passed through the three transition layers in turn, where the scales of the input data of the three transition layers are 1/2, 1/4 and 1/8 of the corrected region, respectively. That is, the output of the first convolution layer, at 1/2 of the original corrected-region scale, is the input of the first transition layer; the output of the first transition layer, at 1/4 of the original scale, is the input of the second transition layer; and the output of the second transition layer, at 1/8 of the original scale, is the input of the third transition layer.
As shown in fig. 4, once the output results of these layers have been obtained in turn, pyramid feature fusion is performed. The output of the third transition layer is taken as the fourth depth feature and up-sampled, and the up-sampled result is added to the output of the second transition layer to obtain the third depth feature; the third depth feature is up-sampled and added to the output of the first transition layer to obtain the second depth feature; the second depth feature is up-sampled and added to the output of the first convolution layer to obtain the first depth feature. The depth features are then fused into the fusion features, for example by concatenating them together.
In the pyramid feature fusion process, conventional up-sampling methods such as deconvolution or feature-map interpolation tend to shrink the convolution receptive field and lose the feature information of small text targets. Preferably, in this embodiment hole (dilated) convolution is used when up-sampling each depth feature, which enlarges the receptive field of the convolution and extracts more global information about long text strings.
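A hedged sketch of this fusion scheme follows: the outputs of the first convolution layer and the three transition layers are merged top-down, and "hole convolution for up-sampling" is interpreted here as bilinear up-sampling followed by a dilated 3x3 convolution, which is an assumption about the exact realisation; the channel counts are illustrative.

```python
# A hedged sketch of the top-down pyramid fusion over the 1/2, 1/4, 1/8, 1/16 features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedUpsample(nn.Module):
    """2x up-sampling followed by a dilated conv to keep a wide receptive field."""
    def __init__(self, in_ch, out_ch, dilation=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return self.conv(x)

def pyramid_fuse(c1, t1, t2, t3, up43, up32, up21):
    """c1: 1/2-scale conv output; t1, t2, t3: transition outputs at 1/4, 1/8, 1/16 scale."""
    d4 = t3                     # fourth depth feature
    d3 = up43(d4) + t2          # third depth feature  (1/8 scale)
    d2 = up32(d3) + t1          # second depth feature (1/4 scale)
    d1 = up21(d2) + c1          # first depth feature  (1/2 scale)
    # Fuse all depth features at a common scale, e.g. resize to d1 and concatenate.
    rest = [F.interpolate(d, size=d1.shape[-2:], mode="bilinear", align_corners=False)
            for d in (d2, d3, d4)]
    return torch.cat([d1] + rest, dim=1)

# Shapes consistent with transition-layer inputs at 1/2, 1/4 and 1/8 of a 32x256 region:
c1 = torch.rand(1, 64, 16, 128)
t1, t2, t3 = torch.rand(1, 64, 8, 64), torch.rand(1, 64, 4, 32), torch.rand(1, 64, 2, 16)
up = lambda: DilatedUpsample(64, 64)
fused = pyramid_fuse(c1, t1, t2, t3, up(), up(), up())   # -> (1, 256, 16, 128)
```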
By introducing the lightweight network and feature pyramid fusion, this embodiment avoids the drawback that conventional networks need a large training set to reach adequate generalization; the number of parameters is greatly reduced and recognition is faster, which better suits the requirements of productization.
Example 3
In this embodiment, a method for identifying text in an image is provided, including the following steps:
step S1: acquiring an image to be processed containing text;
step S2: locating text regions in the image;
step S3: marking a plurality of reference points of the text region, and acquiring coordinates of the reference points, wherein the reference points comprise a plurality of boundary points and a plurality of central axis points of the text region;
step S4: inputting the coordinates of the reference points into a trained space transformation network, and obtaining the coordinates of the correction points corresponding to the reference points;
step S5: correcting the text region according to the coordinates of the reference point and the coordinates of the correction point to obtain a correction region;
step S6: inputting the corrected region into a trained lightweight detection network, downsampling the corrected region through the Stem layer, and arbitrarily selecting a first convolution layer from the first two convolution layers of the Stem layer;
step S7: sequentially passing the output result of the first convolution layer through the three transition layers, wherein the scales of input data of the three transition layers are respectively 1/2, 1/4 and 1/8 of the correction area;
step S8: taking the output result of the third transition layer as a fourth depth characteristic, up-sampling, and adding the up-sampling result with the output result of the second transition layer to obtain a third depth characteristic;
step S9: upsampling the third depth feature and adding the upsampled third depth feature to the output result of the first transition layer to obtain a second depth feature;
step S10: upsampling the second depth feature and adding the upsampled second depth feature to the output result of the first convolutional layer to obtain a first depth feature;
step S11: and fusing each depth feature into the fusion feature.
Step S12: performing sequence prediction on the fusion characteristics by adopting a trained two-way long-short-term memory network to obtain an identification result sequence;
step S13: optimizing the recognition result sequence by using a loss function, and removing repeated characters and interval characters in the recognition result sequence;
step S14: and decoding to obtain the characters contained in the text region according to the recognition result sequence.
With the above steps, when performing graphic correction on character images that are distorted, tilted, or curved, this embodiment constrains the correction jointly with the boundary points and the central-axis points of the character region, so a better correction result is obtained and problems such as character deformation and stretching during the process are avoided; by adopting the lightweight network and feature pyramid fusion, the number of parameters is greatly reduced and recognition is faster. This embodiment improves the recognition of text in distorted images, improves the robustness of text recognition, and makes the applicable recognition scenes richer. Experimental results show that, compared with a classical convolutional recurrent neural network, the recognition rate of the in-image Chinese character recognition scheme of this embodiment can be improved by about 3%.
Example 4
In this embodiment, a device for identifying text in an image is provided, as shown in fig. 5, including:
a data acquisition module 10, configured to acquire an image to be processed including text;
a region locating module 20 for locating text regions in the image;
a reference point obtaining module 30, configured to mark a plurality of reference points of the text region, and obtain coordinates of the reference points, where the reference points include a plurality of boundary points and a plurality of center axis points of the text region;
an image correction module 40, configured to correct the text region according to the coordinates of the reference point, so as to obtain a corrected region;
the feature processing module 50 is configured to extract text features of the correction area, perform pyramid feature fusion on the text features, and obtain fusion features;
and the text recognition module 60 is configured to recognize the text included in the text region according to the fusion feature.
The device for recognizing text in an image of this embodiment can effectively improve the accuracy of image correction through the joint constraint of the boundary points and the central-axis points.
Preferably, in this embodiment, the image correction module 40 includes a trained spatial transformation network, and can obtain coordinates of a corresponding correction point according to the coordinates of the reference point, and the image correction module 40 corrects the text region according to the coordinates of the reference point and the coordinates of the correction point, so as to obtain a corrected region.
Preferably, in this embodiment, the feature processing module 50 includes a trained lightweight detection network comprising at least a Stem layer and a plurality of transition layers. The corrected region is input into the lightweight detection network and downsampled through the Stem layer; the downsampled data then passes through each transition layer in turn, producing a set of feature outputs. The feature processing module 50 further includes a feature fusion unit 51, configured to perform pyramid feature fusion on the output results of the Stem layer and the transition layers to obtain the fusion features.
Preferably, the feature fusion unit 51 may select several convolution layers from the Stem layer and several transition layers from the transition layers, and perform pyramid feature fusion on their output results to obtain the fusion features. Screening the levels in this way selects the more useful ones and improves processing efficiency.
Further, the feature fusion unit 51 selects output results of a first convolution layer and three transition layers of the lightweight detection network to perform pyramid feature fusion, where the first convolution layer is selected from the first two convolution layers of the Stem layer;
sequentially passing the output result of the first convolution layer through the three transition layers, wherein the scales of input data of the three transition layers are respectively 1/2, 1/4 and 1/8 of the correction area;
taking the output result of the third transition layer as a fourth depth characteristic, up-sampling, and adding the up-sampling result with the output result of the second transition layer to obtain a third depth characteristic; upsampling the third depth feature and adding the upsampled third depth feature to the output result of the first transition layer to obtain a second depth feature; upsampling the second depth feature and adding the upsampled second depth feature to the output result of the first convolutional layer to obtain a first depth feature; and finally, fusing the depth features into the fusion features.
Optionally, in this embodiment, the text recognition module 60 includes a trained two-way long-short-term memory network for performing sequence prediction on the fusion features to obtain a recognition result sequence. The text recognition module 60 further includes a decoding unit 61, configured to decode and obtain the text included in the text region according to the recognition result sequence.
Further, the text recognition module 60 further includes an optimizing unit 62, and the optimizing unit 62 optimizes the recognition result sequence by using a loss function, and removes repeated characters and space characters in the recognition result sequence.
This embodiment further adopts the lightweight network and feature pyramid fusion, which greatly reduces the number of parameters in the computation, speeds up recognition, improves the robustness of text recognition, and makes the applicable recognition scenes richer.
Example 5
It should be noted that the method for recognizing text in an image according to the embodiments of the present application may be integrated into the electronic device 90 as a software module and/or a hardware module; in other words, the electronic device 90 may incorporate the method of the above embodiments. For example, the method may be implemented as a software module in the operating system of the electronic device 90, or as an application developed for it; of course, the method may also be built into one of the hardware modules of the electronic device 90.
In another embodiment of the present application, the carrier that integrates the method for recognizing text in an image and the electronic device 90 may also be separate devices (e.g., servers); the carrier may be connected to the electronic device 90 through a wired and/or wireless network and exchange interaction information in an agreed data format.
Fig. 6 is a schematic structural diagram of an electronic device 90 according to an embodiment of the application. As shown in fig. 6, the electronic device 90 includes: one or more processors 91 and memory 92; and computer program instructions stored in the memory 92, which when executed by the processor 91, cause the processor 91 to perform the method of text recognition in an image as in any of the embodiments described above.
Processor 91 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in electronic device 90 to perform desired functions.
Memory 92 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 91 to implement the steps of the method for recognizing text in an image of the various embodiments of the application described above and/or other desired functions. Information such as light intensity, compensation light intensity, and the position of the filter may also be stored in the computer-readable storage medium.
In one example, the electronic device 90 may further include: an input device 93 and an output device 94, which are interconnected by a bus system and/or other form of connection mechanism (not shown in fig. 6).
The output device 94 may output various information to the outside, and may include, for example, a display, a speaker, a printer, and a communication network and a remote output apparatus connected thereto, etc.
Of course, for simplicity, only some of the components of the electronic device 90 that are relevant to the present application are shown in fig. 6; components such as buses and input/output interfaces are omitted. In addition, the electronic device 90 may include any other suitable components depending on the particular application.
In addition to the methods and apparatus described above, embodiments of the application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps of the method for text recognition in an image as in any of the embodiments described above.
The computer program product may include program code for performing the operations of embodiments of the present application, written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium, having stored thereon computer program instructions, which when executed by a processor, cause the processor to perform the steps of the text-in-image recognition method according to the various embodiments of the present application described in the text-in-image recognition method section above in the present specification.
The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It should be noted that in the apparatus and device of the present application, the components may be disassembled and/or assembled. Such decomposition and/or recombination should be considered as equivalent aspects of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (8)

1. A method for identifying text in an image, comprising the steps of:
acquiring an image to be processed containing text;
locating text regions in the image;
marking a plurality of reference points of the text region, and acquiring coordinates of the reference points, wherein the reference points comprise a plurality of boundary points and a plurality of central axis points of the text region;
correcting the text region according to the coordinates of the reference points to obtain a corrected region;
extracting text features of the correction area, and carrying out pyramid feature fusion on the text features to obtain fusion features;
identifying characters contained in the text region according to the fusion characteristics;
correcting the text region according to the coordinates of the reference point, wherein the obtaining the corrected region comprises: inputting the coordinates of the reference points into a trained space transformation network, and obtaining the coordinates of the correction points corresponding to the reference points; correcting the text region according to the coordinates of the reference point and the coordinates of the correction point to obtain a correction region;
and identifying the characters contained in the text region according to the fusion characteristics comprises the following steps: performing sequence prediction on the fusion characteristics by adopting a trained two-way long-short-term memory network to obtain an identification result sequence; and decoding to obtain the characters contained in the text region according to the recognition result sequence.
2. The method for identifying text in an image according to claim 1, wherein the extracting text features of the correction area, and performing pyramid feature fusion on the text features, and obtaining fusion features include:
inputting the correction area into a trained lightweight detection network, and downsampling the correction area through a Stem layer;
sequentially passing the downsampled data through each transition layer of the lightweight detection network;
and carrying out pyramid feature fusion according to the output results of the Stem layer and each transition layer to obtain the fusion features.
3. The method for identifying text in an image according to claim 2, wherein said performing pyramid feature fusion according to the output results of said Stem layer and said respective transition layers, obtaining said fusion feature comprises:
and selecting a plurality of convolution layers from the Stem layers, selecting a plurality of transition layers from the transition layers, and performing pyramid feature fusion according to output results of the plurality of convolution layers and the plurality of transition layers to obtain the fusion features.
4. The method for recognizing text in an image according to claim 3, wherein selecting a plurality of convolution layers from the Stem layers, selecting a plurality of transition layers from the transition layers, and performing pyramid feature fusion according to output results of the plurality of convolution layers and the plurality of transition layers, wherein obtaining the fusion feature comprises:
the output results of a first convolution layer and three transition layers of the lightweight detection network are selected to carry out pyramid feature fusion, wherein the first convolution layer is selected from the first two convolution layers of the Stem layer;
sequentially passing the output result of the first convolution layer through the three transition layers, wherein the scales of input data of the three transition layers are respectively 1/2, 1/4 and 1/8 of the correction area;
taking the output result of the third transition layer as a fourth depth characteristic, up-sampling, and adding the up-sampling result with the output result of the second transition layer to obtain a third depth characteristic;
upsampling the third depth feature and adding the upsampled third depth feature to the output result of the first transition layer to obtain a second depth feature;
upsampling the second depth feature and adding the upsampled second depth feature to the output result of the first convolutional layer to obtain a first depth feature;
and fusing each depth feature into the fusion feature.
5. The method for recognizing text in an image according to claim 4, wherein after said sequence prediction is performed on said fusion feature using a trained two-way long-short-term memory network to obtain a recognition result sequence, before decoding to obtain text contained in said text region according to said recognition result sequence, further comprising the steps of:
optimizing the recognition result sequence by using a loss function, and removing repeated characters and interval characters in the recognition result sequence.
6. An apparatus for recognizing text in an image, comprising:
the data acquisition module is used for acquiring an image to be processed containing text;
the region positioning module is used for positioning a text region in the image;
a reference point obtaining module, configured to mark a plurality of reference points of the text region, and obtain coordinates of the reference points, where the reference points include a plurality of boundary points and a plurality of central axis points of the text region;
the image correction module is used for correcting the text region according to the coordinates of the reference point to obtain a correction region; the image correction module comprises a trained space transformation network, acquires coordinates of corresponding correction points according to the coordinates of the reference points, and corrects the text region according to the coordinates of the reference points and the coordinates of the correction points to obtain a correction region;
the feature processing module is used for extracting text features of the correction area, and carrying out pyramid feature fusion on the text features to obtain fusion features;
the text recognition module is used for recognizing the text contained in the text region according to the fusion characteristics;
the character recognition module adopts a trained two-way long-short-term memory network to conduct sequence prediction on the fusion characteristics to obtain a recognition result sequence; and decoding to obtain the characters contained in the text region according to the recognition result sequence.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for identifying text in an image as claimed in any one of claims 1 to 5 when the program is executed by the processor.
8. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method for recognition of text in an image according to any one of claims 1 to 5.
CN201911374226.7A 2019-12-24 2019-12-24 Method and device for identifying text in image and electronic equipment Active CN113033531B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911374226.7A CN113033531B (en) 2019-12-24 2019-12-24 Method and device for identifying text in image and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911374226.7A CN113033531B (en) 2019-12-24 2019-12-24 Method and device for identifying text in image and electronic equipment

Publications (2)

Publication Number Publication Date
CN113033531A CN113033531A (en) 2021-06-25
CN113033531B true CN113033531B (en) 2023-10-27

Family

ID=76458870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911374226.7A Active CN113033531B (en) 2019-12-24 2019-12-24 Method and device for identifying text in image and electronic equipment

Country Status (1)

Country Link
CN (1) CN113033531B (en)

Citations (5)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805131A (en) * 2018-05-22 2018-11-13 北京旷视科技有限公司 Text line detection method, apparatus and system
CN109829437A (en) * 2019-02-01 2019-05-31 北京旷视科技有限公司 Image processing method, text recognition method, device and electronic system
CN110287960A (en) * 2019-07-02 2019-09-27 中国科学院信息工程研究所 The detection recognition method of curve text in natural scene image
CN110427938A (en) * 2019-07-26 2019-11-08 中科视语(北京)科技有限公司 A kind of irregular character recognition device and method based on deep learning
CN110598673A (en) * 2019-09-24 2019-12-20 电子科技大学 Remote sensing image road extraction method based on residual error network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XUN GAO et al., "An End-to-End Neural Network for Road Extraction From Remote Sensing Imagery by Multiple Feature Pyramid Network", IEEE, 2018-08-07, pp. 39401-39414 *
Xi Liu et al., "Scene Text Detection with Feature Pyramid Network and Linking Segments", IEEE, 2019-09-25, full text *

Also Published As

Publication number Publication date
CN113033531A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
US10699111B2 (en) Page segmentation of vector graphics documents
US20190180154A1 (en) Text recognition using artificial intelligence
WO2021164534A1 (en) Image processing method and apparatus, device, and storage medium
CN109522900B (en) Natural scene character recognition method and device
CN106599940B (en) Picture character recognition method and device
CN110084172B (en) Character recognition method and device and electronic equipment
CN110555433A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN112434690A (en) Method, system and storage medium for automatically capturing and understanding elements of dynamically analyzing text image characteristic phenomena
CN116030454B (en) Text recognition method and system based on capsule network and multi-language model
CN112200193B (en) Distributed license plate recognition method, system and device based on multi-attribute fusion
US20210056429A1 (en) Apparatus and methods for converting lineless tables into lined tables using generative adversarial networks
CN112597918A (en) Text detection method and device, electronic equipment and storage medium
CN112800955A (en) Remote sensing image rotating target detection method and system based on weighted bidirectional feature pyramid
CN114429636B (en) Image scanning identification method and device and electronic equipment
CN116645592A (en) Crack detection method based on image processing and storage medium
CN113158856B (en) Processing method and device for extracting target area in remote sensing image
CN110991303A (en) Method and device for positioning text in image and electronic equipment
CN113033531B (en) Method and device for identifying text in image and electronic equipment
CN113537187A (en) Text recognition method and device, electronic equipment and readable storage medium
CN112529914A (en) Real-time hair segmentation method and system
CN116311322A (en) Document layout element detection method, device, storage medium and equipment
CN109101973A (en) Character recognition method, electronic equipment, storage medium
CN115205624A (en) Cross-dimension attention-convergence cloud and snow identification method and equipment and storage medium
CN115346221A (en) Deep learning-based mathematical formula recognition and automatic correction method for pupils
CN113221718A (en) Formula identification method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant