CN111242125A - Natural scene image text detection method, storage medium and terminal device


Info

Publication number
CN111242125A
CN111242125A (application number CN202010040806.9A)
Authority
CN
China
Prior art keywords
feature map
map
stage
text
scene image
Prior art date
Legal status
Granted
Application number
CN202010040806.9A
Other languages
Chinese (zh)
Other versions
CN111242125B (en)
Inventor
张勇
黄裕倍
赵东宁
廉德亮
谢维信
Current Assignee
Shenzhen University
Original Assignee
Shenzhen University
Priority date
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202010040806.9A priority Critical patent/CN111242125B/en
Publication of CN111242125A publication Critical patent/CN111242125A/en
Application granted granted Critical
Publication of CN111242125B publication Critical patent/CN111242125B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks


Abstract

The invention discloses a natural scene image text detection method, a storage medium and a terminal device, wherein the method comprises the following steps: performing feature extraction processing on an image to be detected by adopting a deep convolutional neural network model to obtain basic feature maps of four stages; performing fusion processing on the basic feature maps of the four stages to obtain depth fusion feature maps of three stages; performing an aggregation operation on the depth fusion feature maps of the three stages by adopting an improved Inception module to obtain a text region probability prediction feature map and a text region position prediction feature map; and performing algorithm processing on the text region probability prediction feature map and the text region position prediction feature map to obtain the position of the text in the natural scene image. The method fuses multi-stage feature maps, so that image feature information from the early stages of the deep convolutional neural network is used in the final feature map aggregation, which can effectively improve the accuracy of natural scene image text detection.

Description

Natural scene image text detection method, storage medium and terminal device
Technical Field
The invention relates to the field of target detection, in particular to a natural scene image text detection method, a storage medium and a terminal device.
Background
Existing natural scene image text detection methods fall into two main categories. The first extracts candidate text regions based on manually designed features, and includes sliding-window methods and methods based on connected regions of image pixels. Its disadvantage is that manually designed features have great limitations in accuracy and completeness. The second is based on deep convolutional neural networks: a trained deep convolutional neural network and a softmax layer perform probability prediction and position prediction of text regions. Its disadvantage is that only the feature map of the last stage is used for prediction, so image feature information from the early stages of the deep convolutional neural network cannot be fully utilized, and the detection accuracy is low.
Accordingly, the prior art still awaits improvement and development.
Disclosure of Invention
In view of the foregoing deficiencies of the prior art, an object of the present invention is to provide a natural scene image text detection method, a storage medium, and a terminal device, which aim to solve the problem of low detection accuracy of the conventional natural scene image text detection method.
The technical scheme of the invention is as follows:
a natural scene image text detection method comprises the following steps:
scaling the original natural scene image to obtain an image to be detected;
performing feature extraction processing on the image to be detected by adopting a deep convolutional neural network model to obtain basic feature maps of four stages;
performing fusion processing on the basic feature maps of the four stages to obtain depth fusion feature maps of three stages;
performing an aggregation operation on the depth fusion feature maps of the three stages by adopting an improved Inception module to obtain a text region probability prediction feature map and a text region position prediction feature map;
and performing algorithm processing on the text region probability prediction feature map and the text region position prediction feature map to obtain the position of the text in the natural scene image.
In the natural scene image text detection method, the step of performing feature extraction processing on the image to be detected by adopting a deep convolutional neural network model to obtain basic feature maps of four stages comprises:
performing feature extraction processing on the image to be detected by adopting the deep convolutional neural network model ResNet50 to obtain a first-stage basic feature map, a second-stage basic feature map, a third-stage basic feature map and a fourth-stage basic feature map.
In the natural scene image text detection method, the step of performing fusion processing on the basic feature maps of the four stages to obtain depth fusion feature maps of three stages comprises:
performing an unpooling operation on the fourth-stage basic feature map, performing feature map splicing processing on the third-stage basic feature map, passed through a skip connection, and the unpooled fourth-stage basic feature map to obtain a first spliced map, and performing first convolution processing on the first spliced map to obtain a first-stage depth fusion feature map;
performing an unpooling operation on the first-stage depth fusion feature map, performing feature map splicing processing on the second-stage basic feature map, passed through a skip connection, and the unpooled first-stage depth fusion feature map to obtain a second spliced map, and performing first convolution processing on the second spliced map to obtain a second-stage depth fusion feature map;
and performing an unpooling operation on the second-stage depth fusion feature map, performing feature map splicing processing on the first-stage basic feature map, passed through a skip connection, and the unpooled second-stage depth fusion feature map to obtain a third spliced map, and performing first convolution processing on the third spliced map to obtain a third-stage depth fusion feature map.
In the natural scene image text detection method, the convolution kernel size of the first convolution processing is 3 × 3.
In the natural scene image text detection method, the step of performing an aggregation operation on the depth fusion feature maps of the three stages by adopting an improved Inception module to obtain a text region probability prediction feature map and a text region position prediction feature map comprises:
performing unpooling on the first-stage depth fusion feature map and the second-stage depth fusion feature map so that the length and width of the first-stage, second-stage and third-stage depth fusion feature maps are the same;
performing feature map splicing processing on the third-stage depth fusion feature map, the unpooled first-stage depth fusion feature map and the unpooled second-stage depth fusion feature map to obtain a fourth spliced map;
and performing second convolution processing on the fourth spliced map by adopting the improved Inception module to obtain the text region probability prediction feature map and the text region position prediction feature map.
In the natural scene image text detection method, the improved Inception module comprises an input layer, a convolution layer and an output layer, wherein the convolution layer comprises a first convolution unit with a 1 × 1 convolution kernel, a second convolution unit with 3 × 1 and 1 × 3 convolution kernels, a third convolution unit with 5 × 1 and 1 × 5 convolution kernels, and a fourth unit consisting of max pooling followed by a 1 × 1 convolution.
In the natural scene image text detection method, the step of performing algorithm processing on the text region probability prediction feature map and the text region position prediction feature map to obtain the position of the text in the natural scene image comprises:
obtaining preliminary text regions according to the text region position prediction feature map;
and calculating and screening the preliminary text regions by combining the text region probability prediction feature map with a non-maximum suppression algorithm, and outputting the position of the text in the natural scene image.
In the natural scene image text detection method, the original natural scene image is scaled to a size of 512 × 512 to obtain the image to be detected.
A computer readable storage medium, wherein the computer readable storage medium stores one or more programs, which are executable by one or more processors to implement the steps in the natural scene image text detection method as described in any one of the above.
A terminal device comprising a processor, a memory and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps in the natural scene image text detection method as described in any one of the above.
Advantageous effects: the method first utilizes a deep convolutional neural network model to extract features of the natural scene image to obtain basic feature maps of four stages, thereby overcoming the limitations of manually designed basic features in accuracy and completeness; it then performs fusion processing on the basic feature maps of the four stages to obtain depth fusion feature maps of three stages, and adopts an improved Inception module to perform an aggregation operation on the depth fusion feature maps of the three stages to obtain a text region probability prediction feature map and a text region position prediction feature map; by fusing multi-stage feature maps, image feature information from the early stages of the deep convolutional neural network is used in the final feature map aggregation, which can effectively improve the accuracy of natural scene image text detection.
Drawings
Fig. 1 is a flowchart of a method for detecting text in images of natural scenes according to a preferred embodiment of the present invention.
FIG. 2 is a schematic structural diagram of a deep convolutional neural network model according to the present invention.
FIG. 3 is a flowchart of step S30 of FIG. 1 according to the present invention.
FIG. 4 is a flowchart of step S40 of FIG. 1 according to the present invention.
Fig. 5 is a schematic structural diagram of a conventional Inception module.
Fig. 6 is a schematic structural diagram of the improved Inception module of the present invention.
Fig. 7 is a block diagram of a terminal device according to a preferred embodiment of the present invention.
Detailed Description
The invention provides a natural scene image text detection method, a storage medium and a terminal device, and the invention is further described in detail below in order to make the purpose, technical scheme and effect of the invention clearer and clearer. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The inventor finds that, in the prior art, a trained deep convolutional neural network and a softmax layer are generally adopted to predict the probability and position of text regions in a natural scene image. The drawback of this approach is that only the feature map of the last stage is used for prediction, so image feature information from the early stages of the deep convolutional neural network cannot be fully utilized, and the detection accuracy is low.
Based on the problems in the prior art, an embodiment of the present invention provides a natural scene image text detection method; as shown in the flowchart of the preferred embodiment in fig. 1, the method includes the steps of:
s10, zooming the original natural scene image to obtain an image to be detected;
s20, performing feature extraction processing on the image to be detected by adopting a deep convolutional neural network model to obtain basic feature maps of four stages;
s30, carrying out fusion processing on the basic feature maps of the four stages to obtain depth fusion feature maps of the three stages;
s40, adopting an improved inclusion module to perform aggregation operation on the depth fusion feature maps of the three stages to obtain a text region probability prediction feature map and a text region position prediction feature map;
and S50, carrying out algorithm processing on the text region probability prediction characteristic graph and the text region position prediction characteristic graph to obtain the position of the text in the natural scene image.
In this embodiment, a deep convolutional neural network model is first used to extract features from the natural scene image to obtain basic feature maps of four stages, thereby overcoming the limitations of manually designed basic features in accuracy and completeness; fusion processing is then performed on the basic feature maps of the four stages to obtain depth fusion feature maps of three stages, and an improved Inception module is adopted to perform an aggregation operation on the depth fusion feature maps of the three stages to obtain a text region probability prediction feature map and a text region position prediction feature map; by fusing multi-stage feature maps, image feature information from the early stages of the deep convolutional neural network is used in the final feature map aggregation, which can effectively improve the accuracy of natural scene image text detection.
In this embodiment, the natural scene image text detection method belongs to the field of target detection, an important research direction in computer vision, and is mainly used for locating text regions in a natural scene image and outputting their positions. As a key preliminary step, the method provided by this embodiment can be widely applied in text recognition, information retrieval, image annotation and other fields.
In some embodiments, in order to facilitate the subsequent feature extraction processing, the original natural scene image needs to be scaled to obtain an image to be detected of fixed size. As an example, the original natural scene image is scaled to a size of 512 × 512 to obtain the image to be detected.
In some embodiments, as shown in fig. 2, the deep convolutional neural network model ResNet50 is used to perform feature extraction on the image to be detected, so as to obtain a first-stage basic feature map, a second-stage basic feature map, a third-stage basic feature map and a fourth-stage basic feature map. In this embodiment, ResNet (Residual Network) 50 is a 50-layer deep convolutional network that is easier to optimize and whose accuracy can be improved by increasing its depth; its core idea is to overcome the side effect (the degradation problem) caused by increasing depth, so that network performance can be improved by simply making the network deeper. A deeper network can compute richer features and thus obtain better results; its drawbacks are that the number of parameters to be trained is very large, requiring substantial computing resources, and that as the network deepens the magnitude (norm) of the gradient drops sharply, known as the vanishing gradient problem, which makes learning very slow. In rare cases the gradient instead grows sharply, the exploding gradient phenomenon, and the accuracy on the training set drops rather than improves compared with a shallower network. The residual network in this embodiment was proposed to solve the vanishing gradient phenomenon in deepened networks. In this embodiment, using ResNet50 to perform feature extraction processing on the image to be detected allows the first-stage, second-stage, third-stage and fourth-stage basic feature maps to be obtained quickly and accurately.
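By way of illustration only and not as part of the claimed method, the following minimal sketch shows how the four stages of basic feature maps might be taken from a ResNet50 backbone; it assumes PyTorch and torchvision, whose layer names (layer1 to layer4) stand in for the four stages.

```python
# Sketch only: four-stage feature extraction from a ResNet50 backbone.
# Assumes torchvision; layer1..layer4 are taken as the four "stages".
import torch
import torchvision

backbone = torchvision.models.resnet50(weights=None)

def extract_stages(image: torch.Tensor):
    """image: (N, 3, 512, 512) tensor, scaled as in step S10."""
    x = backbone.conv1(image)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    f1 = backbone.layer1(x)   # first-stage basic feature map  (256 ch, 1/4 size)
    f2 = backbone.layer2(f1)  # second-stage basic feature map (512 ch, 1/8)
    f3 = backbone.layer3(f2)  # third-stage basic feature map  (1024 ch, 1/16)
    f4 = backbone.layer4(f3)  # fourth-stage basic feature map (2048 ch, 1/32)
    return f1, f2, f3, f4

f1, f2, f3, f4 = extract_stages(torch.randn(1, 3, 512, 512))
```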
In some embodiments, as shown in fig. 2 and fig. 3, the step of performing fusion processing on the basic feature maps of the four stages to obtain depth fusion feature maps of three stages includes:
s31, performing anti-pooling operation on the fourth-stage basic feature map, performing feature map splicing processing on the third-stage basic feature map and the fourth-stage basic feature map subjected to anti-pooling operation through jump connection to obtain a first spliced map, and performing first convolution processing on the first spliced map to obtain a first-stage deep fusion feature map;
s32, performing anti-pooling operation on the first-stage depth fusion feature map, performing feature map splicing processing on the second-stage basic feature map and the anti-pooled first-stage depth fusion feature map through jump connection to obtain a second spliced map, and performing first convolution processing on the second spliced map to obtain a second-stage depth fusion feature map;
and S33, performing anti-pooling operation on the second-stage depth fusion feature map, performing feature map splicing processing on the first-stage basic feature map and the second-stage depth fusion feature map subjected to anti-pooling operation through jump switching to obtain a third spliced map, and performing first convolution processing on the third spliced map to obtain a third-stage depth fusion feature map.
In this embodiment, the basic feature maps of the four stages are fused by unpooling, skip connections, feature map splicing and the first convolution processing, so as to obtain depth fusion feature maps of three stages. To perform the feature fusion operation, the fourth-stage basic feature map, the first-stage depth fusion feature map and the second-stage depth fusion feature map must first be unpooled so that each has the same length and width as the feature map of the preceding stage. The skip connections directly pass the basic feature maps of the first to third stages into the feature map splicing operations of step S30, where feature maps with the same length and width are spliced together directly. Finally, the first convolution processing is applied to each of the first, second and third spliced maps to complete the feature fusion and obtain the depth fusion feature maps of the three stages.
In some embodiments, the convolution kernel size of the first convolution processing is 3 × 3. That is, a convolution with a 3 × 3 kernel is applied to each feature map obtained by the splicing operation to complete the feature fusion and obtain the depth fusion feature maps of the three stages.
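By way of illustration, a minimal sketch of the three fusion steps under stated assumptions is given below; nearest-neighbour upsampling stands in for the unpooling operation, and the channel widths are invented for the example rather than taken from this disclosure.

```python
# Sketch of fusion steps S31-S33: unpool the deeper map, splice it with the
# skip-connected earlier map, then apply the first (3 x 3) convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseStep(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # "first convolution processing" with a 3 x 3 kernel
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, skip: torch.Tensor, deeper: torch.Tensor):
        up = F.interpolate(deeper, size=skip.shape[-2:], mode="nearest")  # unpooling stand-in
        spliced = torch.cat([skip, up], dim=1)  # feature map splicing
        return self.conv(spliced)

# Channel widths are assumptions; f1..f4 as from a ResNet50 backbone.
fuse1 = FuseStep(1024 + 2048, 128)  # S31: f3 + unpooled f4 -> d1
fuse2 = FuseStep(512 + 128, 64)     # S32: f2 + unpooled d1 -> d2
fuse3 = FuseStep(256 + 64, 32)      # S33: f1 + unpooled d2 -> d3

f1 = torch.randn(1, 256, 128, 128)
f2 = torch.randn(1, 512, 64, 64)
f3 = torch.randn(1, 1024, 32, 32)
f4 = torch.randn(1, 2048, 16, 16)
d1 = fuse1(f3, f4)  # first-stage depth fusion feature map
d2 = fuse2(f2, d1)  # second-stage depth fusion feature map
d3 = fuse3(f1, d2)  # third-stage depth fusion feature map
```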
In some embodiments, as shown in fig. 2 and 4, the step of performing an aggregation operation on the depth fusion feature maps of the three stages by adopting an improved Inception module to obtain a text region probability prediction feature map and a text region position prediction feature map includes:
S41, performing unpooling on the first-stage depth fusion feature map and the second-stage depth fusion feature map, so that the first-stage, second-stage and third-stage depth fusion feature maps have the same length and width;
S42, performing feature map splicing processing on the third-stage depth fusion feature map, the unpooled first-stage depth fusion feature map and the unpooled second-stage depth fusion feature map to obtain a fourth spliced map;
and S43, performing second convolution processing on the fourth spliced map by adopting the improved Inception module to obtain the text region probability prediction feature map and the text region position prediction feature map.
In this embodiment, an Inception module improved for the text features of natural scene images is used: the depth fusion feature maps of the first and second stages are unpooled to obtain three depth fusion feature maps with the same length and width; the three maps are spliced together; and finally the Inception convolution operation is performed on the resulting fourth spliced map to obtain the text region probability prediction feature map and the text region position prediction feature map used for prediction. Because the depth fusion feature maps of the three stages carry different features, and in view of the physical characteristics of text in natural scene images, this embodiment improves the conventional Inception module shown in fig. 5 by modifying its convolution kernel sizes, yielding the improved Inception module shown in fig. 6, so that text features in natural scene images can be extracted more accurately and the text detection accuracy of natural scene images is improved.
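A minimal sketch of this aggregation under the same assumptions as above (upsampling in place of unpooling) follows; the names inception, score_head and geo_head are hypothetical and correspond to the module sketched after the next paragraph.

```python
# Sketch of aggregation steps S41-S43 (an illustration, not the patent's code).
import torch
import torch.nn.functional as F

def aggregate(d1, d2, d3, inception, score_head, geo_head):
    u1 = F.interpolate(d1, size=d3.shape[-2:], mode="nearest")  # unpool d1 to d3's size
    u2 = F.interpolate(d2, size=d3.shape[-2:], mode="nearest")  # unpool d2 to d3's size
    spliced = torch.cat([d3, u1, u2], dim=1)     # fourth spliced map
    feats = inception(spliced)                   # second convolution processing
    prob_map = torch.sigmoid(score_head(feats))  # text region probability prediction
    geo_map = geo_head(feats)                    # text region position prediction
    return prob_map, geo_map
```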
As shown in fig. 5 and 6, this embodiment splits the 3 × 3 convolution unit of the conventional Inception module into a second convolution unit composed of 3 × 1 and 1 × 3 convolution kernels, splits the 5 × 5 convolution unit of the conventional Inception module into a third convolution unit composed of 5 × 1 and 1 × 5 convolution kernels, and adds a 1 × 1 convolution unit after the max pooling unit of the conventional Inception module. That is, the improved Inception module in this embodiment comprises an input layer, a convolution layer and an output layer, where the convolution layer comprises a first convolution unit with a 1 × 1 convolution kernel, a second convolution unit with 3 × 1 and 1 × 3 convolution kernels, a third convolution unit with 5 × 1 and 1 × 5 convolution kernels, and a fourth unit consisting of max pooling followed by a 1 × 1 convolution. In this embodiment, adopting the improved Inception module to aggregate the depth fusion feature maps of the three stages improves computational efficiency, deepens the network while increasing its nonlinearity, and yields accurate text region probability and position prediction feature maps.
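The module just described might be sketched as follows; the branch widths, the paddings and the two 1 × 1 prediction heads are assumptions made for the example, including the number of position channels.

```python
# Sketch of the improved Inception module: a 1x1 branch, a 3x1/1x3 branch,
# a 5x1/1x5 branch, and max pooling followed by a 1x1 convolution.
import torch
import torch.nn as nn

class ImprovedInception(nn.Module):
    def __init__(self, in_ch: int, branch_ch: int):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.b2 = nn.Sequential(  # 3x3 split into 3x1 and 1x3
            nn.Conv2d(in_ch, branch_ch, (3, 1), padding=(1, 0)),
            nn.Conv2d(branch_ch, branch_ch, (1, 3), padding=(0, 1)),
        )
        self.b3 = nn.Sequential(  # 5x5 split into 5x1 and 1x5
            nn.Conv2d(in_ch, branch_ch, (5, 1), padding=(2, 0)),
            nn.Conv2d(branch_ch, branch_ch, (1, 5), padding=(0, 2)),
        )
        self.b4 = nn.Sequential(  # max pooling followed by a 1x1 convolution
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
        )

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

inception = ImprovedInception(128 + 64 + 32, 32)  # fed with the fourth spliced map
score_head = nn.Conv2d(4 * 32, 1, kernel_size=1)  # probability prediction head
geo_head = nn.Conv2d(4 * 32, 5, kernel_size=1)    # position head (channel count assumed)
```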
In some embodiments, the preliminary text regions obtained from the text region position prediction feature map may overlap and be redundant, so they are calculated and screened by combining the text region probability prediction feature map with a non-maximum suppression algorithm before the final text positions are output. A specific implementation is as follows: (1) preliminarily obtain text regions from the text region position prediction feature map, and obtain the prediction probability of each text region from the text region probability prediction feature map; (2) using 0.9 as the prediction probability threshold, eliminate text regions below the threshold; (3) among the remaining text regions, take the region with the highest prediction probability as the reference, compute the area intersection ratio of each other region with the reference, and delete text regions whose intersection ratio exceeds 0.2; (4) output the position of the text in the natural scene image according to the length, width and center coordinates of the text regions.
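By way of illustration, a minimal NumPy sketch of steps (2) and (3) follows; it assumes axis-aligned boxes (x1, y1, x2, y2), whereas this disclosure describes regions by length, width and centre coordinates, so it is a simplification rather than the exact procedure.

```python
# Sketch: threshold at 0.9, then suppress regions whose area intersection
# ratio (IoU) with the highest-probability reference exceeds 0.2.
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """a, b: boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def screen_regions(boxes: np.ndarray, probs: np.ndarray) -> np.ndarray:
    keep = probs >= 0.9                   # step (2): probability threshold
    boxes, probs = boxes[keep], probs[keep]
    order = list(np.argsort(-probs))      # highest probability first
    kept = []
    while order:
        ref = order.pop(0)                # step (3): reference region
        kept.append(ref)
        order = [i for i in order if iou(boxes[ref], boxes[i]) <= 0.2]
    return boxes[kept]
```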
Based on the above natural scene image text detection method, the present embodiment provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the steps in the natural scene image text detection method according to the above embodiment.
Based on the above natural scene image text detection method, the present invention further provides a terminal device, as shown in fig. 7, which includes at least one processor 20, a display screen 21 and a memory 22, and may further include a communication interface 23 and a bus 24. The processor 20, the display screen 21, the memory 22 and the communication interface 23 can communicate with one another through the bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. The processor 20 may call logic instructions in the memory 22 to perform the methods in the embodiments described above.
Furthermore, the logic instructions in the memory 22 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product.
The memory 22, which is a computer-readable storage medium, may be configured to store a software program, a computer-executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 executes the functional application and data processing, i.e. implements the method in the above-described embodiments, by executing the software program, instructions or modules stored in the memory 22.
The memory 22 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal device, and the like. Further, the memory 22 may include high-speed random access memory and may also include non-volatile memory, for example various media that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk; it may also be a transient storage medium.
In addition, the specific processes loaded and executed by the storage medium and by the instruction processors in the terminal device are described in detail in the method above and are not repeated herein.
In summary, the invention first utilizes a deep convolutional neural network model to extract features of the natural scene image to obtain basic feature maps of four stages, thereby overcoming the limitations of manually designed basic features in accuracy and completeness; it then performs fusion processing on the basic feature maps of the four stages to obtain depth fusion feature maps of three stages, and adopts an improved Inception module to perform an aggregation operation on the depth fusion feature maps of the three stages to obtain a text region probability prediction feature map and a text region position prediction feature map; by fusing multi-stage feature maps, image feature information from the early stages of the deep convolutional neural network is used in the final feature map aggregation, which can effectively improve the accuracy of natural scene image text detection.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims (10)

1. A natural scene image text detection method is characterized by comprising the following steps:
scaling the original natural scene image to obtain an image to be detected;
performing feature extraction processing on the image to be detected by adopting a deep convolutional neural network model to obtain basic feature maps of four stages;
performing fusion processing on the basic feature maps of the four stages to obtain depth fusion feature maps of three stages;
performing an aggregation operation on the depth fusion feature maps of the three stages by adopting an improved Inception module to obtain a text region probability prediction feature map and a text region position prediction feature map;
and performing algorithm processing on the text region probability prediction feature map and the text region position prediction feature map to obtain the position of the text in the natural scene image.
2. The natural scene image text detection method according to claim 1, wherein the step of performing feature extraction processing on the image to be detected by using the deep convolutional neural network model to obtain four stages of basic feature maps comprises:
and performing feature extraction processing on the image to be detected by adopting a deep convolutional neural network model ResNet50 to obtain a first-stage basic feature map, a second-stage basic feature map, a third-stage basic feature map and a fourth-stage basic feature map.
3. The natural scene image text detection method according to claim 2, wherein the step of performing fusion processing on the four stages of basic feature maps to obtain three stages of depth fusion feature maps comprises:
performing an unpooling operation on the fourth-stage basic feature map, performing feature map splicing processing on the third-stage basic feature map, passed through a skip connection, and the unpooled fourth-stage basic feature map to obtain a first spliced map, and performing first convolution processing on the first spliced map to obtain a first-stage depth fusion feature map;
performing an unpooling operation on the first-stage depth fusion feature map, performing feature map splicing processing on the second-stage basic feature map, passed through a skip connection, and the unpooled first-stage depth fusion feature map to obtain a second spliced map, and performing first convolution processing on the second spliced map to obtain a second-stage depth fusion feature map;
and performing an unpooling operation on the second-stage depth fusion feature map, performing feature map splicing processing on the first-stage basic feature map, passed through a skip connection, and the unpooled second-stage depth fusion feature map to obtain a third spliced map, and performing first convolution processing on the third spliced map to obtain a third-stage depth fusion feature map.
4. The natural scene image text detection method according to claim 3, wherein the convolution kernel size of the first convolution processing is 3 × 3.
5. The natural scene image text detection method according to claim 3, wherein the step of performing an aggregation operation on the depth fusion feature maps of the three stages by adopting an improved Inception module to obtain a text region probability prediction feature map and a text region position prediction feature map comprises:
performing unpooling on the first-stage depth fusion feature map and the second-stage depth fusion feature map so that the length and width of the first-stage, second-stage and third-stage depth fusion feature maps are the same;
performing feature map splicing processing on the third-stage depth fusion feature map, the unpooled first-stage depth fusion feature map and the unpooled second-stage depth fusion feature map to obtain a fourth spliced map;
and performing second convolution processing on the fourth spliced map by adopting the improved Inception module to obtain the text region probability prediction feature map and the text region position prediction feature map.
6. The natural scene image text detection method according to claim 5, wherein the improved Inception module comprises an input layer, a convolution layer and an output layer, the convolution layer comprising a first convolution unit with a 1 × 1 convolution kernel, a second convolution unit with 3 × 1 and 1 × 3 convolution kernels, a third convolution unit with 5 × 1 and 1 × 5 convolution kernels, and a fourth unit consisting of max pooling followed by a 1 × 1 convolution.
7. The natural scene image text detection method according to claim 1, wherein the step of performing algorithm processing on the text region probability prediction feature map and the text region position prediction feature map to obtain the position of the text in the natural scene image comprises:
obtaining preliminary text regions according to the text region position prediction feature map;
and calculating and screening the preliminary text regions by combining the text region probability prediction feature map with a non-maximum suppression algorithm, and outputting the position of the text in the natural scene image.
8. The natural scene image text detection method according to claim 1, wherein the original natural scene image is scaled to a size of 512 × 512 to obtain the image to be detected.
9. A computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps in the natural scene image text detection method as claimed in any one of claims 1 to 8.
10. A terminal device, comprising: a processor, a memory, and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes connection communication between the processor and the memory;
the processor, when executing the computer readable program, performs the steps in the natural scene image text detection method as claimed in any one of claims 1-8.
CN202010040806.9A 2020-01-14 2020-01-14 Natural scene image text detection method, storage medium and terminal equipment Active CN111242125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010040806.9A CN111242125B (en) 2020-01-14 2020-01-14 Natural scene image text detection method, storage medium and terminal equipment


Publications (2)

Publication Number Publication Date
CN111242125A (en) 2020-06-05
CN111242125B (en) 2023-05-02

Family

ID=70872650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010040806.9A Active CN111242125B (en) 2020-01-14 2020-01-14 Natural scene image text detection method, storage medium and terminal equipment

Country Status (1)

Country Link
CN (1) CN111242125B (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190080456A1 (en) * 2017-09-12 2019-03-14 Shenzhen Keya Medical Technology Corporation Method and system for performing segmentation of image having a sparsely distributed object
CN109299274A (en) * 2018-11-07 2019-02-01 南京大学 A kind of natural scene Method for text detection based on full convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Dong Yehao; Ke Zongwu; Xiong Xuhui: "Application of Convolutional Neural Networks in Image Processing" *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814794A (en) * 2020-09-15 2020-10-23 北京易真学思教育科技有限公司 Text detection method and device, electronic equipment and storage medium
CN112257708A (en) * 2020-10-22 2021-01-22 润联软件系统(深圳)有限公司 Character-level text detection method and device, computer equipment and storage medium
CN112132164A (en) * 2020-11-20 2020-12-25 北京易真学思教育科技有限公司 Target detection method, system, computer device and storage medium
CN112132164B (en) * 2020-11-20 2021-03-09 北京易真学思教育科技有限公司 Target detection method, system, computer device and storage medium
CN113011556A (en) * 2021-02-20 2021-06-22 安徽大学 Method for establishing network identification model based on INC-DenseUnet
CN113011556B (en) * 2021-02-20 2022-10-11 安徽大学 Method for establishing network identification model based on INC-DenseUnet
CN114048340A (en) * 2021-11-15 2022-02-15 电子科技大学 Hierarchical fusion combined query image retrieval method
CN114048340B (en) * 2021-11-15 2023-04-21 电子科技大学 Hierarchical fusion combined query image retrieval method

Also Published As

Publication number Publication date
CN111242125B (en) 2023-05-02

Similar Documents

Publication Publication Date Title
CN111242125A (en) Natural scene image text detection method, storage medium and terminal device
CN108664981B (en) Salient image extraction method and device
CN110738207B (en) Character detection method for fusing character area edge information in character image
WO2020238560A1 (en) Video target tracking method and apparatus, computer device and storage medium
CN111104962B (en) Semantic segmentation method and device for image, electronic equipment and readable storage medium
US20210081695A1 (en) Image processing method, apparatus, electronic device and computer readable storage medium
KR102165273B1 (en) Method and system for channel pruning of compact neural networks
CN110163057B (en) Object detection method, device, equipment and computer readable medium
CN111275034B (en) Method, device, equipment and storage medium for extracting text region from image
CN112767418B (en) Mirror image segmentation method based on depth perception
CN113468996B (en) Camouflage object detection method based on edge refinement
WO2021098300A1 (en) Facial parsing method and related devices
CN111626960A (en) Image defogging method, terminal and computer storage medium
CN113850829A (en) Video shot segmentation method and device based on efficient deep network and related components
CN113850135A (en) Dynamic gesture recognition method and system based on time shift frame
CN113850238A (en) Document detection method and device, electronic equipment and storage medium
CN113780297A (en) Image processing method, device, equipment and storage medium
CN113962332B (en) Salient target identification method based on self-optimizing fusion feedback
CN113610856B (en) Method and device for training image segmentation model and image segmentation
CN113052156B (en) Optical character recognition method, device, electronic equipment and storage medium
CN115471718A (en) Construction and detection method of lightweight significance target detection model based on multi-scale learning
CN112487943B (en) Key frame de-duplication method and device and electronic equipment
CN113033372A (en) Vehicle damage assessment method and device, electronic equipment and computer readable storage medium
CN112084874A (en) Object detection method and device and terminal equipment
US20240177466A1 (en) Method performed by electronic apparatus, electronic apparatus and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant