CN113888505B - Natural scene text detection method based on semantic segmentation - Google Patents
Info
- Publication number
- CN113888505B (application CN202111157377.4A)
- Authority
- CN
- China
- Prior art keywords
- feature
- network
- size
- output
- map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06T7/0002 — Inspection of images, e.g. flaw detection
- G06F18/253 — Fusion techniques of extracted features
- G06N3/045 — Combinations of networks
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
- G06T7/11 — Region-based segmentation
- G06T7/13 — Edge detection
- G06T2207/20081 — Training; Learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/20221 — Image fusion; Image merging
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention applies deep learning to computer vision and provides a natural scene text detection method based on semantic segmentation. The method first constructs a feature extraction network, then screens effective information with a feature selection module, fuses the screened multi-scale feature information through a feature pyramid network, and finally obtains, through an edge enhancement network and a semantic segmentation network, a semantic segmentation result in which the edges of text regions are markedly strengthened, from which the boundary coordinates of the text regions are derived. The invention realizes a fast, lightweight text detection model that detects text regions with varied, complex shapes and backgrounds both quickly and accurately.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, relates to deep learning and computer vision, and particularly relates to a natural scene text detection method.
Background
Text detection is an important step in enabling a computer to acquire information and to realize human-machine interaction; its goal is to let the computer, like a human, quickly locate the text regions in its field of view that carry useful information. In a natural scene image, the region with the highest information density is usually text, so the first step in acquiring information is to find where the text is. By selecting only the text regions that contain valid information, the computer acquires information more accurately and efficiently and wastes fewer computing and storage resources downstream, which improves overall image-understanding performance. In general, an image contains both text regions carrying valid information and background regions carrying unwanted information; understanding the image requires attending only to the former while ignoring the latter, a foreground/background distinction that corresponds naturally to semantic segmentation in computer vision. It is therefore feasible to perform scene text detection by having a computer emulate the human visual system.
Earlier text detection methods used traditional machine learning to statistically analyze the pixel distribution in an image. Such methods cannot fully exploit global information; they merely traverse the image with a fixed algorithm, so neither their speed nor their accuracy is ideal. Deep-learning-based methods effectively address both problems. Early deep methods mainly used a neural network to regress the bounding-box parameters of text regions directly; limited by the expressive power of the network, direct box regression can only detect simple text regions, and performs poorly when the text is hard to separate from the background or when the text is curved. Semantic segmentation handles these cases well. First, thanks to the development of deep learning and the rapid growth of computing power, neural networks can now process images fast enough for real-time use. Second, semantic segmentation can precisely separate a target from its background even when the target has a complex outline, so detection remains possible in complex scenes with complex text. By tracing the contours of the detected semantic mask, the exact outline of each text region can be obtained, which makes extraction of complex text in natural scenes more effective.
Disclosure of Invention
The invention aims to solve the technical problems that: the defect of the current scene text detection is overcome, and the edge-enhanced natural scene text detection method based on semantic segmentation is provided, so that the purpose of high-precision and high-efficiency detection is achieved.
The technical scheme of the invention is as follows:
a natural scene text detection method based on semantic segmentation comprises the following steps:
(1) Constructing a basic feature extraction network
The feature extraction network adopts a classical ResNet or MobileNet structure as the backbone. Features at 1/4, 1/8, 1/16 and 1/32 of the input image size are extracted from different layers as outputs, with 64, 128, 256 and 512 channels respectively;
(2) Construction of feature screening Module
The input of the feature screening module has two parts, i and h: i represents the output feature of the feature extraction network, and h represents the output feature of the previous feature screening module. The two parts are fused by convolution and normalized with a sigmoid function; the normalized result serves as a weight for selectively fusing the two inputs i and h into the final fused output feature. The whole operation is defined as follows:
S=sigmoid(conv3(conv1(h),conv2(i)))
out=conv4((1-S)·h+S·i)
where S represents the normalized feature screening heat map, conv(x) denotes a sub-network consisting of convolution, batch normalization and ReLU activation, and out represents the final output feature map, fixed at 64 channels. Note that the above operations also imply a channel transformation step;
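As a minimal illustrative sketch of this gating, the learned conv1/conv2/conv3/conv4 sub-networks can be collapsed into fixed scalar weights (an assumption for illustration only, not the patent's implementation); the fused output is then a pixel-wise convex combination of h and i:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_screen(h, i, w_h=0.5, w_i=0.5):
    # conv1/conv2/conv3 are reduced to scalar weights for illustration;
    # a real implementation would use learned convolutions (and conv4).
    s = sigmoid(w_h * h + w_i * i)   # normalized screening heat map S
    return (1.0 - s) * h + s * i     # out = (1-S)*h + S*i

rng = np.random.default_rng(0)
h = rng.standard_normal((64, 32, 32))  # previous-stage feature, 64 channels
i = rng.standard_normal((64, 32, 32))  # backbone feature after channel transform
out = feature_screen(h, i)
```

Because S lies in (0, 1), every output pixel lies between the corresponding pixels of h and i, which makes the module a soft selector rather than a hard switch.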
(3) Construction of feature pyramid networks
The feature pyramid network fuses the outputs of the feature screening modules. The feature screening module is used at three places in the network, but there is only one module structure, i.e., one module is multiplexed three times. First, the 1/32-size feature map output by the feature extraction network is expanded with a pyramid pooling network (ASPP), yielding a 1/32-size feature map res4. res4 is upsampled to 1/16 size and then, together with the 1/16-size feature map output by the feature extraction network, fed into a feature screening module as the h and i inputs respectively; the module outputs a 1/16-size feature map res3. Repeating these steps yields res2 and res1, at 1/8 and 1/4 size respectively. Finally, res2, res3 and res4 are upsampled to the size of res1 and concatenated along the channel dimension, giving a multi-scale fusion feature map with 256 channels;
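The tensor sizes involved in this fusion can be traced with a shape-only sketch. A hypothetical 128x128 input is assumed, and nearest-neighbour repetition stands in for the real upsampling:

```python
import numpy as np

def upsample2x(x, times=1):
    # nearest-neighbour repetition stands in for the actual upsampling
    for _ in range(times):
        x = x.repeat(2, axis=1).repeat(2, axis=2)
    return x

H = W = 128                              # hypothetical input image size
res1 = np.zeros((64, H // 4,  W // 4))   # 1/4 size
res2 = np.zeros((64, H // 8,  W // 8))   # 1/8 size
res3 = np.zeros((64, H // 16, W // 16))  # 1/16 size
res4 = np.zeros((64, H // 32, W // 32))  # 1/32 size (after ASPP)

# upsample res2/res3/res4 to the size of res1, then concatenate on channels
fused = np.concatenate(
    [res1, upsample2x(res2, 1), upsample2x(res3, 2), upsample2x(res4, 3)],
    axis=0)
```

The result is a 256-channel map at 1/4 of the input resolution, matching the description above.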
(4) Constructing edge-enhanced networks
The edge enhancement network consists of 3 neural-network layers: the first two layers each consist of convolution, batch normalization and ReLU activation, and the last layer consists of convolution, bias and sigmoid activation. The result is a 1-channel edge enhancement heat map with pixel values in [0,1], where a larger value indicates a pixel closer to an edge position;
(5) Constructing semantic segmentation networks
First, the 256-channel feature map output by the feature pyramid network and the 1-channel feature map output by the edge enhancement network are concatenated along the channel dimension, and the result is fed into a 3-layer convolutional neural network. The first 2 layers consist of upsampling, convolution, batch normalization and ReLU activation, where the upsampling uses bilinear interpolation to double the feature-map size. The final layer uses convolution, bias and sigmoid activation to produce a 1-channel semantic segmentation heat map with values between 0 and 1. The heat map is converted into a binary map containing only the values 0 and 1 using a threshold of 0.7;
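A shape-only sketch of this head, assuming a hypothetical 128x128 input, nearest-neighbour repetition in place of bilinear upsampling, and a channel mean standing in for the learned conv layers:

```python
import numpy as np

rng = np.random.default_rng(0)
fpn  = rng.random((256, 32, 32))  # 1/4-size multi-scale fusion map, 256 ch
edge = rng.random((1, 32, 32))    # 1-channel edge enhancement heat map
x = np.concatenate([fpn, edge], axis=0)  # channel concatenation -> 257 ch

def up2x(t):
    # nearest-neighbour stand-in for bilinear 2x upsampling
    return t.repeat(2, axis=1).repeat(2, axis=2)

# the two conv+upsample stages and the sigmoid head are replaced by a
# channel mean here, purely to trace the tensor shapes
heat = up2x(up2x(x.mean(axis=0, keepdims=True)))
binary = (heat >= 0.7).astype(np.uint8)  # binarize at the 0.7 threshold
```

Two doublings bring the 1/4-size map back to full input resolution, where the 0.7 threshold yields the binary text mask.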
(6) Contour forming
Different text regions are separated from the binary map using OpenCV software, and for each region the closed polygon of minimum perimeter enclosing the region is computed; the vertex coordinates of this polygon are the position coordinates of the text region in the image. For a rectangular text region, the coordinates consist of 4 points. For irregular text regions, OpenCV determines the number of polygon vertices automatically.
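The region-separation step can be approximated without OpenCV by a plain connected-component pass; axis-aligned bounding boxes stand in for the minimum-perimeter polygon here, which is exact only for axis-aligned rectangular text:

```python
import numpy as np
from collections import deque

def text_region_boxes(binary):
    """Separate connected text regions and return one 4-point box per
    region. A stand-in for the OpenCV step (findContours plus a
    minimum-perimeter polygon); each region is reduced to its
    axis-aligned bounding box."""
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y, x] and not seen[y, x]:
                q = deque([(y, x)])
                seen[y, x] = True
                ys, xs = [y], [x]
                while q:  # flood-fill one connected region
                    cy, cx = q.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            q.append((ny, nx))
                            ys.append(ny)
                            xs.append(nx)
                x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
                boxes.append([(x0, y0), (x1, y0), (x1, y1), (x0, y1)])
    return boxes

page = np.zeros((20, 40), dtype=np.uint8)
page[2:5, 3:15] = 1    # first text line
page[10:14, 5:30] = 1  # second text line
boxes = text_region_boxes(page)
```

Each box is returned as 4 (x, y) vertices in clockwise order, mirroring the 4-point output described for rectangular regions.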
(7) Training method
For a ResNet backbone, the network is first pre-trained on the image classification dataset ImageNet and the pre-trained weight parameters are saved. The whole network is then warmed up on the synthetic dataset SynthText so that the model converges on the task scene. Finally, formal training is performed on the specific scene dataset. In addition, the OHEM algorithm is used in the design of the loss function to mine hard examples and balance the area gap between foreground and background.
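The OHEM selection can be sketched on a flattened loss map as keeping all positives plus only the hardest negatives; the 3:1 negative-to-positive ratio below is a commonly used choice assumed for illustration, not one fixed by this document:

```python
import numpy as np

def ohem_mask(loss, gt, neg_ratio=3):
    """Keep every positive pixel plus only the hardest negatives,
    at most neg_ratio negatives per positive (assumed ratio)."""
    pos = gt > 0.5
    n_pos = int(pos.sum())
    n_neg = min(n_pos * neg_ratio, int((~pos).sum()))
    keep = pos.copy()
    if n_neg > 0:
        neg_loss = np.where(pos, -np.inf, loss)      # mask out positives
        idx = np.argsort(neg_loss.ravel())[-n_neg:]  # hardest negatives
        keep.ravel()[idx] = True
    return keep

gt   = np.array([1.0, 1.0, 0, 0, 0, 0, 0, 0, 0, 0])
loss = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
mask = ohem_mask(loss, gt)
```

Only the pixels selected by the mask contribute to the loss, which keeps the abundant easy background from swamping the scarce text pixels.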
The beneficial effects of the invention are as follows. The invention fully exploits the strong foreground/background discrimination of the semantic segmentation algorithm and performs multi-scale feature extraction through the feature pyramid network, ensuring that both small and large text in the image can be detected effectively. By introducing the information selection gate structure, the upsampling and feature fusion stages propagate only effective information, removing redundant information from the network. In addition, because both the semantic segmentation algorithm and the contour-forming algorithm handle irregular areas naturally, the scheme detects irregular text regions accurately.
Drawings
Fig. 1 illustrates the multi-scale feature extraction network. The top row represents the feature extraction backbone, the progressively smaller shapes indicating progressively smaller extracted feature maps. The middle row represents the two-input feature filter gates, with ASPP denoting the pyramid pooling network. The next row of differently sized boxes represents the extracted multi-scale feature maps. Finally, the feature maps are aggregated together through an upsampling step;
FIG. 2 shows the internal structure of the feature filter gate, where conv(x) denotes several layers of convolutional networks, × denotes pixel-wise multiplication, and + denotes pixel-wise addition;
FIG. 3 is a schematic diagram of an edge enhancement network, a semantic segmentation network, and a binarization process, wherein conv (x) represents a number of layers of convolutional networks;
FIG. 4 is the ground-truth map for the edge enhancement structure. Of the three lines, the innermost represents the boundary after the text outline is shrunk to 0.5 of its original size, and all pixel values inside it are set to 0. The outermost boundary represents the text outline enlarged to 1.25 times its original size, and all pixel values outside it are set to 0. The middle black line, with value 1, represents the original boundary; pixel values between the three boundary lines are linearly interpolated;
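The interpolation scheme of FIG. 4 can be sketched as a 1-D profile along the normal of a text boundary; the margin values below are illustrative stand-ins for the pixel distances implied by the 0.5x shrink and 1.25x enlargement, which depend on the actual contour size:

```python
import numpy as np

def edge_truth(dist, inner_margin=2.0, outer_margin=1.0):
    """Hypothetical 1-D edge ground-truth profile. dist is the signed
    distance from the original boundary (negative inside the text).
    Value is 1 on the boundary, falling linearly to 0 at the shrunk
    boundary (dist = -inner_margin) and the enlarged boundary
    (dist = +outer_margin), and 0 beyond both."""
    inner = np.clip(1.0 + dist / inner_margin, 0.0, 1.0)
    outer = np.clip(1.0 - dist / outer_margin, 0.0, 1.0)
    return np.where(dist < 0.0, inner, outer)

profile = edge_truth(np.array([-3.0, -2.0, -1.0, 0.0, 0.5, 1.0, 2.0]))
```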
fig. 5 is an input image example;
FIG. 6 is a semantic segmentation result example;
fig. 7 is a frame example of a text region.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings and technical schemes.
A natural scene text detection method based on semantic segmentation comprises the following steps:
(1) Constructing a basic feature extraction network
The feature extraction network employs the ResNet architecture as backbone, shown as the top-row conv(x) in fig. 1. Its input is a 3-channel RGB image, as shown in fig. 5. Features at 1/4, 1/8, 1/16 and 1/32 of the input image size are extracted from layers 4, 6, 9 and 13 of ResNet as outputs, with 64, 128, 256 and 512 channels respectively;
(2) Construction of feature screening Module
As shown in fig. 2, the inputs of the feature screening module are i and h: i represents the output feature of the feature extraction network, and h represents the output feature of the previous feature screening module. The two parts are fused by convolution and normalized with a sigmoid function; the normalized result serves as a weight for selectively fusing the i and h inputs into the final fused output feature. The whole operation is defined as follows:
S=sigmoid(conv3(conv1(h),conv2(i)))
out=conv4((1-S)·h+S·i)
where S is the normalized feature screening heat map and out is the final output feature map, which has 64 channels and the same size as i and h;
(3) Construction of feature pyramid networks
The feature pyramid network fuses the outputs of the feature screening modules. As shown in fig. 1, the feature screening module is used at three places in the network, but there is only one module structure, i.e., one module is multiplexed three times. First, the 1/32-size feature map output by the feature extraction network is expanded with a pyramid pooling network (ASPP), yielding a 1/32-size feature map res4. res4 is upsampled to 1/16 size and then, together with the 1/16-size feature map output by the feature extraction network, fed into a feature screening module as the h and i inputs respectively; the module outputs a 1/16-size feature map res3. Repeating these steps yields res2 and res1, at 1/8 and 1/4 size respectively. Finally, res2, res3 and res4 are upsampled to the size of res1 and concatenated along the channel dimension, giving a multi-scale fusion feature map with 256 channels;
(4) Constructing edge-enhanced networks
The edge enhancement network consists of 3 neural-network layers: the first two layers each consist of convolution, batch normalization and ReLU activation, and the last layer consists of convolution, bias and sigmoid activation. The result is a 1-channel edge enhancement heat map with pixel values in [0,1], where a larger value indicates a pixel closer to an edge position. FIG. 4 illustrates the distribution of pixel values at text edge locations in the heat map;
(5) Constructing semantic segmentation networks
First, the 256-channel feature map output by the feature pyramid network and the 1-channel feature map output by the edge enhancement network are concatenated along the channel dimension, and the result is fed into a 3-layer convolutional neural network. The first 2 layers consist of upsampling, convolution, batch normalization and ReLU activation, where the upsampling uses bilinear interpolation to double the feature-map size. The final layer uses convolution, bias and sigmoid activation to produce a 1-channel semantic segmentation heat map with values between 0 and 1. The heat map is converted into a binary map containing only the values 0 and 1 using a threshold of 0.7, as shown in fig. 6, where the black areas mark the positions of characters and the white area is background;
(6) Contour forming
Different text regions are separated from the binary map using OpenCV software, and for each region the closed polygon of minimum perimeter enclosing the region is computed; the vertex coordinates of the polygon are the position coordinates of the text region in the image. In fig. 6, a total of 3 text regions are detected by semantic segmentation and binarization; in fig. 7, the border of each text region is derived from the binary map using OpenCV. For the 3 rectangular text regions in fig. 7, OpenCV outputs the coordinates of 4 vertices each, and these coordinate points are taken as the text region coordinates. For irregular text regions, OpenCV determines the number of polygon vertices automatically.
(7) Training method
Using ResNet as the backbone network, it is pre-trained on the image classification dataset ImageNet and the pre-trained weight parameters are saved. The whole network is then pre-trained on the synthetic dataset SynthText so that the model converges on the task scene. Finally, formal training is performed on the specific scene dataset. In addition, the OHEM algorithm is used in the design of the loss function to balance positive and negative samples and the area gap between foreground and background. The optimizer is Adam with a batch size of 8 and an exponentially decaying learning rate: the initial learning rate is 0.0001 and is multiplied by 0.95 after every 10,000 iterations, for 100,000 iterations in total.
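The learning-rate schedule described above amounts to a step-wise exponential decay, which can be written directly:

```python
def learning_rate(step, base_lr=1e-4, decay=0.95, interval=10_000):
    # step-wise exponential decay: multiply by 0.95 after every
    # 10,000 iterations, starting from 0.0001
    return base_lr * decay ** (step // interval)
```

Over the stated 100,000 iterations the rate falls in ten steps from 1e-4 to roughly 6e-5.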
Claims (1)
1. A natural scene text detection method based on semantic segmentation is characterized by comprising the following steps:
(1) Constructing a basic feature extraction network
The feature extraction network adopts ResNet or MobileNet network structure as backbone, 1/4, 1/8, 1/16 and 1/32 features of the input image size are extracted from different layers as output, and the number of channels corresponding to the output features is 64, 128, 256 and 512 channels respectively;
(2) Construction of feature screening Module
The input of the feature screening module is divided into two parts i and h, i represents the output feature of the feature extraction network, h represents the output feature of the feature screening module at the upper stage, the two parts are subjected to convolution fusion and then normalized by using a sigmoid function, the normalized result is used as a weight, the two inputs i and h are subjected to selective fusion, and finally the fused output feature is obtained; the whole operation process is defined as follows:
S=sigmoid(conv3(conv1(h),conv2(i)))
out=conv4((1-S)·h+S·i)
wherein S represents the normalized feature screening heat map; conv(x) represents a sub-network consisting of convolution, batch normalization and ReLU activation; out represents the final output feature map, fixed at 64 channels; a channel transformation step is also implied in the above operations;
(3) Construction of feature pyramid networks
The feature pyramid network fuses the outputs of the feature screening modules; the feature screening module is used at three places in the feature pyramid network, but there is only one module structure, i.e., one module is multiplexed at three places; first, the 1/32-size feature map output by the feature extraction network is expanded with a pyramid pooling network to obtain a 1/32-size feature map res4; res4 is upsampled to 1/16 size and, together with the 1/16-size feature map output by the feature extraction network, fed into a feature screening module as the h and i inputs respectively, the module outputting a 1/16-size feature map res3; these steps are repeated to obtain res2 and res1 at 1/8 and 1/4 size respectively; finally, res2, res3 and res4 are upsampled to the size of res1 and concatenated along the channel dimension to obtain a multi-scale fusion feature map with 256 channels;
(4) Constructing edge-enhanced networks
The edge strengthening network consists of 3 neural-network layers: the first two layers each consist of convolution, batch normalization and ReLU activation, and the last layer consists of convolution, bias and sigmoid activation; the result is an edge strengthening heat map with 1 channel and pixel values in [0,1], where a larger value indicates a pixel closer to an edge position;
(5) Constructing semantic segmentation networks
Firstly, the 256-channel feature map output by the feature pyramid network and the 1-channel feature map output by the edge enhancement network are concatenated along the channel dimension and the result is input into a 3-layer convolutional neural network; the first 2 layers consist of upsampling, convolution, batch normalization and ReLU activation, the upsampling using bilinear interpolation to double the feature-map size; the final layer uses convolution, bias and sigmoid activation to obtain a 1-channel semantic segmentation heat map with values between 0 and 1; the heat map is converted into a binary map containing only the values 0 and 1 using a threshold of 0.7;
(6) Contour forming
Different text regions are separated from the binary map using OpenCV software, and for each region the closed polygon of minimum perimeter enclosing the region is computed; the vertex coordinates of the polygon are the position coordinates of the text region in the image; for a rectangular text region, the coordinates consist of 4 points; for irregular text regions, OpenCV automatically determines the number of polygon vertices;
(7) Training method
Using ResNet as the backbone network, it is pre-trained on the image classification dataset ImageNet and the pre-trained weight parameters are saved; the whole network is then pre-trained on the synthetic dataset SynthText so that the model converges on the task scene; finally, formal training is performed on the specific scene dataset; in addition, the OHEM algorithm is used in the design of the loss function to balance positive and negative samples and the area gap between foreground and background; the optimizer is Adam with a batch size of 8 and an exponentially decaying learning rate, the initial learning rate being 0.0001 and multiplied by 0.95 after every 10,000 iterations, for 100,000 iterations in total.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111157377.4A CN113888505B (en) | 2021-09-30 | 2021-09-30 | Natural scene text detection method based on semantic segmentation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111157377.4A CN113888505B (en) | 2021-09-30 | 2021-09-30 | Natural scene text detection method based on semantic segmentation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113888505A CN113888505A (en) | 2022-01-04 |
CN113888505B true CN113888505B (en) | 2024-05-07 |
Family
ID=79004733
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111157377.4A Active CN113888505B (en) | 2021-09-30 | 2021-09-30 | Natural scene text detection method based on semantic segmentation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113888505B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114399710A (en) * | 2022-01-06 | 2022-04-26 | 昇辉控股有限公司 | Identification detection method and system based on image segmentation and readable storage medium |
CN114092930B (en) * | 2022-01-07 | 2022-05-03 | 中科视语(北京)科技有限公司 | Character recognition method and system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110097049A (en) * | 2019-04-03 | 2019-08-06 | 中国科学院计算技术研究所 | Natural scene text detection method and system |
CN110322495A (en) * | 2019-06-27 | 2019-10-11 | 电子科技大学 | Scene text segmentation method based on weakly supervised deep learning |
CN111553351A (en) * | 2020-04-26 | 2020-08-18 | 佛山市南海区广工大数控装备协同创新研究院 | Semantic segmentation based text detection method for arbitrary scene shape |
CN112966691A (en) * | 2021-04-14 | 2021-06-15 | 重庆邮电大学 | Multi-scale text detection method and device based on semantic segmentation and electronic equipment |
Non-Patent Citations (2)
Title |
---|
Yu Zeng; Yunzhi Zhuge; Huchuan Lu; Lihe Zhang. Joint Learning of Saliency Detection and Weakly Supervised Semantic Segmentation. arXiv, 2019, full text. * |
Text detection in natural scenes based on lightweight networks; Sun Jingjing; Zhang Qinglin; Electronic Measurement Technology; 2020-04-23 (No. 08); full text * |
Also Published As
Publication number | Publication date |
---|---|
CN113888505A (en) | 2022-01-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110287849B (en) | Lightweight deep network image object detection method suitable for Raspberry Pi | |
CN111047551B (en) | Remote sensing image change detection method and system based on U-net improved algorithm | |
CN111640125B (en) | Aerial image building detection and segmentation method and device based on Mask R-CNN | |
CN109034210A (en) | Object detection method based on super feature fusion and multi-scale pyramid network | |
CN113888505B (en) | Natural scene text detection method based on semantic segmentation | |
CN111612008A (en) | Image segmentation method based on convolution network | |
CN113807355A (en) | Image semantic segmentation method based on coding and decoding structure | |
CN111046917B (en) | Object-based enhanced target detection method using a deep neural network | |
CN111797841B (en) | Visual saliency detection method based on depth residual error network | |
CN110532946A (en) | Method for identifying axle types of green-channel vehicles based on convolutional neural networks | |
CN115620010A (en) | Semantic segmentation method for RGB-T bimodal feature fusion | |
CN113706545A (en) | Semi-supervised image segmentation method based on dual-branch neural discriminative dimensionality reduction | |
CN111353544A (en) | Target detection method based on improved Mixed Pooling-YOLOv3 | |
CN114820579A (en) | Semantic segmentation based image composite defect detection method and system | |
CN110852330A (en) | Single-stage behavior recognition method | |
CN110852199A (en) | Foreground extraction method based on double-frame coding and decoding model | |
CN114943876A (en) | Cloud and cloud shadow detection method and device for multi-level semantic fusion and storage medium | |
CN113516126A (en) | Adaptive threshold scene text detection method based on attention feature fusion | |
CN113298817A (en) | High-accuracy semantic segmentation method for remote sensing image | |
Zhang et al. | R2net: Residual refinement network for salient object detection | |
CN113487610B (en) | Herpes image recognition method and device, computer equipment and storage medium | |
CN113408524A (en) | Crop image segmentation and extraction algorithm based on MASK RCNN | |
CN107766838B (en) | Video scene switching detection method | |
CN115578721A (en) | Streetscape text real-time detection method based on attention feature fusion | |
CN112861860B (en) | Text detection method in natural scene based on upper and lower boundary extraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||