CN113065547A - Character supervision information-based weak supervision text detection method - Google Patents

Character supervision information-based weak supervision text detection method

Info

Publication number
CN113065547A
Authority
CN
China
Prior art keywords
character
text
network
supervision
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110262361.3A
Other languages
Chinese (zh)
Inventor
刘义江
陈蕾
侯栋梁
池建昆
范辉
阎鹏飞
魏明磊
李云超
姜琳琳
辛锐
陈曦
杨青
沈静文
吴彦巧
姜敬
檀小亚
师孜晗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiongan New Area Power Supply Company State Grid Hebei Electric Power Co
State Grid Hebei Electric Power Co Ltd
Original Assignee
Xiongan New Area Power Supply Company State Grid Hebei Electric Power Co
State Grid Hebei Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiongan New Area Power Supply Company State Grid Hebei Electric Power Co, State Grid Hebei Electric Power Co Ltd
Priority to CN202110262361.3A
Publication of CN113065547A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/155 Segmentation; Edge detection involving morphological operators
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Abstract

The invention discloses a weakly supervised text detection method based on character supervision information, relating to the field of text detection. The method comprises the following steps: extracting features with a backbone network; upsampling the extracted features; generating character-level labels; outputting a character region probability map and a text center line; obtaining connected regions with high response values and then expanding them to obtain complete character boundaries; and traversing the text center line, connecting all the points in each text region, and smoothing to obtain the final detection region. The method can be applied to text detection in a variety of scenes, and the character detection results allow the position of each character to be located accurately, yielding higher detection precision. The weakly supervised learning scheme lets the whole network iterate continuously and finally reach good convergence.

Description

Character supervision information-based weak supervision text detection method
Technical Field
The invention belongs to the field of text detection, and particularly relates to a weak supervision text detection method based on character supervision information.
Background
As a key step in OCR technology, text detection has long attracted the attention of researchers. Its purpose is to accurately locate the characters in a picture and output their specific coordinates for a subsequent recognition model. It is already widely applied in fields such as autonomous driving and image retrieval. Traditional text detection technology mainly targets printed matter: a scanning device converts an optical document into an image file, the image is converted into a character dot-matrix format, and subsequent algorithms then edit and process it. With the times, however, the objects to be processed have gradually evolved into text in natural scenes, where the environment is more complex and the fonts are more variable. For such real scenes, the earlier methods have severe limitations.
For text detection in natural scenes, existing detection techniques mainly use regression or segmentation methods that take the word as the basic unit and directly predict the region of the whole word. These methods handle tightly spaced text well; however, in many practical application scenes the spacing between the characters of a word is large, and on a word basis it is then difficult to obtain complete text boundary information, which degrades the overall detection result. This patent mainly addresses text detection in such complex scenes.
Disclosure of Invention
The invention provides a weakly supervised text detection method based on character supervision information, intended to solve the prior-art problems of detecting text against complex backgrounds and in variable fonts in natural scenes.
The invention adopts the following technical scheme:
the technical scheme of the invention mainly comprises two parts: the first part is a process of taking characters as learning targets and extracting word central line features, and the second part is a process of combining single characters and word central line based post-processing into a complete word. In the first part, ResNet34 added with a cavity convolution layer is adopted for feature extraction, a reverse U-shaped structure is utilized for semantic information enhancement, a feature map of each character area and a feature map of a word center line are obtained, the fact that most data sets are not labeled at a character level is considered, a weak supervision mode is introduced, character information is continuously generated in the training process in an iteration mode, and meanwhile confidence level setting is added to mark the quality of a weak supervision generated result. In the second part, the character feature map is used to restore complete characters, the word central lines are used to connect characters belonging to the same word, and finally the boundary is smoothed to obtain the final text region.
A weak supervision text detection method based on character supervision information comprises the following steps:
S100: extracting features with a backbone network;
S200: upsampling the extracted features through an upsampling network;
S300: generating character-level labels for the obtained upsampled features with a watershed algorithm in a weakly supervised manner;
S400: outputting a character region probability map and a text center line after the features fused by the upsampling network pass through four convolutional layers;
S500: after the character probability map is obtained, extracting connected regions with high response values using OpenCV, then expanding each region with the Vatti algorithm to obtain complete character boundaries;
S600: traversing the text center line, treating the characters it passes through as the same text, taking the upper-left, upper-right, lower-right and lower-left points on each character boundary, and finally sorting, connecting and smoothing all the points in each text region to obtain the final detection region.
Further, the backbone network is a ResNet34 network.
Further, three convolutional layers are embedded as a block to replace the third layer of the ResNet34 network; each convolutional layer replaces the standard convolution with a dilated (atrous) convolution kernel, with the dilation rates set to 1, 2 and 3 respectively.
Further, an extra layer is added to the ResNet34 network for further feature extraction.
Furthermore, the upsampling network consists of four blocks; each block convolves the extracted features twice and then upsamples. The output of each block is added element-wise to the output of the corresponding backbone block and then fed into the next block.
Further, weak supervision generates the character labels as follows: the corresponding word region is cropped according to the provided coordinate information, the position of each character is then obtained with a watershed algorithm, and the result is fed into the network as annotation information to participate in training.
Further, after the character result is generated, a confidence value is computed to measure whether this generated result is credible; the calculation formula is:
$$ s_{conf}(w) = \frac{l_c(w) - \min\bigl(l_c(w),\ \lvert l_c(w) - l(w) \rvert\bigr)}{l_c(w)} $$
where l(w) denotes the number of characters predicted in the word w and l_c(w) denotes the number of characters the word w contains according to the real label; when the predicted character count equals that of the original word, the result is considered completely credible.
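For illustration, this confidence can be computed directly from the two character counts. A minimal Python sketch follows; the standalone function form is ours, and it normalizes by the true count l_c(w), consistent with the property that matching counts give full confidence:

```python
def word_confidence(num_pred_chars: int, num_true_chars: int) -> float:
    """Confidence of a weakly generated character label for one word.

    num_pred_chars: l(w), characters found by the watershed step.
    num_true_chars: l_c(w), characters in the word's real label.
    Returns 1.0 when the counts agree, decaying toward 0 as they diverge.
    """
    diff = abs(num_true_chars - num_pred_chars)
    return (num_true_chars - min(num_true_chars, diff)) / num_true_chars

# Example: a 5-character word split into 4 regions by the watershed step.
print(word_confidence(4, 5))  # 0.8
```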
Further, before the feature extraction of S100, the method also comprises S90: a picture resizing step, in which the pictures are adjusted to a uniform size; pictures whose size does not meet the requirement are processed with bilinear interpolation and/or data augmentation.
Further, the data augmentation includes: random rotation by a certain angle, random changes to the image brightness, and random adjustment of the picture saturation.
Further, before the picture resizing step S90, the method also comprises S80: a weakly supervised training-label preparation step, which generates a region probability distribution map for each character and a text center line.
(1) After a picture is input, features are extracted by the backbone network. In this method we chose ResNet34 as the backbone, balancing runtime against final accuracy. To enlarge the receptive field of the network while retaining as much detail as possible, we replaced the convolutional layers of the third stage of ResNet34: three convolution layers are rebuilt as a new block that replaces the block in the original third stage, and each convolution layer substitutes a dilated convolution kernel for the standard convolution, with dilation rates of 1, 2 and 3 respectively. The dilated kernels further increase the network's ability to extract large-scale features. In addition, we add an extra layer for further feature extraction.
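A minimal PyTorch sketch of such a dilated-convolution block is given below; the channel width (256) and the BatchNorm/ReLU pairing are illustrative assumptions, not values fixed by the patent:

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """Three 3x3 convolutions with dilation rates 1, 2 and 3, standing in
    for the replaced third stage of ResNet34. With padding equal to the
    dilation rate, each 3x3 convolution preserves the spatial size."""
    def __init__(self, channels: int = 256):
        super().__init__()
        layers = []
        for rate in (1, 2, 3):
            layers += [
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=rate, dilation=rate, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ]
        self.body = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)
```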
(2) After feature extraction, an upsampling module is added. This module fuses the spatial information of the high-resolution maps with the semantic information of the low-resolution maps, improving the generalization ability of the whole network. The upsampling network consists of four blocks; each block convolves the features twice and then upsamples. The output of each block is added element-wise to the output of the corresponding backbone block and fed into the next block.
(3) Weakly supervised learning. Because character-level annotation is too costly, existing real data sets almost all carry word-level annotation only, so this method adopts a weakly supervised scheme and iteratively generates character-level labels during training. The generation process first crops the corresponding word region according to the provided coordinates and feeds it into the network, then obtains the position of each character with a watershed algorithm, and finally sends the generated result back to the network as a label for training. After the character result is generated, a confidence value is computed to measure whether the watershed result is credible.
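A sketch of such watershed-based character splitting with OpenCV is shown below; the Otsu binarization, the distance-transform seeding and the threshold values are illustrative assumptions about how the watershed step could be realized:

```python
import cv2
import numpy as np

def split_word_into_chars(word_crop: np.ndarray) -> np.ndarray:
    """Split a cropped word image (8-bit BGR) into character regions
    with the watershed algorithm; labels > 1 mark individual characters."""
    gray = cv2.cvtColor(word_crop, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    kernel = np.ones((3, 3), np.uint8)
    sure_bg = cv2.dilate(binary, kernel, iterations=3)
    # Seeds for each character from the distance transform.
    dist = cv2.distanceTransform(binary, cv2.DIST_L2, 5)
    _, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, cv2.THRESH_BINARY)
    sure_fg = sure_fg.astype(np.uint8)
    unknown = cv2.subtract(sure_bg, sure_fg)
    _, markers = cv2.connectedComponents(sure_fg)
    markers += 1                 # background becomes 1, characters 2..N
    markers[unknown == 255] = 0  # 0 = region the watershed must decide
    return cv2.watershed(word_crop, markers)
```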
(4) After the features fused by the upsampling network pass through four convolutional layers, a character region probability map and a text center line are output. The generated character results are sent back to the network, and steps (2) to (5) are repeated until the network converges.
(5) After the character probability map is obtained, connected regions with high response values are extracted with OpenCV, and each region is expanded with the Vatti algorithm to obtain the complete character boundary. The text center line is then traversed, and the characters it passes through are treated as the same text. The upper-left, upper-right, lower-right and lower-left points are taken on each character boundary, and finally all the points in each text region are sorted, connected and smoothed to obtain the final detection region.
The invention has the following positive effects:
a weak supervision text detection method based on character supervision information comprises the following steps:
S100: extracting features with a backbone network;
S200: upsampling the extracted features through an upsampling network;
S300: generating character-level labels for the obtained upsampled features with a watershed algorithm in a weakly supervised manner;
S400: outputting a character region probability map and a text center line after the features fused by the upsampling network pass through four convolutional layers;
S500: after the character probability map is obtained, extracting connected regions with high response values using OpenCV, then expanding each region with the Vatti algorithm to obtain complete character boundaries;
S600: traversing the text center line, treating the characters it passes through as the same text, taking the upper-left, upper-right, lower-right and lower-left points on each character boundary, and finally sorting, connecting and smoothing all the points in each text region to obtain the final detection region.
The method can be applied to text detection in a variety of scenes, and the character detection results allow the position of each character to be located accurately, yielding higher detection precision. The weakly supervised learning scheme lets the whole network iterate continuously and finally reach good convergence. Using the text center line as a learning target reduces the difficulty of network training; the network not only performs well on horizontal text but also detects inclined and curved text well. In addition, the network generalizes well: after training in one scene it can be used directly in other scenes, and it is also very effective on text that is hard to detect under weak illumination.
Drawings
Fig. 1 is a structural diagram of a backbone network ResNet34 according to an embodiment of the present invention;
FIG. 2 is a block diagram of an upsampling module in accordance with an embodiment of the present invention;
FIG. 3 is a diagram illustrating a process of transforming a two-dimensional Gaussian distribution into a quadrilateral frame by perspective transformation according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
As shown in fig. 1-3, a weak supervised text detection method based on character supervision information includes the following steps:
S100: extracting features with a backbone network;
S200: upsampling the extracted features through an upsampling network;
S300: generating character-level labels for the obtained upsampled features with a watershed algorithm in a weakly supervised manner;
S400: outputting a character region probability map and a text center line after the features fused by the upsampling network pass through four convolutional layers;
S500: after the character probability map is obtained, extracting connected regions with high response values using OpenCV, then expanding each region with the Vatti algorithm to obtain complete character boundaries;
S600: traversing the text center line, treating the characters it passes through as the same text, taking the upper-left, upper-right, lower-right and lower-left points on each character boundary, and finally sorting, connecting and smoothing all the points in each text region to obtain the final detection region.
Further, the backbone network is a ResNet34 network.
Further, three convolutional layers are embedded as a block to replace the third layer of the ResNet34 network; each convolutional layer replaces the standard convolution with a dilated (atrous) convolution kernel, with the dilation rates set to 1, 2 and 3 respectively.
Further, an extra layer is added to the ResNet34 network for further feature extraction.
Furthermore, the upsampling network consists of four blocks; each block convolves the extracted features twice and then upsamples. The output of each block is added element-wise to the output of the corresponding backbone block and then fed into the next block.
Further, weak supervision generates the character labels as follows: the corresponding word region is cropped according to the provided coordinate information, the position of each character is then obtained with a watershed algorithm, and the result is fed into the network as annotation information to participate in training.
Furthermore, a confidence value is computed after the character result is predicted, measuring whether this generated result is credible; the calculation formula is:
$$ s_{conf}(w) = \frac{l_c(w) - \min\bigl(l_c(w),\ \lvert l_c(w) - l(w) \rvert\bigr)}{l_c(w)} $$
where l(w) denotes the number of characters predicted in the word w and l_c(w) denotes the number of characters the word w contains according to the real label; when the predicted character count equals that of the original word, the result is considered completely credible.
Further, before the feature extraction of S100, the method also comprises S90: a picture resizing step, in which the pictures are adjusted to a uniform size; pictures whose size does not meet the requirement are processed with bilinear interpolation and/or data augmentation.
Further, the data augmentation includes: random rotation by a certain angle, random changes to the image brightness, and random adjustment of the picture saturation.
Further, before the picture resizing step S90, the method also comprises S80: a weakly supervised training-label preparation step, which generates a region probability distribution map for each character and a text center line.
The technical scheme of the invention mainly comprises two parts. The first part takes characters as the learning target and extracts word center-line features; the second part is the post-processing that, guided by the word center line, combines single characters into a complete word. In the first part, a ResNet34 augmented with dilated convolution layers performs feature extraction, and an inverted U-shaped structure enhances the semantic information, yielding a feature map of each character region and a feature map of the word center line. Considering that most data sets carry no character-level annotation, a weakly supervised scheme is introduced: character information is generated continuously during training by a watershed algorithm, and a confidence value is attached to mark the quality of each weakly generated result. In the second part, the character feature map is used to restore complete characters, the word center lines connect characters belonging to the same word, and finally the boundary is smoothed to obtain the final text region.
The text detection method comprises the following main steps:
(1) After a picture is input, features are extracted by the backbone network. In this method we chose ResNet34 as the backbone, balancing runtime against final accuracy. To enlarge the receptive field of the network while retaining as much detail as possible, we replaced the convolutional layers of the third stage of ResNet34: three convolution layers are rebuilt and embedded into the third stage as a block, and each convolution layer substitutes a dilated convolution kernel for the standard convolution, with dilation rates of 1, 2 and 3 respectively. The dilated kernels further increase the network's ability to extract large-scale features. In addition, we add an extra layer for further feature extraction. The adjusted network is shown in Fig. 1.
(2) After feature extraction, an upsampling module is added. This module fuses the spatial information of the high-resolution maps with the semantic information of the low-resolution maps, improving the generalization ability of the whole network. The upsampling network consists of four blocks; each block convolves the features twice and then upsamples. The output of each block is added element-wise to the output of the corresponding backbone block and fed into the next block. Its structure is shown in Fig. 2.
(3) Weakly supervised learning. Because character-level annotation is too costly, existing real data sets almost all carry word-level annotation only, so this method adopts a weakly supervised scheme and iteratively generates character-level labels during training. The generation process first crops the corresponding word region according to the provided coordinates and feeds it into the network, then obtains the position of each character with a watershed algorithm, and finally sends the generated result back to the network as a label for training. After the character result is generated, a confidence value is computed to measure whether the watershed result is credible; the calculation formula is as follows:
$$ s_{conf}(w) = \frac{l_c(w) - \min\bigl(l_c(w),\ \lvert l_c(w) - l(w) \rvert\bigr)}{l_c(w)} $$
where l(w) denotes the number of characters predicted in the word w and l_c(w) denotes the number of characters the word w contains according to the real label. When the predicted character count equals that of the original word, we consider the result completely credible.
(4) After the features fused by the upsampling network pass through four convolutional layers, a character region probability map and a text center line are output. The generated character results are sent back to the network, and steps (2) to (5) are repeated until the network converges.
(5) After the character probability map is obtained, connected regions with high response values are extracted with OpenCV, and each region is expanded with the Vatti algorithm to obtain the complete character boundary. The text center line is then traversed, and the characters it passes through are treated as the same text. The upper-left, upper-right, lower-right and lower-left points are taken on each character boundary, and finally all the points in each text region are sorted, connected and smoothed to obtain the final detection region.
The following is a specific embodiment of the invention: a weakly supervised text detection method based on character supervision information, with the following concrete process:
Preparation of weakly supervised training labels:
1. The label comprises a probability distribution map and a text center line. For each image we need to generate a region probability distribution map for every character, taking into account that the center and the edges inside a piece of text differ. The probability can be expressed with a continuous two-dimensional Gaussian distribution: pixel points at the center of a character receive higher scores and pixel points at the character edge receive lower ones, so that the position information of the pixels is fully exploited. However, since character shapes are generally irregular, the two-dimensional Gaussian distribution must be mapped onto the quadrilateral box through a perspective transformation, as shown in Fig. 3.
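A sketch of this warping with OpenCV follows; the canonical patch size and the sigma ratio are illustrative assumptions:

```python
import cv2
import numpy as np

def add_gaussian_region(heatmap: np.ndarray, quad: np.ndarray,
                        size: int = 64, sigma_ratio: float = 0.25) -> None:
    """Warp a canonical 2D Gaussian onto a character quadrilateral and
    accumulate it into `heatmap` (float32, H x W). `quad` holds the four
    corners in top-left, top-right, bottom-right, bottom-left order."""
    ax = np.arange(size, dtype=np.float32) - (size - 1) / 2
    xx, yy = np.meshgrid(ax, ax)
    sigma = sigma_ratio * size
    gauss = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    src = np.float32([[0, 0], [size - 1, 0],
                      [size - 1, size - 1], [0, size - 1]])
    M = cv2.getPerspectiveTransform(src, quad.astype(np.float32))
    h, w = heatmap.shape
    warped = cv2.warpPerspective(gauss, M, (w, h))
    np.maximum(heatmap, warped, out=heatmap)  # keep the strongest response

heatmap = np.zeros((800, 800), np.float32)
quad = np.array([[100, 100], [150, 105], [148, 160], [98, 155]])
add_gaussian_region(heatmap, quad)
```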
The center line is generated from the ground-truth label provided in advance. Ten points are uniformly sampled on each of the upper and lower sides of the ground-truth box, the midpoint of each vertical pair is computed in turn, and the polyline connecting these midpoints together with the midpoints of the two side edges (12 points in total) is used as the text center line.
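A sketch of this center-line construction follows; since the text leaves it open whether the ten samples per side include the corner points, this version samples interior points and adds the two side-edge midpoints to reach the stated 12:

```python
import numpy as np

def text_center_line(quad: np.ndarray) -> np.ndarray:
    """Center line of a quadrilateral ground-truth box given as
    [top-left, top-right, bottom-right, bottom-left]; returns 12 points."""
    tl, tr, br, bl = [np.asarray(p, np.float64) for p in quad]
    t = (np.arange(1, 11) / 11.0)[:, None]  # 10 interior positions
    top = (1 - t) * tl + t * tr             # samples on the upper side
    bottom = (1 - t) * bl + t * br          # samples on the lower side
    mids = (top + bottom) / 2.0             # 10 pairwise midpoints
    left_mid, right_mid = (tl + bl) / 2.0, (tr + br) / 2.0
    return np.vstack([left_mid, mids, right_mid])
```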
2. Scene text picture preprocessing
During training the picture size is fixed to 800 x 800, and pictures whose size does not meet the requirement are processed by bilinear interpolation. The data augmentation used by the method includes: random rotation by a certain angle, random changes to the image brightness, and random adjustment of the image saturation.
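A sketch of this preprocessing and augmentation with OpenCV is given below; the rotation range and the brightness/saturation jitter factors are illustrative assumptions, as the text does not specify them:

```python
import random
import cv2
import numpy as np

def preprocess(img: np.ndarray, train: bool = True) -> np.ndarray:
    """Resize a BGR image to the fixed 800 x 800 training size with
    bilinear interpolation, then apply the named augmentations."""
    img = cv2.resize(img, (800, 800), interpolation=cv2.INTER_LINEAR)
    if train:
        angle = random.uniform(-15, 15)  # random rotation
        M = cv2.getRotationMatrix2D((400, 400), angle, 1.0)
        img = cv2.warpAffine(img, M, (800, 800))
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
        hsv[..., 2] *= random.uniform(0.7, 1.3)  # brightness (V channel)
        hsv[..., 1] *= random.uniform(0.7, 1.3)  # saturation (S channel)
        img = cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8),
                           cv2.COLOR_HSV2BGR)
    return img
```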
3. Character-level scene text picture feature extraction based on weak supervision
The tensor obtained after picture preprocessing is fed into ResNet34 for feature extraction, where the third layer of the original ResNet34 is replaced by a block composed of dilated convolutions. In addition, the method adds an extra layer to enhance feature extraction.
4. Feature semantic information enhancement based on up-sampling module
The ResNet34 network extracts spatial features, while semantic information during training helps recognize text of different sizes. Therefore four upsampling modules are added for feature fusion. After the features enter a module, a 1 x 1 convolution increases the channel dimension and a 3 x 3 convolution then processes the features, with a regularization operation attached to each convolution to prevent overfitting. Finally, an upsampling operation enlarges the feature map, which is added to the output of ResNet34 and fed into the next block.
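A minimal PyTorch sketch of one such fusion block follows; the channel sizes, the 2x upsampling factor, and reading the "regularization operation" as BatchNorm are all assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """One block of the upsampling network: 1x1 conv raises the channel
    count, 3x3 conv refines, each followed by BatchNorm; the enlarged map
    is then added element-wise to the matching backbone output."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.conv2(self.conv1(x))
        x = F.interpolate(x, scale_factor=2, mode="bilinear",
                          align_corners=False)
        return x + skip  # skip must match x in channels and spatial size
```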
5. Character post-processing
After the model converges, the prediction result for each final character region and the text center line are output through a deconvolution module. The center of each character is taken from the Gaussian heat map and expanded with the Vatti algorithm to obtain the complete character region and its boundary coordinate points. Then, using the center-line information, the characters belonging to the same center line (the same text) are recorded in the same set. Based on this character set, four vertices are taken from each character boundary, all vertices are finally ordered clockwise, and after smoothing the corrected text boundary is obtained as the final detection result.
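A sketch of this post-processing with OpenCV and pyclipper (a Python binding of the Vatti clipping algorithm) is given below; the binarization threshold and the area/perimeter offset heuristic are assumptions, not values from the patent:

```python
import cv2
import numpy as np
import pyclipper

def recover_char_boxes(region_map: np.ndarray, thresh: float = 0.6,
                       expand_ratio: float = 1.5):
    """Threshold the character probability map, take connected components,
    and expand each component's contour with the Vatti algorithm."""
    binary = (region_map > thresh).astype(np.uint8)
    num, labels = cv2.connectedComponents(binary)
    boxes = []
    for i in range(1, num):  # label 0 is the background
        mask = (labels == i).astype(np.uint8)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            continue
        poly = contours[0].reshape(-1, 2)
        if len(poly) < 3:
            continue
        # Offset distance from an area/perimeter heuristic.
        area = cv2.contourArea(poly)
        length = cv2.arcLength(poly.reshape(-1, 1, 2), True)
        offset = area * (expand_ratio - 1.0) / max(length, 1e-6)
        pco = pyclipper.PyclipperOffset()
        pco.AddPath(poly.tolist(), pyclipper.JT_ROUND,
                    pyclipper.ET_CLOSEDPOLYGON)
        expanded = pco.Execute(offset)
        if expanded:
            boxes.append(np.array(expanded[0]))
    return boxes
```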
6. Model training
The optimization objective in the model training process is as follows:
$$ L = \sum_{p} S_c(p)\,\bigl\lVert S_r(p) - S_r^*(p) \bigr\rVert_2^2 $$
where S_c(p) denotes the confidence at pixel p, and S_r(p) and S_r*(p) denote the predicted probability value and the generated ground-truth probability value, respectively. SGD is chosen as the optimizer to compute gradients and perform back-propagation. The training batch size is set to 10, for a total of 800 epochs.
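A minimal sketch of this weighted objective and optimizer setup follows; the learning rate and momentum are not given in the text and are illustrative assumptions:

```python
import torch

def weighted_region_loss(pred: torch.Tensor, target: torch.Tensor,
                         conf: torch.Tensor) -> torch.Tensor:
    """sum_p S_c(p) * ||S_r(p) - S_r*(p)||^2, with `conf` holding the
    per-pixel confidence propagated from the word each pixel belongs to."""
    return (conf * (pred - target) ** 2).sum()

# model: backbone + upsampling network + prediction head (defined elsewhere).
# Batch size (10) and epoch count (800) follow the text; lr/momentum do not.
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# for epoch in range(800):
#     for images, targets, conf in loader:  # batches of size 10
#         loss = weighted_region_loss(model(images), targets, conf)
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
```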
7. Model application
After the training process, several models are obtained, and the optimal one (the one with the smallest objective function value) is selected for application. No data augmentation is needed at application time; the image only has to be resized to 800 x 800 and normalized to serve as model input. With the parameters of the whole network model fixed, the detection result for the text content of an image is obtained after feature extraction by the neural network followed by post-processing.
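A sketch of this application step is given below; the ImageNet mean/std normalization and the two-map output signature of the model are assumptions, since the text only says the image is resized and normalized:

```python
import cv2
import numpy as np
import torch

def run_inference(model: torch.nn.Module, image_bgr: np.ndarray):
    """Resize a BGR image to 800 x 800, normalize it, and run the fixed
    network to obtain the character region map and the text center line."""
    img = cv2.resize(image_bgr, (800, 800), interpolation=cv2.INTER_LINEAR)
    x = torch.from_numpy(img[..., ::-1].copy()).float() / 255.0  # BGR -> RGB
    mean = torch.tensor([0.485, 0.456, 0.406])
    std = torch.tensor([0.229, 0.224, 0.225])
    x = ((x - mean) / std).permute(2, 0, 1).unsqueeze(0)  # NCHW batch of 1
    model.eval()
    with torch.no_grad():
        region_map, center_line = model(x)
    return region_map, center_line
```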
The above embodiments are merely preferred examples of the present invention and do not exhaust its possible implementations. Obvious modifications of the above would occur to those of ordinary skill in the art without taking the modified invention beyond the spirit and scope of the present invention.

Claims (10)

1. A weak supervision text detection method based on character supervision information is characterized by comprising the following steps:
S100: extracting features with a backbone network; S200: upsampling the extracted features through an upsampling network;
S300: generating character-level labels for the obtained upsampled features with a watershed algorithm in a weakly supervised manner;
S400: outputting a character region probability map and a text center line after the features fused by the upsampling network pass through four convolutional layers;
S500: after the character probability map is obtained, extracting connected regions with high response values using OpenCV, then expanding each region with the Vatti algorithm to obtain complete character boundaries;
S600: traversing the text center line, treating the characters it passes through as the same text, taking the upper-left, upper-right, lower-right and lower-left points on each character boundary, and finally sorting, connecting and smoothing all the points in each text region to obtain the final detection region.
2. The method of claim 1, wherein the backbone network is a ResNet34 network.
3. The method of claim 2, wherein three convolutional layers are embedded as a block to replace the third layer of the ResNet34 network, each convolutional layer replacing the standard convolution with a dilated convolution kernel, with the dilation rates set to 1, 2 and 3 respectively.
4. The weakly supervised text detection method based on character supervision information as claimed in claim 3, wherein an extra layer is added to the ResNet34 network for further feature extraction.
5. The weakly supervised text detection method based on character supervision information as claimed in claim 4, wherein the upsampling network consists of four blocks, each block convolving the extracted features twice and then upsampling; the output of each block is added element-wise to the output of the corresponding backbone block and then fed into the next block.
6. The weakly supervised text detection method based on character supervision information as claimed in claim 5, wherein the weak supervision generates the character labels by: cropping the corresponding word region according to the provided coordinate information, then obtaining the position of each character with a watershed algorithm and feeding it into the network as annotation information to participate in training.
7. The method as claimed in claim 6, wherein a confidence value is computed after the character result is generated, the confidence value measuring whether this generated result is credible, with the calculation formula:
$$ s_{conf}(w) = \frac{l_c(w) - \min\bigl(l_c(w),\ \lvert l_c(w) - l(w) \rvert\bigr)}{l_c(w)} $$
wherein l(w) denotes the number of characters predicted in the word w and l_c(w) denotes the number of characters the word w contains according to the real label, the result being considered completely credible when the predicted character count equals that of the original word.
8. The character supervision information-based weakly supervised text detection method as recited in claim 7, further comprising, before the feature extraction of S100, S90: a picture resizing step, in which the pictures are adjusted to a uniform size, pictures whose size does not meet the requirement being processed with bilinear interpolation and/or data augmentation.
9. The weakly supervised text detection method based on character supervision information as recited in claim 8, wherein the data augmentation comprises: random rotation by a certain angle, random changes to the image brightness, and random adjustment of the picture saturation.
10. The weakly supervised text detection method based on character supervision information as recited in claim 9, further comprising, before the picture resizing step S90, a weakly supervised training-label preparation step S80, by which a region probability distribution map for each character and a text center line are generated.
CN202110262361.3A 2021-03-10 2021-03-10 Character supervision information-based weak supervision text detection method Pending CN113065547A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110262361.3A CN113065547A (en) 2021-03-10 2021-03-10 Character supervision information-based weak supervision text detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110262361.3A CN113065547A (en) 2021-03-10 2021-03-10 Character supervision information-based weak supervision text detection method

Publications (1)

Publication Number Publication Date
CN113065547A true CN113065547A (en) 2021-07-02

Family

ID=76560288

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110262361.3A Pending CN113065547A (en) 2021-03-10 2021-03-10 Character supervision information-based weak supervision text detection method

Country Status (1)

Country Link
CN (1) CN113065547A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147786A (en) * 2019-04-11 2019-08-20 北京百度网讯科技有限公司 For text filed method, apparatus, equipment and the medium in detection image
CN111553346A (en) * 2020-04-26 2020-08-18 佛山市南海区广工大数控装备协同创新研究院 Scene text detection method based on character region perception
CN111798480A (en) * 2020-07-23 2020-10-20 北京思图场景数据科技服务有限公司 Character detection method and device based on single character and character connection relation prediction
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210702)