CN113065548A - Feature-based text detection method and device - Google Patents

Feature-based text detection method and device

Info

Publication number
CN113065548A
CN113065548A
Authority
CN
China
Prior art keywords
feature
feature map
module
processor
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110262507.4A
Other languages
Chinese (zh)
Inventor
刘义江
李云超
姜琳琳
吴彦巧
姜敬
檀小亚
师孜晗
陈蕾
侯栋梁
池建昆
范辉
阎鹏飞
魏明磊
辛锐
陈曦
杨青
沈静文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiongan New Area Power Supply Company State Grid Hebei Electric Power Co
State Grid Hebei Electric Power Co Ltd
Original Assignee
Xiongan New Area Power Supply Company State Grid Hebei Electric Power Co
State Grid Hebei Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiongan New Area Power Supply Company State Grid Hebei Electric Power Co, State Grid Hebei Electric Power Co Ltd filed Critical Xiongan New Area Power Supply Company State Grid Hebei Electric Power Co
Priority to CN202110262507.4A priority Critical patent/CN113065548A/en
Publication of CN113065548A publication Critical patent/CN113065548A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/153 Segmentation of character regions using recognition of characters or words

Abstract

The invention discloses a feature-based text detection method and device, relating to the technical field of text detection in natural scenes. The method comprises: S1 semantic segmentation, in which a first feature map containing global features is obtained from a picture through a first neural network; S2 feature fusion, in which regions of interest are aligned to obtain feature maps of the same size, the second feature map being the region of interest carrying the information required for detection and the third feature map being the region of interest carrying the information required for masking, after which the first and second feature maps are fused into a fourth feature map and the first and third feature maps are fused into a fifth feature map; S3 detection, in which category prediction and bounding-box refinement are performed on the fourth feature map to obtain a horizontal rectangular box; S4 masking, in which a convolution operation is performed on the fifth feature map to obtain the corresponding mask map. The device comprises four program modules: a semantic segmentation module, a detection module, a mask module and a feature fusion module. Through steps S1 to S4 and the like, it realizes general text detection in natural scenes.

Description

Feature-based text detection method and device
Technical Field
The invention relates to the technical field of text detection in natural scenes, and in particular to a feature-based text detection method and device.
Background
General text detection plays a very important role in fields where computer vision occupies an important position, such as automatic driving and intelligent navigation. Although some commercial applications already exist, text recognition in natural scenes remains difficult: compared with scanned documents, pictures taken in natural scenes often suffer from poor lighting and noisy backgrounds, and the text itself may be curved, distorted by perspective, or blurred, so the performance of current mainstream algorithms cannot be guaranteed in complex environments. In particular, when occlusion or blurring is present, current mainstream methods do not consider the global information of the text, so missed detections and false detections are possible.
For the text detection problem, existing deep-learning-based methods fall mainly into character-based and word-based detection algorithms. Character-based algorithms detect the characters present in a picture with a pre-designed character detector and then connect them into words or text lines according to prior knowledge. By contrast, word-level detection algorithms, which detect words directly, are more efficient and simpler, but they are generally ineffective at detecting text of arbitrary shape.
To solve this problem, some word-based methods further apply instance segmentation to text detection, solving the detection of arbitrarily shaped text while offering higher robustness in complex scenes involving curvature, perspective and the like. However, existing instance-segmentation-based methods still have two major limitations.
First, these methods detect text based only on a single region of interest (RoI) without regard to the global context, so they tend to produce inaccurate detection results from limited visual information.
Second, existing methods do not model word semantics at different levels, which increases the probability of false positives.
Problems and considerations with respect to the prior art:
how to solve the technical problem of general text detection in natural scenes.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a feature-based text detection method and apparatus that realize general text detection in natural scenes through steps S1 to S4 and the like.
In order to solve the above technical problem, the technical scheme adopted by the invention is as follows: a feature-based text detection method, based on a first neural network and a region generation network connected in sequence, the first neural network comprising a base network whose final fully connected layer is removed, followed by a feature pyramid network, the method comprising the following steps. S1 semantic segmentation: the processor acquires a picture from the memory and obtains, through the first neural network, a first feature map containing global features from the picture. S2 feature fusion: the processor acquires the regions of interest formed by the first neural network and the region generation network, aligns them, and obtains a second feature map and a third feature map of the same size, the second feature map being the region of interest carrying the information required for detection and the third feature map the region of interest carrying the information required for masking; the processor then fuses the first and second feature maps into a fourth feature map with fused channel information, and fuses the first and third feature maps into a fifth feature map with fused channel information. S3 detection: the processor performs category prediction and bounding-box refinement on the fourth feature map to obtain a horizontal rectangular box. S4 masking: the processor performs a convolution operation on the fifth feature map to obtain the corresponding mask map.
A further technical scheme: in step S1, the processor also obtains a global segmentation map from the picture through the first neural network; in step S3, the processor performs category prediction and bounding-box refinement on the fourth feature map and obtains binary classification information and bounding-box regression information for the region of interest; in step S4, the processor performs a convolution operation on the fifth feature map and obtains a corresponding local segmentation map.
A further technical scheme: the method further comprises step S5, weak supervision, in which the processor acquires the coordinates of the four vertices, in the picture, of each region of interest formed by the region generation network, the global segmentation map formed by the semantic segmentation module, the binary classification and bounding-box regression information of the regions of interest formed by the detection module, and the local segmentation map formed by the mask module, and completes training in a weakly supervised manner through a model M, wherein the model M is an initial model trained on a data set labeled with characters and words.
A further technical scheme: in step S5, the model M is an initial model trained on a data set labeled with both characters and words.
A feature-based text detection device comprises a first neural network and a region generation network connected in sequence, together with a semantic segmentation module, a detection module, a mask module and a feature fusion module, the first neural network comprising a base network whose final fully connected layer is removed, followed by a feature pyramid network. The semantic segmentation module is used for the processor to acquire the picture from the memory and obtain, through the first neural network, a first feature map containing global features. The feature fusion module is used for the processor to acquire the regions of interest formed by the first neural network and the region generation network, align them, and obtain a second feature map and a third feature map of the same size, the second feature map being the region of interest carrying the information required for detection and the third feature map the region of interest carrying the information required for masking; the processor fuses the first and second feature maps into a fourth feature map with fused channel information and fuses the first and third feature maps into a fifth feature map with fused channel information. The detection module is used for the processor to perform category prediction and bounding-box refinement on the fourth feature map and obtain a horizontal rectangular box. The mask module is used for the processor to perform a convolution operation on the fifth feature map and obtain the corresponding mask map.
A further technical scheme: the semantic segmentation module is also used for the processor to obtain a global segmentation map from the picture through the first neural network; the detection module is also used for the processor to perform category prediction and bounding-box refinement on the fourth feature map and obtain binary classification information and bounding-box regression information for the region of interest; and the mask module is also used for the processor to perform a convolution operation on the fifth feature map and obtain a corresponding local segmentation map.
A further technical scheme: the device further comprises a weak supervision module, a program module used for acquiring the coordinates of the four vertices, in the picture, of each region of interest formed by the region generation network, the global segmentation map formed by the semantic segmentation module, the binary classification and bounding-box regression information of the regions of interest formed by the detection module, and the local segmentation map formed by the mask module, and for completing training in a weakly supervised manner through a model M, wherein the model M is an initial model trained on a data set labeled with characters and words.
A further technical scheme: in the weak supervision module, the model M is an initial model trained on a data set labeled with both characters and words.
A feature-based text detection apparatus comprises a memory, a processor, and the above program modules stored in the memory and executable on the processor; the processor implements the steps of the above feature-based text detection method when executing the program modules.
A feature-based text detection apparatus is a computer-readable storage medium on which the above program modules are stored; when executed by a processor, they implement the steps of the feature-based text detection method described above.
The beneficial effects produced by the above technical scheme are as follows:
The feature-based text detection method, based on a first neural network and a region generation network connected in sequence and comprising steps S1 semantic segmentation, S2 feature fusion, S3 detection and S4 masking as described above, realizes general text detection in natural scenes through steps S1 to S4 and the like.
The feature-based text detection device, comprising the first neural network, the region generation network, and the semantic segmentation, detection, mask and feature fusion modules described above, realizes general text detection in natural scenes through those modules.
The apparatus comprising a memory, a processor and the program modules stored in the memory, and the apparatus that is a computer-readable storage medium storing those program modules, likewise realize general text detection in natural scenes.
See the detailed description of the preferred embodiments.
Drawings
FIG. 1 is a flow chart of embodiment 1 of the present invention;
FIG. 2 is a schematic block diagram of embodiment 2 of the present invention;
FIG. 3 is a data flow diagram in the present invention;
FIG. 4 is a schematic block diagram of the region generation network in the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the drawings; obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the application or its uses. All other embodiments that a person skilled in the art can derive from the embodiments given here without creative effort fall within the protection scope of the present application.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, but the present application may be practiced in other ways than those described herein, and it will be apparent to those of ordinary skill in the art that the present application is not limited to the specific embodiments disclosed below.
Embodiment 1:
As shown in FIG. 1, the present invention discloses a feature-based weakly supervised text detection method, based on a first neural network and a region generation network RPN connected in sequence, the first neural network comprising the base network ResNet50 with its final fully connected layer removed, followed by the feature pyramid network FPN. The method comprises the following steps:
s1 semantic segmentation
The processor acquires a picture from the memory and, through the first neural network, obtains a first feature map and a global segmentation map from the picture, the first feature map being a feature map containing global features. The processor sends the first feature map to the feature fusion module and the global segmentation map to the weak supervision module.
S2 feature fusion
The processor acquires the first feature map and the regions of interest (ROIs) sent by the first neural network and the region generation network (RPN), performs a region-of-interest alignment (ROIAlign) operation on the ROIs, and obtains regions of interest of the same size, comprising a second feature map and a third feature map: the second feature map is the region of interest carrying the basic information required for detection, and the third feature map is the region of interest carrying the basic information required for masking. The processor fuses the first and second feature maps into a fourth feature map, whose channel information combines that of the first and second feature maps, and sends it to the detection module; it likewise fuses the first and third feature maps into a fifth feature map, whose channel information combines that of the first and third feature maps, and sends it to the mask module.
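A minimal sketch of this alignment-and-fusion step, assuming torchvision's roi_align, a stride-4 feature map, and fixed output sizes (the 14x14 mask size is stated later in the description; the 7x7 detection size is an assumption of this example):

import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 160, 160)      # FPN-fused map (stride 4 assumed)
first_map = torch.randn(1, 256, 160, 160)        # first feature map with global features
rois = torch.tensor([[0., 32., 48., 96., 80.]])  # (batch_index, x1, y1, x2, y2) in image coords

# region-of-interest alignment: fixed-size crops for the two branches
second_map = roi_align(feature_map, rois, output_size=(7, 7), spatial_scale=0.25)
third_map = roi_align(feature_map, rois, output_size=(14, 14), spatial_scale=0.25)

# crop the same regions from the global map and fuse channel-wise; the 3x3 + 1x1
# convolution fusion itself is sketched under the feature fusion module below
global_det = roi_align(first_map, rois, output_size=(7, 7), spatial_scale=0.25)
fourth_map = torch.cat([second_map, global_det], dim=1)   # (1, 512, 7, 7)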
S3 detection
The processor acquires the fourth feature map, performs category prediction and bounding-box refinement on it, and obtains a horizontal rectangular box, the binary classification information of the region of interest, and the bounding-box regression information.
S4 mask
The processor acquires the fifth feature map, performs a convolution operation on it, and obtains the corresponding mask map and local segmentation map.
S5 weak supervision
The processor acquires the coordinates of the four vertices, in the picture, of each region of interest formed by the region generation network RPN, the global segmentation map sent by the semantic segmentation module, the binary classification and bounding-box regression information of the regions of interest sent by the detection module, and the local segmentation map sent by the mask module, and completes training in a weakly supervised manner through a model M, where M is an initial model trained on a data set labeled with both characters and words.
The region generation network RPN, the base network ResNet50 and the feature pyramid network FPN are prior art and are not described again here.
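For orientation only, a minimal PyTorch sketch of such a backbone might be assembled as follows; torchvision is assumed (version >= 0.13 for the weights argument), and the module and layer naming are illustrative choices of this sketch, not the patent's code:

import torch
import torchvision
from torchvision.models._utils import IntermediateLayerGetter
from torchvision.ops import FeaturePyramidNetwork

# ResNet50 initialized with ImageNet weights, as the description suggests;
# only the four residual stages are read, so the final FC layer is never used
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
body = IntermediateLayerGetter(
    resnet, return_layers={"layer1": "c2", "layer2": "c3", "layer3": "c4", "layer4": "c5"})
fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024, 2048], out_channels=256)

image = torch.randn(1, 3, 640, 640)   # 640 x 640 is the training size stated later
features = fpn(body(image))           # ordered dict of 256-channel multi-scale maps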
Embodiment 2:
As shown in FIG. 2, the present invention discloses a feature-based weakly supervised text detection device, comprising a first neural network and a region generation network RPN connected in sequence, together with a semantic segmentation module, a detection module, a mask module, a feature fusion module and a weak supervision module, the first neural network comprising the base network ResNet50 with its final fully connected layer removed, followed by the feature pyramid network FPN.
The semantic segmentation module is used for the processor to acquire the picture from the memory and to obtain, through the first neural network, a first feature map and a global segmentation map from it, the first feature map being a feature map containing global features; the processor sends the first feature map to the feature fusion module and the global segmentation map to the weak supervision module.
The feature fusion module is used for the processor to acquire the first feature map and the regions of interest (ROIs) sent by the first neural network and the region generation network RPN, to perform a ROIAlign operation on the ROIs, and to obtain regions of interest of the same size, comprising a second feature map and a third feature map: the second feature map is the region of interest carrying the basic information required for detection, and the third feature map is the region of interest carrying the basic information required for masking. The processor fuses the first and second feature maps into a fourth feature map, whose channel information combines that of the first and second feature maps, and sends it to the detection module; it likewise fuses the first and third feature maps into a fifth feature map, whose channel information combines that of the first and third feature maps, and sends it to the mask module.
The detection module is used for the processor to acquire the fourth feature map, perform category prediction and bounding-box refinement on it, and obtain the horizontal rectangular box, the binary classification information of the region of interest, and the bounding-box regression information.
The mask module is used for the processor to acquire the fifth feature map, perform a convolution operation on it, and obtain the corresponding mask map and local segmentation map.
The weak supervision module is used for acquiring the coordinates of the four vertices, in the picture, of each region of interest formed by the region generation network RPN, the global segmentation map sent by the semantic segmentation module, the binary classification and bounding-box regression information of the regions of interest sent by the detection module, and the local segmentation map sent by the mask module, and for completing training in a weakly supervised manner through a model M, where M is an initial model trained on a data set labeled with both characters and words.
Embodiment 3:
The invention discloses a feature-based text detection apparatus, comprising a memory, a processor, and the computer program of embodiment 2 stored in the memory and executable on the processor; when the processor executes the computer program, the steps of embodiment 1 are implemented.
Embodiment 4:
Disclosed is a computer-readable storage medium storing the computer program of embodiment 2 which, when executed by a processor, implements the steps of embodiment 1.
Technical contribution of the present application:
In order to solve the above problems, the invention provides a weakly supervised text detector based on multi-level features. In addition, the weakly supervised method provided by the invention significantly reduces labeling cost and makes effective use of existing large weakly labeled data sets to train the network; the diversity of the data lets the network learn richer features, improving the model's performance on difficult samples and ensuring its robustness.
The technical scheme of the invention mainly comprises five modules: a semantic segmentation module, a detection module, a mask module, a multi-path feature fusion module and a weak supervision module.
Base network: ResNet50. This network structure is widely used in object classification and related fields and serves as a classical backbone neural network for computer vision tasks. The network contains four residual blocks in total and ends with a fully connected layer for the classification task. The invention discards this final fully connected layer and initializes the network with ResNet parameters pre-trained on the ImageNet data set, which prevents the network from failing to converge and accelerates training.
The feature pyramid network (FPN) provides a way to exploit the bottom-up feature hierarchy that a conventional CNN computes over different scales of the same picture, efficiently generating multi-scale feature representations from a single picture view. It upgrades a conventional CNN cheaply and thus produces more expressive feature maps for the next-stage computer vision task; in essence, it is a method for strengthening the CNN features of the backbone. After fusing the features across spatial scales, the FPN feeds them into the RPN for the next operation.
As shown in FIG. 4, classical object detectors such as RCNN and Fast RCNN usually extract candidate boxes with selective search, which must traverse the whole feature map and is time- and labor-consuming; the invention therefore uses a region generation network to generate candidate boxes more quickly and extract regions of interest (ROIs) in advance, accelerating detection. The RPN receives the feature map extracted by the CNN, first fixes its dimensionality at 256 through a 3x3 convolution layer, and then splits into two branches: the first predicts, for each pixel of the feature map, the probability that text is present at the corresponding position of the original image, and the second refines each preset anchor.
Here k denotes the number of preset anchors: the first branch has dimension 2k, where 2 represents the probabilities of text and non-text; the second branch has dimension 4k, where 4 represents the refinement of each preset anchor, further adjusting its size and position.
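A minimal sketch of this two-branch head (a 3x3 convolution fixing 256 dimensions, then 1x1 branches of width 2k and 4k); k = 9 anchors per location is an assumed value of the example:

import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=256, k=9):   # k preset anchors per location (assumed)
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)  # fix at 256 dims
        self.cls = nn.Conv2d(256, 2 * k, kernel_size=1)  # branch 1: text / non-text probability
        self.reg = nn.Conv2d(256, 4 * k, kernel_size=1)  # branch 2: anchor size/position refinement

    def forward(self, feature_map):
        h = self.conv(feature_map).relu()
        return self.cls(h), self.reg(h)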
As shown in FIG. 3, the semantic segmentation module extracts the whole of the text, which contains global information. Specifically, in this module a feature map — the feature map after FPN fusion — is further encoded through convolution operations. Encoding here means further feature extraction by convolution + pooling modules: this step gradually reduces the width and height of the feature map while increasing its depth, which gradually enlarges the receptive field of the CNN so that the resulting map summarizes global features. Decoding is the opposite operation: the size (width and height) of the feature map is gradually enlarged and its dimensionality is further reduced. This encoder-decoder idea is common in the segmentation field and is adopted, for example, in fully convolutional networks such as FCN. The resulting feature map supplies global features to the detection and mask modules, which reduces mispredictions, since a local patch may contain textures that merely resemble characters.
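As an illustration of this encode/decode behaviour, a toy PyTorch module might look as follows; the channel widths and depth are assumptions of this sketch, not the patent's configuration:

import torch.nn as nn

class SegEncoderDecoder(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # encoding: conv + pooling halves width/height and deepens the map,
        # enlarging the receptive field so that global features are summarized
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 512, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(512, 1024, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # decoding: the reverse — the size grows back and the depth falls
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(1024, 512, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(512, channels, 3, padding=1), nn.ReLU())
        self.seg = nn.Conv2d(channels, 1, 1)  # global segmentation map (text vs background)

    def forward(self, x):
        g = self.decoder(self.encoder(x))     # feature map with summarized global features
        return g, self.seg(g)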
Detection module. The detection module generates horizontal rectangular boxes to cover words and characters. For each ROI, the feature fusion module first fuses the ROI features with the feature maps at the same position of the global feature map; the semantic segmentation module provides the global segmentation map and its encoded feature map containing global features, while the RPN extracts ROIs to reduce computation. Because ROIs of different widths and heights are generated and the subsequent layers are convolutional, the feature maps fed to the detection and mask branches must be interpolated to a fixed size, i.e. ROIAlign. "Same position" means that for each ROI, the corresponding area of the global feature map is located first and the same ROIAlign operation transforms it to the same size. The specific fusion operation is described under the fusion module; the fused features then pass through three convolution layers for category prediction and box refinement.
Mask module. The mask module likewise receives the feature-map sub-region of each ROI: the ROIs extracted by the RPN from the FPN-fused feature map are transformed by a ROIAlign operation to a specified size (14x14). Similar to the detection branch, the region corresponding to the ROI is first located on the global segmentation feature map, the two are fused through the feature fusion module, and after fusion a convolution operation outputs the mask map of the interior of the corresponding character or word.
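A sketch of what the convolutions after fusion might look like on the 14x14 ROI features; the number and width of the layers are assumptions of the example:

import torch.nn as nn

class MaskHead(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.convs = nn.Sequential(          # operates on the fused 14 x 14 ROI features
            nn.Conv2d(channels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU())
        self.mask = nn.Conv2d(256, 1, 1)     # mask of the character/word interior

    def forward(self, fused_roi):
        return self.mask(self.convs(fused_roi))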
Feature fusion module. The feature fusion module fuses information from the global feature map with word-level (or, equivalently, character-level) information. A number of regions of interest (ROIs) are obtained after the RPN; to ease the subsequent convolution operations, a ROIAlign operation is first applied to each ROI, and the same ROIAlign is applied to the corresponding region located in the semantic segmentation module's feature map. The result then passes through a 3x3 convolution layer and a 1x1 convolution layer, and the fused features are finally used for classification and coordinate regression. At the mask branch, for each word-level instance, the corresponding character-, word- and global-level features can be fused in a multi-path fusion architecture, as sketched below.
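Following that recipe (channel concatenation, a 3x3 convolution, then a 1x1 convolution), one possible sketch of the fusion module, assuming both inputs carry 256 channels:

import torch
import torch.nn as nn

class MultiPathFusion(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.conv3 = nn.Conv2d(2 * channels, channels, 3, padding=1)  # 3x3 layer after concat
        self.conv1 = nn.Conv2d(channels, channels, 1)                 # 1x1 channel fusion

    def forward(self, roi_features, global_features):
        # both inputs have been ROIAligned to the same spatial size
        x = torch.cat([roi_features, global_features], dim=1)
        return self.conv1(self.conv3(x).relu()).relu()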
Weak supervision module. The network of the invention must detect both words and character-level text, so a data set with character-level labels is needed; most current data sets, however, are labeled only at the word level, and character-level labeling of text is time-consuming. The invention therefore provides a weakly supervised learning mechanism to train the model. A complete data set contains both word-level and character-level labels; the weakly supervised method can train the network with large amounts of unlabeled or only weakly labeled data. The supervision signals required by the network come from the RPN module, the semantic segmentation module, the detection module and the mask module; all four require word-level supervision, and the detection module additionally requires a character-level supervision signal. The training data thus contains two types of labels. The model M is obtained by pre-training: an initial model M is trained on a data set labeled with both characters and words, and is then further trained in a weakly supervised manner. For a new data set A containing only word-level labels, character training samples are generated by the pre-trained model M.
Specifically, the model M predicts on the data set A, yielding a set of character candidates for each picture. The score of the word-level prediction at the corresponding position is used to judge them, so that the more reliable character-level predictions are screened out as pseudo labels: a candidate is considered a credible pseudo label when the word-level result of the corresponding region is predicted correctly and its score on the global segmentation map exceeds 0.8.
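A sketch of this screening rule with the stated 0.8 threshold; find_enclosing_word and seg_score are hypothetical helpers introduced here only to make the rule concrete:

def select_pseudo_labels(char_candidates, word_predictions, seg_map, threshold=0.8):
    # keep a character candidate only if the word-level result of its region is
    # predicted correctly and the global segmentation score of its box exceeds 0.8
    trusted = []
    for char in char_candidates:
        word = find_enclosing_word(char, word_predictions)   # hypothetical helper
        if word is not None and word.is_correct and seg_score(seg_map, char.box) > threshold:
            trusted.append(char)                             # credible pseudo label
    return trusted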
The technical scheme is as follows:
The specific process of the feature-based text detection method provided by the invention is as follows:
the short side of the photograph is first resized to 800 pixels. In order to ensure the recognition effect, the system can automatically adjust the brightness of the photos, each brightness photo can be predicted by the same method, and finally, the final prediction result is selected by voting according to all the prediction results.
2. The photo first passes through the ResNet50 network for feature extraction, and the last feature map of each of the four residual blocks of the ResNet (i.e. feature maps from different ResNet blocks) is taken as input to the next stage. In the feature pyramid module, each feature map (C, D, E, F) is fused by upsampling and element-wise addition with the previous feature map. All fused features are upsampled to a uniform size and concatenated, and channel features are fused through two layers of 3x3 convolution to obtain the feature map G, as sketched below. In the RPN, the fused feature map G of the previous stage — the feature map after FPN feature fusion — is fixed to 256 dimensions through one convolution layer and split into two branches: the first predicts, through a 1x1 convolution, the probability that text exists at the original-image position corresponding to each pixel of the feature map G, and the second refines the position of each preset anchor through a 1x1 convolution. After the RPN, the feature map G has been pre-screened and a series of ROIs is obtained. For each ROI, the detection branch first fuses the feature maps at the corresponding positions of the global semantic segmentation map through the feature fusion module, and then refines and classifies the ROI. The mask branch applies the same feature fusion module first and then performs semantic segmentation within each ROI to obtain its segmentation map.
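The construction of the feature map G just described (upsample the fused levels to one size, concatenate, then two 3x3 convolutions) could be sketched as follows; unifying at the finest level's resolution is an assumption of the example:

import torch
import torch.nn.functional as F

def build_feature_map_g(levels, conv_a, conv_b):
    # levels: fused FPN maps (C, D, E, F), finest first; conv_a/conv_b: two 3x3 conv layers
    size = levels[0].shape[-2:]               # unify all maps at the finest size (assumed)
    upsampled = [F.interpolate(l, size=size, mode="bilinear", align_corners=False)
                 for l in levels]
    g = torch.cat(upsampled, dim=1)           # concat operation along channels
    return conv_b(conv_a(g).relu()).relu()    # channel feature fusion -> feature map G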
3. Model training
The loss function of the model is:
L = Lrpn + Lseg + Ldet + Lmask
where Lrpn, Lseg, Ldet and Lmask respectively denote the loss functions of the RPN module, the segmentation (seg) module, the detection (det) module and the mask module.
The data enhancement methods used during training include random cropping, random adjustment of brightness, saturation or hue, and random resizing (bilinear interpolation).
The optimizer chosen is ADADELTA, which computes the gradients for back propagation. The training batch size is set to 8 and the training pictures are 640 x 640. Training uses a ResNet50 pre-trained in advance on ImageNet; all newly added layers are initialized with Gaussian random numbers of mean 0 and variance 0.001, and a total of 1200 epochs are trained. To ensure the convergence speed of the model, a warmup training strategy is adopted.
Pre-training is first carried out on a synthetic data set to accelerate network convergence; the learning rate is then decreased (1e-3 -> 1e-5) and the model is trained on the data set with character labels. After convergence, training continues with the weakly supervised method on the data set that has word labels but no character labels, improving the precision and robustness of the model.
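Collecting the stated training settings into a sketch (ADADELTA, Gaussian initialization of the new layers with mean 0 and variance 0.001, a warmup schedule); the stand-in modules and the warmup length are assumptions of the example:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1))   # stand-in for the full detector
new_heads = model                                       # stand-in for the newly added layers

optimizer = torch.optim.Adadelta(model.parameters())    # ADADELTA computes gradients for backprop

for m in new_heads.modules():                           # initialize only the new layers
    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, mean=0.0, std=0.001 ** 0.5)   # mean 0, variance 0.001
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# linear warmup over the first epochs (length assumed); the 1e-3 -> 1e-5
# decay described above is applied afterwards as training progresses
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda epoch: min(1.0, (epoch + 1) / 5))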
4. Model application
After 1200 epochs of training, several models are obtained and the optimal one — the one with the smallest objective function value — is selected for practical application. For best results, the invention requires pictures that are as clear as possible, with lighting conditions as good as possible, taken horizontally. Larger pictures slow the system down with only limited gains in accuracy, so the invention scales the short edge of the picture to 800 pixels to balance accuracy and speed.
Second feature map: obtained by a three-layer convolution operation (all kernels 3x3) on the region of interest ROI, extracting the basic information required for detection, to be fused with the global features provided by the first feature map.
Third feature map: obtained by a three-layer convolution operation (all kernels 3x3) on the region of interest ROI, extracting the basic information required for masking, to be fused with the global features provided by the first feature map.
Fourth feature map: obtained after the feature fusion module by a 1x1 convolution for further channel-information fusion.
Fifth feature map: same function as the fourth feature map.
The weak supervision module receives from the RPN the coordinates of the four original-image points corresponding to each region of interest; from the detection module, the binary classification information of the region of interest (text or background) and the bounding-box regression information, the regression being a fine adjustment of coordinates; and from the mask module, the local segmentation map information.
After the application had run in confidential trial operation for a period of time, the feedback from field technicians identified the following advantages:
the technical scheme of the application comprehensively considers the attributes of all layers of the text and is mainly used for solving the problem of universal text detection in a natural scene.
The invention provides a method for utilizing the multilevel characteristics of a text: characters in the natural scene picture are detected by three characteristics of character level, word level and global level, and detection precision and recall rate are effectively improved.
Meanwhile, the patent provides a weak supervision method for utilizing a large amount of text data sets without fine marks.

Claims (10)

1. A feature-based text detection method, characterized in that it comprises the following steps: S1 semantic segmentation: a processor acquires a picture from a memory and obtains, through a first neural network, a first feature map containing global features from the picture; S2 feature fusion: the processor acquires the regions of interest formed by the first neural network and a region generation network, aligns them, and obtains a second feature map and a third feature map of the same size, the second feature map being the region of interest carrying the information required for detection and the third feature map the region of interest carrying the information required for masking, and the processor fuses the first and second feature maps into a fourth feature map with fused channel information and fuses the first and third feature maps into a fifth feature map with fused channel information; S3 detection: the processor performs category prediction and bounding-box refinement on the fourth feature map and obtains a horizontal rectangular box; S4 masking: the processor performs a convolution operation on the fifth feature map and obtains the corresponding mask map.
2. The feature-based text detection method of claim 1, characterized in that: in step S1, the processor also obtains a global segmentation map from the picture through the first neural network; in step S3, the processor performs category prediction and bounding-box refinement on the fourth feature map and obtains binary classification information and bounding-box regression information for the region of interest; in step S4, the processor performs a convolution operation on the fifth feature map and obtains a corresponding local segmentation map.
3. The feature-based text detection method of claim 2, characterized in that: the method further comprises step S5, weak supervision, in which the processor acquires the coordinates of the four vertices, in the picture, of each region of interest formed by the region generation network, the global segmentation map formed by the semantic segmentation module, the binary classification and bounding-box regression information of the regions of interest formed by the detection module, and the local segmentation map formed by the mask module, and completes training in a weakly supervised manner through a model M, wherein the model M is an initial model trained on a data set labeled with characters and words.
4. The feature-based text detection method of claim 3, characterized in that: in step S5, the model M is an initial model trained on a data set labeled with both characters and words.
5. A feature-based text detection device, characterized in that: it comprises a first neural network and a region generation network connected in sequence, together with a semantic segmentation module, a detection module, a mask module and a feature fusion module, the first neural network comprising a base network whose fully connected layer is removed, followed by a feature pyramid network; the semantic segmentation module is used for a processor to acquire the picture from a memory and obtain, through the first neural network, a first feature map containing global features from it; the feature fusion module is used for the processor to acquire the regions of interest formed by the first neural network and the region generation network, align them, and obtain a second feature map and a third feature map of the same size, the second feature map being the region of interest carrying the information required for detection and the third feature map the region of interest carrying the information required for masking, the processor fusing the first and second feature maps into a fourth feature map with fused channel information and fusing the first and third feature maps into a fifth feature map with fused channel information; the detection module is used for the processor to perform category prediction and bounding-box refinement on the fourth feature map and obtain a horizontal rectangular box; and the mask module is used for the processor to perform a convolution operation on the fifth feature map and obtain the corresponding mask map.
6. The feature-based text detection device of claim 5, characterized in that: the semantic segmentation module is also used for the processor to obtain a global segmentation map from the picture through the first neural network; the detection module is also used for the processor to perform category prediction and bounding-box refinement on the fourth feature map and obtain binary classification information and bounding-box regression information for the region of interest; and the mask module is also used for the processor to perform a convolution operation on the fifth feature map and obtain a corresponding local segmentation map.
7. The feature-based text detection device of claim 6, characterized in that: it further comprises a weak supervision module, a program module used for acquiring the coordinates of the four vertices, in the picture, of each region of interest formed by the region generation network, the global segmentation map formed by the semantic segmentation module, the binary classification and bounding-box regression information of the regions of interest formed by the detection module, and the local segmentation map formed by the mask module, and for completing training in a weakly supervised manner through a model M, wherein the model M is an initial model trained on a data set labeled with characters and words.
8. The feature-based text detection device of claim 7, characterized in that: in the weak supervision module, the model M is an initial model trained on a data set labeled with both characters and words.
9. A feature-based text detection apparatus, characterized in that: it comprises a memory, a processor, and program modules stored in the memory and executable on the processor, the processor implementing the steps of the feature-based text detection method of any one of claims 1 to 4 when executing the program modules.
10. A feature-based text detection apparatus, characterized in that: it is a computer-readable storage medium storing the program modules of claims 5 to 8, which, when executed by a processor, implement the steps of the feature-based text detection method of any one of claims 1 to 4.
CN202110262507.4A 2021-03-10 2021-03-10 Feature-based text detection method and device Pending CN113065548A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110262507.4A CN113065548A (en) 2021-03-10 2021-03-10 Feature-based text detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110262507.4A CN113065548A (en) 2021-03-10 2021-03-10 Feature-based text detection method and device

Publications (1)

Publication Number Publication Date
CN113065548A true CN113065548A (en) 2021-07-02

Family

ID=76560295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110262507.4A Pending CN113065548A (en) 2021-03-10 2021-03-10 Feature-based text detection method and device

Country Status (1)

Country Link
CN (1) CN113065548A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909642A (en) * 2019-11-13 2020-03-24 南京理工大学 Remote sensing image target detection method based on multi-scale semantic feature fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIAN YE et al.: "TextFuseNet: Scene Text Detection with Richer Fused Features", Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114092695A (en) * 2022-01-21 2022-02-25 武汉精立电子技术有限公司 ROI extraction method and device based on segmentation model

Similar Documents

Publication Publication Date Title
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN111915627B (en) Semantic segmentation method, network, device and computer storage medium
CN111723585B (en) Style-controllable image text real-time translation and conversion method
CN108647585B (en) Traffic identifier detection method based on multi-scale circulation attention network
CN108509978B (en) Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
CN107341517B (en) Multi-scale small object detection method based on deep learning inter-level feature fusion
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN113688652B (en) Abnormal driving behavior processing method and device
CN112132156B (en) Image saliency target detection method and system based on multi-depth feature fusion
CN110874841A (en) Object detection method and device with reference to edge image
US11042742B1 (en) Apparatus and method for detecting road based on convolutional neural network
CN111696110B (en) Scene segmentation method and system
CN110135446B (en) Text detection method and computer storage medium
CN113076871A (en) Fish shoal automatic detection method based on target shielding compensation
CN112927209B (en) CNN-based significance detection system and method
CN111723841A (en) Text detection method and device, electronic equipment and storage medium
CN109740585A (en) A kind of text positioning method and device
CN115131797A (en) Scene text detection method based on feature enhancement pyramid network
CN113591719A (en) Method and device for detecting text with any shape in natural scene and training method
CN114861842B (en) Few-sample target detection method and device and electronic equipment
CN114898372A (en) Vietnamese scene character detection method based on edge attention guidance
CN111178363B (en) Character recognition method, character recognition device, electronic equipment and readable storage medium
CN113570540A (en) Image tampering blind evidence obtaining method based on detection-segmentation architecture
CN114882204A (en) Automatic ship name recognition method
CN114330234A (en) Layout structure analysis method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210702