CN112150493B - Semantic guidance-based screen area detection method in natural scene - Google Patents
Semantic guidance-based screen area detection method in natural scene Download PDFInfo
- Publication number
- CN112150493B (grant of application CN202011004389.9A / CN202011004389A)
- Authority
- CN
- China
- Prior art keywords
- image
- screen
- edge
- module
- semantic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 19
- 238000003708 edge detection Methods 0.000 claims abstract description 71
- 238000003709 image segmentation Methods 0.000 claims abstract description 37
- 238000012549 training Methods 0.000 claims abstract description 11
- 230000004927 fusion Effects 0.000 claims description 46
- 238000000034 method Methods 0.000 claims description 31
- 230000006870 function Effects 0.000 claims description 19
- 230000009466 transformation Effects 0.000 claims description 11
- 238000000605 extraction Methods 0.000 claims description 8
- 238000007781 pre-processing Methods 0.000 claims description 8
- 238000012216 screening Methods 0.000 claims description 8
- 238000013528 artificial neural network Methods 0.000 claims description 7
- 238000012805 post-processing Methods 0.000 claims description 6
- 230000007246 mechanism Effects 0.000 claims description 5
- 238000013526 transfer learning Methods 0.000 claims description 4
- 238000012937 correction Methods 0.000 claims description 3
- 230000001965 increasing effect Effects 0.000 claims description 3
- 238000012163 sequencing technique Methods 0.000 claims description 3
- 230000010339 dilation Effects 0.000 claims 2
- 230000005284 excitation Effects 0.000 claims 1
- 239000000284 extract Substances 0.000 claims 1
- 238000004422 calculation algorithm Methods 0.000 abstract description 11
- 238000005516 engineering process Methods 0.000 abstract description 4
- 238000011160 research Methods 0.000 abstract description 3
- 238000010586 diagram Methods 0.000 description 6
- 238000013135 deep learning Methods 0.000 description 5
- 230000004913 activation Effects 0.000 description 4
- 238000013527 convolutional neural network Methods 0.000 description 4
- 238000012545 processing Methods 0.000 description 4
- 238000013461 design Methods 0.000 description 3
- 230000009467 reduction Effects 0.000 description 3
- 239000013598 vector Substances 0.000 description 3
- 238000004364 calculation method Methods 0.000 description 2
- 238000012544 monitoring process Methods 0.000 description 2
- 230000008569 process Effects 0.000 description 2
- 230000011218 segmentation Effects 0.000 description 2
- 238000004088 simulation Methods 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 230000002708 enhancing effect Effects 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 230000035945 sensitivity Effects 0.000 description 1
- 238000000926 separation method Methods 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/13—Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G06T3/02—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
- G06T3/4084—Transform-based scaling, e.g. FFT domain scaling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
The invention addresses the problems of locating the screen position in a natural scene and of the coarse screen edges produced by edge detection techniques based on Fully Convolutional Networks (FCN), and provides a semantic-guidance-based screen area detection method for natural scenes. A semantically guided edge detection network is proposed for screen edge detection. The network is divided into two parts: one part, composed of deconvolution modules, completes an image segmentation task; the other part fuses feature maps of different scales and then performs the image edge detection task. The model is trained on the image segmentation and image edge detection tasks simultaneously, and the outputs of the two tasks are finally fused into the final edge image. In the screen area positioning stage, straight lines are detected in the edge image by the Hough Transform, coincident lines are removed, the screen corner points that meet the conditions are extracted, the tilt of the region is corrected by an Affine Transform, and the screen content image is finally obtained.
Description
Technical Field
The invention relates to the field of deep learning and computer vision, in particular to an edge detection network and a screen positioning method based on semantic guidance.
Background
With the advance of science and technology, the computing power of portable devices such as mobile phones keeps growing, and camera-equipped mobile devices are increasingly widespread, making it convenient to take photos and videos. People often need such devices to record important information shown on a screen, but when a screen is photographed, the background outside the screen is inevitably captured as well, and it strongly interferes with subsequent processing of the screen content.
On the other hand, when screen content is photographed with a portable device such as a mobile phone in a natural scene, many environmental factors inevitably interfere with the shot, and this interference degrades the accuracy of subsequent screen edge detection. A screen positioning technique suited to natural conditions is therefore needed to locate the screen accurately and reduce the interference of external noise on the analysis of the screen content. Research on screen positioning in natural scenes is still scarce, and further exploration of this topic is urgently needed.
In the field of computer vision, screens are usually detected with traditional edge detection methods: edge detection is applied to the whole image, and the target screen edge is then picked out from the many image edges by matching hand-crafted features. Traditional edge detection, however, has unavoidable drawbacks. On one hand, detecting every edge in the picture introduces many interfering edge pixels from the natural scene, which makes the subsequent search for the target edge via hand-crafted features harder. On the other hand, most traditional edge detectors require a manually set threshold to tune the sensitivity of edge detection: set too low, so many interfering edges are detected that hand-crafted feature matching becomes impossible; set too high, the desired screen edge is not detected at all.
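The threshold sensitivity described above can be illustrated with a minimal gradient-magnitude edge detector — a hedged numpy sketch on synthetic data, not the patent's method or any specific prior-art detector:

```python
import numpy as np

def sobel_edges(img, threshold):
    """Toy gradient-magnitude edge detector: Sobel filtering
    followed by a manually chosen threshold."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(patch * kx)
            gy[i, j] = np.sum(patch * ky)
    mag = np.hypot(gx, gy)
    return mag > threshold

# Synthetic scene: a bright "screen" rectangle over background noise.
rng = np.random.default_rng(0)
img = rng.normal(0, 10, (64, 64))   # noisy background
img[16:48, 16:48] += 200.0          # screen region

low = sobel_edges(img, 50)    # low threshold: noise edges leak in
high = sobel_edges(img, 400)  # high threshold: mostly the screen border

print(low.sum(), high.sum())  # the low threshold fires far more often
```

The same image yields very different edge maps depending on the threshold, which is exactly the manual-tuning problem the patent attributes to traditional detectors.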
Chinese patent application publication CN102236784A discloses detecting the edges of a screen by scanning suspected edges in the image with the Hough transform and fitting multiple lines. US patent application publication US20080266253A discloses a system for tracking a light spot in a computer projection area: the system binarizes the captured image and screens quadrilaterals from the binarized pixels to obtain the screen area. However, these algorithms that detect the screen edge with traditional methods cannot cope with the requirements of different scenes, and their resistance to interference is weak.
With the development of artificial intelligence, edge detection algorithms based on deep learning have been widely studied in recent years. With the proposal of edge detection algorithms based on deep Convolutional Neural Networks, such as the classical Holistically-Nested Edge Detection (HED) and Richer Convolutional Features (RCF), deep-learning-based detection has achieved good results, and its detection performance keeps improving along with the architectures of deep convolutional networks.
Meanwhile, considering that the image edges output by deep-learning-based edge detection networks tend to be coarse and blurry, the invention designs an edge detection network based on semantic guidance: by combining the image segmentation task with the image edge detection task, the rich semantic information of image segmentation is merged into edge detection to obtain a more refined screen edge image.
Disclosure of Invention
The aim of the invention is a method for obtaining the screen area in a natural scene through a semantically guided edge detection network and a screen area positioning algorithm. A screen area detection system is built on this method: the semantically guided edge detection network runs on a GPU module at the server side, while the screen edge corner screening algorithm used in the subsequent screen area positioning stage runs on a CPU module at the front end or client side. Separating the front and back ends reduces the computation at the front end and thus improves the screen detection efficiency of the system.
The invention provides a semantic-guidance-based method for detecting the screen area in a natural scene, comprising: an image preprocessing module that preprocesses images shot in natural scenes, including image denoising and contrast enhancement; and a semantically guided edge detection network, which fuses the rich semantic information of the predicted image from the image segmentation task with the final output prediction map of the image edge detection task, and applies deep supervision with the edge detection labels to obtain a refined edge detection image.
The invention mainly comprises two parts: a semantic-guided edge detection network and a screen edge corner screening algorithm. The method specifically comprises the following steps:
1. acquiring the scene screen image shot by the user's mobile phone, and preprocessing the natural scene image;
2. constructing an edge detection network based on semantic guidance;
3. pre-training the network by utilizing open source data and simulation data in related fields;
4. fine-tuning the pre-trained neural network, in a transfer-learning manner, with a small self-made dataset of annotated natural-scene screen images;
5. performing screen edge detection with the transfer-learned network on the screen edge data prepared in the test set, obtaining the final screen edge image;
6. post-processing the refined screen edge image produced by the edge detection neural network: removing repeated lines and non-edge lines, and screening out the four most likely screen corner points by combining the screen edge characteristics;
7. after the screen edge feature screening algorithm obtains the screen corner points with the highest confidence, an affine transformation is used to adjust the image tilt angle. The affine transformation is expressed as

$$\vec{x}' = A\vec{x} + \vec{b}$$

where $\vec{x}$ and $\vec{b}$ are the pixel-coordinate vector and the translation vector respectively, and $A$ is the affine matrix expressing the rotation, magnification and scaling of the image. In homogeneous coordinates this is equivalent to

$$\begin{pmatrix} \vec{x}' \\ 1 \end{pmatrix} = \begin{pmatrix} A & \vec{b} \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \vec{x} \\ 1 \end{pmatrix}$$

The affine transformation maps each pixel vector $\vec{x}$ of the region inside the original screen image to the angle of an upright screen; the pixel vector becomes $\vec{x}'$, completing the angle-correction transformation.
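The equivalence of the two-step form $\vec{x}' = A\vec{x} + \vec{b}$ and the single homogeneous-coordinate matrix can be checked with a small numpy sketch; the 30° rotation and the translation vector here are illustrative values, not the patent's correction matrix:

```python
import numpy as np

# Affine map x' = A x + b, written as one 3x3 matrix in
# homogeneous coordinates.
theta = np.deg2rad(30)
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # pure rotation
b = np.array([5.0, -2.0])                        # translation

M = np.eye(3)
M[:2, :2] = A          # top-left block: A
M[:2, 2] = b           # last column: b

x = np.array([1.0, 0.0])
x_h = np.append(x, 1.0)           # lift to homogeneous coordinates
x_prime = (M @ x_h)[:2]           # apply and drop the trailing 1

# Same result as applying A and b directly:
assert np.allclose(x_prime, A @ x + b)
print(x_prime)
```

Working in homogeneous coordinates lets rotation, scaling and translation compose by plain matrix multiplication, which is why correction pipelines build a single matrix `M`.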
The edge detection network based on semantic guidance in the steps above is the main content of the invention. It is a two-channel neural network structure based on a fully convolutional neural network; through the two channels the network learns both the image segmentation and the image edge detection task. It comprises a feature extraction module, an image segmentation module, an image edge detection module and a semantic guidance fusion module.
The feature extraction module is a fully convolutional network formed by removing the fully connected layers of VGG16. To enlarge the receptive field of the network without losing much local information, Hybrid Dilated Convolution (HDC) is added to the last two convolutional stages: a group of three convolution kernels with different dilation rates is applied in sequence, which enlarges the receptive field while reducing the gridding holes produced by dilated convolution.
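The receptive-field gain from the hybrid dilated group can be checked with a short calculation. This is a sketch assuming stride-1 3×3 convolutions; the rate sequence [1, 2, 5] is an HDC-style example, since the patent only states that three different dilation rates are used:

```python
def receptive_field(kernel_size, dilation_rates):
    """Effective receptive field of stacked stride-1 dilated
    convolutions: each layer adds (k - 1) * rate pixels."""
    rf = 1
    for rate in dilation_rates:
        rf += (kernel_size - 1) * rate
    return rf

# Three 3x3 convolutions, plain vs. hybrid dilated:
plain = receptive_field(3, [1, 1, 1])   # -> 7
hdc = receptive_field(3, [1, 2, 5])     # -> 17
print(plain, hdc)
```

Choosing rates without a common factor (e.g. 1, 2, 5 rather than 2, 2, 2) is what avoids the "holes" (gridding) the text mentions: consecutive layers fill in the pixels a single large dilation would skip.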
In the image segmentation module, a deconvolution channel is built at the left end of the network. Up-sampling is performed through four deconvolution layers, deconvolving the final high-level semantic feature map of the backbone to the size of the original image; deep supervision with image segmentation labels then drives the network to train on the segmentation task, and a segmented image at original resolution is finally output.
The image edge detection module fuses image features through a multi-scale Feature Fusion Module with an attention mechanism, built from an SE-ResNeXt module that combines SE Block and ResNeXt Block. After the feature map output by each Block of the backbone enters the multi-scale feature fusion module, the feature maps of different scales pass through the SE-ResNeXt module: a ResNeXt operation with a residual grouped-convolution structure enriches the semantic information of the input feature map, which is then fed into the SE module, where each channel is given a learnable weight so that the model actively learns the importance of each channel of the feature map, promoting useful features and suppressing features that are useless for the current task. Finally, deep supervision with the image edge labels drives the network to train on the edge detection task, and an edge image at original resolution is output.
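The channel-reweighting idea of the SE part of this module can be sketched in a few lines of numpy. This is a hedged illustration, not the patent's trained network: the channel count, reduction ratio r=4, and random weights are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(u, w1, w2):
    """Squeeze-and-Excitation: global-average-pool each channel,
    pass the pooled vector through a two-layer bottleneck, and
    rescale the channels with the resulting weights."""
    z = u.mean(axis=(1, 2))                    # squeeze: z in R^C
    s = sigmoid(w2 @ np.maximum(0, w1 @ z))    # excitation: weights in (0, 1)
    return u * s[:, None, None], s             # scale channel-wise

C, r = 8, 4                                    # channels, reduction ratio (assumed)
u = rng.normal(size=(C, 16, 16))               # input feature map
w1 = rng.normal(scale=0.1, size=(C // r, C))   # bottleneck FC weights
w2 = rng.normal(scale=0.1, size=(C, C // r))

out, s = se_block(u, w1, w2)
print(out.shape, s.round(3))
```

Because `w1`/`w2` are learned, the network decides per-channel importance from data — the "promote useful features, suppress useless ones" behaviour described above.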
The semantic guidance fusion module fuses the image features extracted by the edge detection module and the image segmentation module; the semantic features extracted by the image segmentation module guide the model to output more precise image edge features. The outputs of the two tasks are concatenated along the channel dimension and reduced in dimension, fusing the rich semantic information of image segmentation into the image edge detection task and yielding a fine edge detection result.
Further, to train the network better, a weighted cross-entropy loss is adopted, so that the labels fully supervise the feature maps of every layer. The loss function of the m-th layer is expressed as:

$$L^{(m)}(W) = -\frac{|Y^-|}{|Y|} \sum_{j \in Y^+} \log \Pr(x_j; W) \; - \; \frac{|Y^+|}{|Y|} \sum_{j \in Y^-} \log\bigl(1 - \Pr(x_j; W)\bigr)$$

where $\Pr(x_j; W)$ is the activation value of feature-map pixel $x_j$ in the prediction map of the m-th layer, given by the activation function $a_j = \mathrm{sigmoid}(x_j)$; $|Y^+|$ and $|Y^-|$ denote the sets of Ground Truth pixels that are and are not on the screen-area edge, and $W$ denotes all trainable parameters of the network.
When the j-th pixel value of each layer passes through the multi-scale feature fusion module on the left, the per-layer weighting is expressed as

$$L_{side}(W) = \sum_{m=1}^{5} w_m \, L^{(m)}(W)$$

where $w_1 = w_2 = w_3 = w_4 = 0.2$ and $w_5 = 0.28$.
Combining the losses of the layers above, the loss function of the fusion layer is expressed as

$$L^{(fusion)}(W) = -\frac{|Y^-|}{|Y|} \sum_{j \in Y^+} \log \hat{y}^{(fusion)}_j \; - \; \frac{|Y^+|}{|Y|} \sum_{j \in Y^-} \log\bigl(1 - \hat{y}^{(fusion)}_j\bigr)$$

where $\hat{Y}^{(fusion)} = \mathrm{sigmoid}(A^{(fusion)})$ and $A^{(fusion)} = \{a_j^{(side)} \mid j = 1, 2, \ldots, |Y|\}$ is the set of output values of each layer.
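The class-balanced cross-entropy described above can be sketched in numpy. This is a hedged, generic HED-style balanced loss on a made-up prediction, not the trained network's output:

```python
import numpy as np

def balanced_bce(pred, gt, eps=1e-7):
    """Class-balanced binary cross-entropy: the rare edge pixels get
    weight |Y-|/|Y|, the abundant non-edge pixels get |Y+|/|Y|."""
    pred = np.clip(pred, eps, 1 - eps)
    y_pos = gt == 1
    y_neg = gt == 0
    beta = y_neg.sum() / gt.size          # |Y-| / |Y|
    loss = -(beta * np.log(pred[y_pos]).sum()
             + (1 - beta) * np.log(1 - pred[y_neg]).sum())
    return loss

gt = np.zeros((8, 8))
gt[3, :] = 1                              # one thin "edge" row
good = np.where(gt == 1, 0.9, 0.1)        # confident, correct prediction
bad = np.full_like(gt, 0.5)               # uninformative prediction

print(balanced_bce(good, gt) < balanced_bce(bad, gt))  # True
```

Without the `beta` weighting, a network could trivially minimize the loss by predicting "no edge" everywhere, since edge pixels are a tiny fraction of the image — the balancing is what makes thin-edge supervision work.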
Finally the image segmentation task and the image edge task are fused and their final losses are added; the final loss function is expressed as:

$$L_{fusion} = L^{(edge\_fusion)} + L^{(seg\_fusion)}$$

Using the sum of the two losses as the final loss lets the network fuse the rich semantic information of the image segmentation task better, and makes the model converge faster during training.
Furthermore, the screen corner screening algorithm mainly uses the refined screen edge image produced by the proposed edge detection network to screen screen corner points. First, straight lines are detected by the Hough transform and repeated lines are removed with a line deduplication method; all line intersections are put into a set, every four intersections are ranked by the area and perimeter they enclose, and the edge lines enclosing the largest area with the longest perimeter are selected as defining the corner points of the screen image. The line deduplication method is: set a distance threshold $T_d$ and an angle threshold $T_\theta$; if the distance between any two lines is less than $T_d$ and the angle difference between them is less than $T_\theta$, delete the shorter of the two lines. Finally, an affine transformation applied to the obtained screen edge corner points corrects the angle of the screen area, yielding the screen content image.
Due to the adoption of the technical scheme, the invention has the following advantages:
1. the invention designs an edge detection network guided by semantic information, using the semantic information obtained in the image segmentation task to guide the model in predicting image edges. The network makes full use of the rich semantic features of the segmentation task: a series of deconvolutions upsamples the important image features extracted by the backbone to the original image size, and deep supervision with the segmentation labels finally yields the segmented image. The edge image obtained by multi-scale fusion at the right end is then fused with the segmented image, and an edge image label is added for deep supervision, so that the high-level semantic features are fully exploited and a more refined edge image is obtained.
2. the invention provides a multi-scale Feature Fusion Module with an attention mechanism, which fuses the feature maps of different scales output by the backbone and merges the multi-scale feature information into the edge image. By adding an SE-ResNeXt module to the fusion module, the feature map is first sent into a ResNeXt branch with a residual grouped-convolution structure to enrich its semantic information, and then into the SE module, where each channel is given a learnable weight so that the model actively learns the importance of each channel of the feature map.
Drawings
In order to make the purpose, technical scheme and beneficial effect of the invention more clear, the invention provides the following drawings for description:
FIG. 1 is a schematic flow chart of a method for detecting a screen area in a natural scene based on semantic guidance according to the present invention;
FIG. 2 is a schematic flow diagram of an image edge detection network module incorporating semantic information guidance according to the present invention;
FIG. 3 is a schematic diagram of the network architecture for image edge detection with semantic information guidance;
FIG. 4 is a multi-scale feature fusion module of the present invention with attention mechanism;
FIG. 5 is a schematic diagram of a post-processing flow of the screen region detection method of the present invention.
Detailed description of the preferred embodiments
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
The invention provides a semantic-guidance-based screen region detection algorithm in a natural scene, which, as shown in FIG. 1, specifically comprises the following steps:
step 1, preprocessing the natural-scene screen image captured by the user;
step 2, constructing an edge detection neural network that fuses image segmentation semantic information, and inputting the image into the network to detect the screen edge in the natural scene;
step 3, selecting the four screen corner points in the screen edge image with the screen edge corner screening algorithm, and recording the positions of the corner points;
step 4, correcting the tilt angle of the screen region by affine transformation using the recorded corner points;
and step 5, cropping the screen region image after the affine transformation to obtain the final screen content image.
This embodiment provides the concrete implementation steps of the semantic-guidance-based method for detecting the screen area in a natural scene. The screen area detection module in a natural scene comprises: an image preprocessing module, an edge detection module, a screen area positioning module, an affine transformation module and a content acquisition module.
Step 1: obtain the scene screen image shot by a mobile phone, input it into the preprocessing module, and preprocess it with operations such as denoising and contrast enhancement to strengthen the edge characteristics of the input image.
Step 2: input the preprocessed image into the edge detection module, in which a semantically guided edge detection network is constructed as shown in FIG. 2, consisting of a feature extraction module, an image segmentation module, an edge detection module and a semantic fusion module. The feature extraction module is the backbone of the edge detection network, a VGG16 network with the fully connected layers removed; the image segmentation module performs the segmentation task with the semantic features extracted by the feature extraction module, supervised by the image segmentation labels; the edge detection module performs the edge detection task with the per-layer detail features extracted by the feature extraction module, supervised by the image edge labels; and the semantic fusion module performs semantically guided fusion of the semantic features from the image segmentation module and the edge features from the edge detection module to obtain the final edge image.
Step 3: build the edge detection network with the TensorFlow framework. As shown in FIG. 3, the image segmentation channel of the network deconvolves the important image features extracted by the backbone to the original image size with a series of deconvolutions and applies deep supervision with the image segmentation labels, finally obtaining the segmented image. The multi-scale feature fusion module used by the image edge detection channel fuses the multi-scale output feature maps of the backbone and applies deep supervision with the image edge labels. Finally, the edge image obtained by multi-scale fusion at the right end is fused with the segmented image, and an edge image label is added for deep supervision, so that the high-level semantic features are fully used and a more refined edge image is obtained.
Step 4: in the edge detection module of the network, construct a multi-scale Feature Fusion Module, which fuses the feature maps of different scales output by the backbone and merges the multi-scale feature information into the edge image. As shown in FIG. 3, the multi-scale feature module receives feature maps of different scales output by the backbone Blocks and performs residual learning and channel-weight learning through the SE-ResNeXt module, so that the output feature information is richer and, among all received channels, those carrying important feature information are distinguished while unimportant feature channels are suppressed.
Finally, a 1×1 convolution for dimension reduction and an up-sampling operation are applied to the feature maps of different scales, and the feature maps from the 5 channels are concatenated along the channel dimension to obtain an output feature map of original image size with 5 channels. The corresponding weights of the 5 channels are learned through an SE Block to distinguish the importance of each channel; a final 1×1 convolution reduces the dimension to yield the final output of the image edge detection task, supervised by the edge detection labels.
Weight learning for each channel of the input feature map is performed with SE Block. The learned weight information is denoted $z \in \mathbb{R}^C$, produced by squeezing each channel map $u_c$ of spatial size $W \times H$. The c-th element of $z$ is computed as

$$z_c = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i, j)$$

The output $z_c$ can be regarded as a piece of description information for the channel map's weight, representing the weight carried by the current channel.
The features of each layer of the backbone are fused through the multi-scale feature fusion module, important feature information is distinguished from unimportant information through SE Block, and the predicted edge image of the edge detection module is finally output.
Step 5: define the loss function of the semantically guided edge detection network, fusing the semantic features extracted by the image segmentation module with the image edge features extracted by edge detection, and train the network with the newly defined loss. To make training more sufficient, a weighted cross-entropy loss is adopted, so that the labels fully supervise the feature maps of every layer. The loss function of the m-th layer is expressed as:

$$L^{(m)}(W) = -\frac{|Y^-|}{|Y|} \sum_{j \in Y^+} \log \Pr(x_j; W) \; - \; \frac{|Y^+|}{|Y|} \sum_{j \in Y^-} \log\bigl(1 - \Pr(x_j; W)\bigr)$$

where $\Pr(x_j; W)$ is the activation value of feature-map pixel $x_j$ in the prediction map of the m-th layer, given by the activation function $a_j = \mathrm{sigmoid}(x_j)$; $|Y^+|$ and $|Y^-|$ denote the sets of Ground Truth pixels on the screen-area edge and off it, and $W$ denotes all trainable parameters of the network.
When the j-th pixel value of each layer passes through the multi-scale feature fusion module on the left, the per-layer weighting is expressed as

$$L_{side}(W) = \sum_{m=1}^{5} w_m \, L^{(m)}(W)$$

where $w_1 = w_2 = w_3 = w_4 = 0.2$ and $w_5 = 0.28$.
Combining the losses of the layers above, the loss function of the fusion layer is expressed as

$$L^{(fusion)}(W) = -\frac{|Y^-|}{|Y|} \sum_{j \in Y^+} \log \hat{y}^{(fusion)}_j \; - \; \frac{|Y^+|}{|Y|} \sum_{j \in Y^-} \log\bigl(1 - \hat{y}^{(fusion)}_j\bigr)$$

where $\hat{Y}^{(fusion)} = \mathrm{sigmoid}(A^{(fusion)})$ and $A^{(fusion)} = \{a_j^{(side)} \mid j = 1, 2, \ldots, |Y|\}$ is the set of output values of each layer.
Finally, the image segmentation task and the image edge task are fused and their final losses are added; the final loss function is expressed as:

$$L = L^{(edge\_fusion)} + L^{(seg\_fusion)}$$

The network is trained under supervision with dual labels; VGG16 is adopted as the backbone, with the backbone weights shared between the two tasks, and the fine image screen edge is finally obtained by fusing the two tasks.
Step 6: train the constructed edge detection network. By transfer learning, the network is pre-trained with open-source and simulated data from related fields, and the pre-trained network is then fine-tuned with the self-made annotated screen dataset.
Step 7: save the trained edge detection network, deploy it to the GPU module of the server, and put it into a port-listening state. When the client sends an input image through the listening port, the edge detection network deployed on the server automatically runs inference to obtain the edge image corresponding to the input image, and returns the edge image to the client through the corresponding port.
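The client/server exchange in this step can be mimicked with a toy sketch. Everything below is illustrative: the raw-socket protocol, the single-request server, and the byte-inversion placeholder standing in for GPU inference are assumptions, not the patent's actual service.

```python
import queue
import socket
import threading

def edge_server(port_queue, host="127.0.0.1"):
    """Toy stand-in for the deployed service: listen on a port, read an
    image payload, run a placeholder 'edge predictor', send the result back.
    The real system would run the trained network on the server's GPU here."""
    srv = socket.socket()
    srv.bind((host, 0))                   # port 0: let the OS pick a free port
    srv.listen(1)
    port_queue.put(srv.getsockname()[1])  # hand the chosen port to the client
    conn, _ = srv.accept()
    data = conn.recv(1 << 16)             # input image bytes from the client
    edges = bytes(255 - b for b in data)  # placeholder "inference": invert bytes
    conn.sendall(edges)
    conn.close()
    srv.close()

# Client side: send a (tiny fake) image, receive the "edge map" back.
q = queue.Queue()
threading.Thread(target=edge_server, args=(q,), daemon=True).start()
cli = socket.create_connection(("127.0.0.1", q.get()))
cli.sendall(bytes([0, 100, 255]))
result = cli.recv(1 << 16)
cli.close()
```

In practice a production deployment would use an HTTP or RPC framework rather than raw sockets, but the request/response shape matches the port-listening behavior described above.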
Step 8: predict the screen edge image in the natural scene. Call the server-side edge detection network, input the preprocessed image, and receive the refined screen edge image in return.
Step 9: perform post-processing on the screen edge image (a schematic diagram of the post-processing flow is shown in fig. 5). First, Hough-transform line detection is applied to the screen edge image via the OpenCV library, obtaining the screen-edge lines in all similar directions in the edge image.
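For illustration, the Hough line detection that OpenCV provides (`cv2.HoughLines`) can be sketched from first principles in NumPy; the accumulator resolution (one-degree angle bins, one-pixel distance bins) and the vote threshold below are illustrative choices, not parameters from the patent.

```python
import numpy as np

def hough_lines(edge_img, n_theta=180, threshold=50):
    """Vote in (rho, theta) space for every edge pixel; accumulator peaks
    are lines. Returns a list of (rho, theta_in_degrees) pairs."""
    h, w = edge_img.shape
    diag = int(np.ceil(np.hypot(h, w)))       # maximum possible |rho|
    thetas = np.deg2rad(np.arange(n_theta))   # angles 0 .. 179 degrees
    cos_t, sin_t = np.cos(thetas), np.sin(thetas)
    acc = np.zeros((2 * diag + 1, n_theta), dtype=int)
    ys, xs = np.nonzero(edge_img)
    for x, y in zip(xs, ys):
        # each edge pixel votes once per angle: rho = x*cos(t) + y*sin(t)
        rhos = np.round(x * cos_t + y * sin_t).astype(int) + diag
        acc[rhos, np.arange(n_theta)] += 1
    peaks = np.argwhere(acc >= threshold)
    return [(int(r) - diag, float(np.rad2deg(thetas[t]))) for r, t in peaks]
```

A long straight screen edge concentrates many votes in a single (rho, theta) cell, which is why the transform is robust to the small gaps a learned edge map may contain.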
Step 10: remove coincident lines from the detected lines. The de-duplication method is: set a distance threshold $T_d$ and an angle threshold $T_\theta$; if the distance between any two lines is less than $T_d$ and the angle difference between them is less than $T_\theta$, delete the shorter of the two lines.
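A minimal sketch of this de-duplication rule, assuming each line is represented as a (rho, theta-in-degrees, length) triple; processing the lines longest-first guarantees that the shorter member of any coincident pair is the one dropped, as the step requires. The threshold defaults are illustrative.

```python
def dedupe_lines(lines, t_d=10.0, t_theta=5.0):
    """Drop the shorter of any two lines whose distance difference is below
    t_d AND whose angle difference is below t_theta.
    lines: iterable of (rho, theta_degrees, length) tuples."""
    kept = []
    for rho, theta, length in sorted(lines, key=lambda l: -l[2]):  # longest first
        coincident = any(abs(rho - r) < t_d and abs(theta - t) < t_theta
                         for r, t, _ in kept)
        if not coincident:
            kept.append((rho, theta, length))
    return kept
```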
Step 11: sort the intersections of the remaining lines into a set, take four points at a time, and compute the perimeter and the enclosed area; the four points for which these are maximal are taken as the screen-edge corner points in the natural scene.
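This corner search can be sketched as an exhaustive scan over four-point subsets. Combining perimeter and shoelace area into one additive score, and ordering each candidate quad counter-clockwise around its centroid, are illustrative choices; the patent only states that the maximal perimeter and enclosed area identify the corners.

```python
import math
from itertools import combinations

def order_quad(pts):
    """Order 4 points counter-clockwise around their centroid so that
    perimeter and shoelace area are computed on a simple polygon."""
    cx = sum(p[0] for p in pts) / 4.0
    cy = sum(p[1] for p in pts) / 4.0
    return sorted(pts, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))

def perimeter(pts):
    n = len(pts)
    return sum(math.dist(pts[i], pts[(i + 1) % n]) for i in range(n))

def shoelace_area(pts):
    n = len(pts)
    s = sum(pts[i][0] * pts[(i + 1) % n][1]
            - pts[(i + 1) % n][0] * pts[i][1] for i in range(n))
    return abs(s) / 2.0

def pick_screen_corners(intersections):
    """Among all 4-point subsets of the line intersections, return the quad
    maximizing perimeter + enclosed area."""
    best, best_score = None, float("-inf")
    for quad in combinations(intersections, 4):
        q = order_quad(list(quad))
        score = perimeter(q) + shoelace_area(q)
        if score > best_score:
            best, best_score = q, score
    return best
```

With the handful of intersections that survive de-duplication, the combinatorial search is cheap; spurious inner intersections lose to the outer screen corners because they enclose strictly less area.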
Step 12: correct the screen's tilt angle using the screen corner points and an affine transformation, finally obtaining the screen content image.
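The tilt correction can be illustrated by solving for the 2x3 affine matrix from three of the detected corners (an affine map is fully determined by three point correspondences; OpenCV's `cv2.getAffineTransform` does the equivalent). The corner coordinates in the demonstration are made-up values.

```python
import numpy as np

def affine_from_corners(src3, dst3):
    """Solve for M (2x3) such that M @ [x, y, 1]^T = [x', y']^T for three
    source/destination point pairs."""
    A = np.array([[x, y, 1.0] for x, y in src3])  # 3x3 system matrix
    B = np.array(dst3, dtype=float)               # 3x2 target coordinates
    return np.linalg.solve(A, B).T                # 2x3 affine matrix

def warp_points(M, pts):
    """Apply the affine matrix to a list of (x, y) points."""
    P = np.hstack([np.asarray(pts, dtype=float), np.ones((len(pts), 1))])
    return (M @ P.T).T
```

Mapping three screen corners onto the corners of an upright rectangle determines the whole transform; warping the image (or the remaining corner) with it yields the deskewed screen content.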
Claims (5)
1. A method for detecting a screen area in a natural scene based on semantic guidance, characterized in that a screen picture shot in a natural scene is processed to obtain the screen content, specifically comprising the following steps:
step 1, acquiring a scene screen image shot by a mobile phone of a user, and preprocessing an input image;
step 2, constructing an edge detection network based on semantic guidance, comprising a feature extraction module, an image segmentation module, an image edge detection module, and a semantic guidance fusion module; the image segmentation module constructs an extended path through deconvolution to extract image semantic features and segment the image; the image edge detection module extracts and fuses edge features through a multi-scale Feature Fusion Module with an attention mechanism; the semantic guidance fusion module fuses the semantic features extracted by the image segmentation module with the edge features of the image edge detection module to obtain a refined edge image under semantic guidance;
step 3, fine-tuning the network by using a self-made screen edge data set in a transfer learning mode;
step 4, performing screen edge detection on the input image with the trained neural network to obtain a screen edge image;
and 5, performing post-processing operation by using the obtained screen edge image, screening out four screen corner points in the image by combining with the screen edge characteristics, and performing inclination angle correction through affine transformation to obtain a final screen content image.
2. The method as claimed in claim 1, characterized in that the feature extraction module is a fully convolutional network formed by removing the fully connected layers of VGG16; to enlarge the network's receptive field without losing a large amount of local information, a Hybrid Dilated Convolution scheme is added to the last two convolutional layers, in which a group of three convolution kernels with different dilation rates is applied in sequence, so that the holes produced by dilated convolution are reduced while the receptive field is enlarged.
3. The method for detecting a screen area in a natural scene based on semantic guidance as claimed in claim 1, characterized in that the image edge detection module performs image feature fusion through a multi-scale Feature Fusion Module with an attention mechanism, which uses an SE-ResNeXt module obtained by combining an SE block with a ResNeXt block; after entering the multi-scale feature fusion module, the feature maps output at different scales by each block of the backbone network pass through the SE-ResNeXt module: a ResNeXt operation with a residual group-convolution structure enriches the semantic information of the input feature map, and a Squeeze-and-Excitation (SE) operation then assigns a learnable weight to each channel, so that the model actively learns the importance of each channel of the feature map and, according to that importance, can promote useful features and suppress features that are of little use to the current task.
4. The method for detecting a screen area in a natural scene based on semantic guidance as claimed in claim 1, characterized in that the semantic guidance fusion module fuses the image features extracted by the edge detection module and the image segmentation module, using the semantic features extracted by the image segmentation module to guide the model to output finer image edge features; a new model loss function is defined in the semantic guidance fusion module to fuse the two kinds of output feature information, trained under the guidance of the edge labels, and the newly defined loss function is expressed as:
$$L = L_{fusion}\big(f(F_{seg}, F_{edge} \mid X; W);\, W_f\big)$$

where $F_{seg}$ are the semantic features extracted by the image segmentation module, $F_{edge}$ are the edge features extracted by the image edge detection module, $f(\ast \mid W)$ denotes the feature-map fusion operation with convolution parameters $W$, and $L_{fusion}(F; W_f)$ is the cross-entropy function employed:

$$L_{fusion}(F; W_f) = -\frac{1}{N} \sum_{i=1}^{N} \log \Pr\big(y_i \mid F_i; W_f\big)$$

where $F_i$ is the $i$-th pixel of the feature map, $\Pr(y_i \mid F_i)$ is the predicted probability of label $y_i$ at that pixel, $N$ is the total number of image pixels, and $W_f$ is the set of training parameters of the image segmentation task.
5. The method for detecting a screen area in a natural scene based on semantic guidance as claimed in claim 1, characterized in that the post-processing of the screen edge image mainly comprises: performing Hough-transform line detection on the screen edge image; removing coincident lines; sorting the line intersections into a set, taking four points at a time, and computing the perimeter and enclosed area, the four points for which these are maximal being taken as the screen-edge corner points in the natural scene; and finally correcting the screen's tilt angle using the screen corner points and an affine transformation to finally obtain the screen content image.
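The Squeeze-and-Excitation channel re-weighting that claim 3 describes can be sketched in NumPy. The two bottleneck fully connected layers with a ReLU/sigmoid pairing are the standard SE design; the identity weights and tiny feature map in the demonstration are illustrative values, not parameters from the patent.

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-Excitation over a (C, H, W) feature map.
    w1: (C//r, C) and w2: (C, C//r) are the two bottleneck FC layers."""
    z = x.mean(axis=(1, 2))                    # squeeze: global avg pool -> (C,)
    s = np.maximum(w1 @ z + b1, 0.0)           # excitation FC 1 + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ s + b2)))   # excitation FC 2 + sigmoid, in (0, 1)
    return x * s[:, None, None]                # rescale each channel by its weight

# Tiny demonstration: 2 channels, identity excitation weights (ratio r = 1).
x = np.stack([np.ones((2, 2)), np.zeros((2, 2))])  # channel 0 active, channel 1 silent
out = se_block(x, np.eye(2), np.zeros(2), np.eye(2), np.zeros(2))
```

The learned per-channel weight lies in (0, 1), so informative channels are passed through nearly unchanged while uninformative ones are attenuated, which is the "promote useful features and suppress useless ones" behavior the claim describes.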
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011004389.9A CN112150493B (en) | 2020-09-22 | 2020-09-22 | Semantic guidance-based screen area detection method in natural scene |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112150493A CN112150493A (en) | 2020-12-29 |
CN112150493B true CN112150493B (en) | 2022-10-04 |
Family
ID=73897546
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011004389.9A Active CN112150493B (en) | 2020-09-22 | 2020-09-22 | Semantic guidance-based screen area detection method in natural scene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112150493B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112700462A (en) * | 2020-12-31 | 2021-04-23 | 北京迈格威科技有限公司 | Image segmentation method and device, electronic equipment and storage medium |
CN112784718B (en) * | 2021-01-13 | 2023-04-25 | 上海电力大学 | Insulator state identification method based on edge calculation and deep learning |
CN112950615B (en) * | 2021-03-23 | 2022-03-04 | 内蒙古大学 | Thyroid nodule invasiveness prediction method based on deep learning segmentation network |
CN112966691B (en) * | 2021-04-14 | 2022-09-16 | 重庆邮电大学 | Multi-scale text detection method and device based on semantic segmentation and electronic equipment |
CN112926551A (en) * | 2021-04-21 | 2021-06-08 | 北京京东乾石科技有限公司 | Target detection method, target detection device, electronic equipment and storage medium |
CN113192060A (en) * | 2021-05-25 | 2021-07-30 | 上海商汤临港智能科技有限公司 | Image segmentation method and device, electronic equipment and storage medium |
CN113469199A (en) * | 2021-07-15 | 2021-10-01 | 中国人民解放军国防科技大学 | Rapid and efficient image edge detection method based on deep learning |
CN113344827B (en) * | 2021-08-05 | 2021-11-23 | 浙江华睿科技股份有限公司 | Image denoising method, image denoising network operation unit and device |
CN114882091B (en) * | 2022-04-29 | 2024-02-13 | 中国科学院上海微系统与信息技术研究所 | Depth estimation method combining semantic edges |
CN115512368A (en) * | 2022-08-22 | 2022-12-23 | 华中农业大学 | Cross-modal semantic image generation model and method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105163078A (en) * | 2015-09-01 | 2015-12-16 | 电子科技大学 | Screen removal intelligent video monitoring system |
CN108734719A (en) * | 2017-04-14 | 2018-11-02 | 浙江工商大学 | Background automatic division method before a kind of lepidopterous insects image based on full convolutional neural networks |
CN108830855A (en) * | 2018-04-02 | 2018-11-16 | 华南理工大学 | A kind of full convolutional network semantic segmentation method based on the fusion of multiple dimensioned low-level feature |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11188794B2 (en) * | 2017-08-10 | 2021-11-30 | Intel Corporation | Convolutional neural network framework using reverse connections and objectness priors for object detection |
- 2020-09-22 CN CN202011004389.9A patent/CN112150493B/en active Active
Non-Patent Citations (2)
Title |
---|
Richer Convolutional Features for Edge Detection; LIU Y; Proceedings of IEEE Conference on Computer Vision and Pattern Recognition; 2018-10-31; Vol. 41, No. 8; full text *
Edge detection with cross-layer feature fusion based on RCF; SONG Jie; Journal of Computer Applications; 2020-07-10; full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112150493B (en) | Semantic guidance-based screen area detection method in natural scene | |
CN113065558B (en) | Lightweight small target detection method combined with attention mechanism | |
CN111931684B (en) | Weak and small target detection method based on video satellite data identification features | |
CN110059586B (en) | Iris positioning and segmenting system based on cavity residual error attention structure | |
CN111368846B (en) | Road ponding identification method based on boundary semantic segmentation | |
CN111738055B (en) | Multi-category text detection system and bill form detection method based on same | |
CN111612008A (en) | Image segmentation method based on convolution network | |
CN113591968A (en) | Infrared weak and small target detection method based on asymmetric attention feature fusion | |
CN113592911B (en) | Apparent enhanced depth target tracking method | |
CN113052170A (en) | Small target license plate recognition method under unconstrained scene | |
CN112329784A (en) | Correlation filtering tracking method based on space-time perception and multimodal response | |
CN113205103A (en) | Lightweight tattoo detection method | |
CN113763417B (en) | Target tracking method based on twin network and residual error structure | |
CN111582057B (en) | Face verification method based on local receptive field | |
CN113627481A (en) | Multi-model combined unmanned aerial vehicle garbage classification method for smart gardens | |
CN110910497B (en) | Method and system for realizing augmented reality map | |
CN108765384B (en) | Significance detection method for joint manifold sequencing and improved convex hull | |
CN116758340A (en) | Small target detection method based on super-resolution feature pyramid and attention mechanism | |
Li et al. | A new algorithm of vehicle license plate location based on convolutional neural network | |
CN116091946A (en) | Yolov 5-based unmanned aerial vehicle aerial image target detection method | |
CN113052311B (en) | Feature extraction network with layer jump structure and method for generating features and descriptors | |
CN115035429A (en) | Aerial photography target detection method based on composite backbone network and multiple measuring heads | |
Lin et al. | Ml-capsnet meets vb-di-d: A novel distortion-tolerant baseline for perturbed object recognition | |
CN114202694A (en) | Small sample remote sensing scene image classification method based on manifold mixed interpolation and contrast learning | |
Zhou et al. | A lightweight object detection framework for underwater imagery with joint image restoration and color transformation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||