CN115424026A - End-to-end foggy day image multi-target detection model based on knowledge embedding - Google Patents
- Publication number
- CN115424026A (application CN202210960202.5A)
- Authority
- CN
- China
- Prior art keywords
- image
- target detection
- sub
- network
- knowledge
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/90—Determination of colour characteristics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20016—Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention provides an end-to-end multi-target detection model for foggy-day images based on knowledge embedding, relating to the technical field of pattern recognition. The model comprises an image defogging sub-network, a target detection sub-network, and knowledge-embedding-based semantic association feature learning. The image defogging sub-network comprises a common module and a feature recovery module; the feature recovery module comprises an up-sampling sub-module, a multi-scale mapping sub-module, and an image generation sub-module. The method achieves high target detection accuracy even when multiple heterogeneous targets coexist in a scene and only a limited training set is available, benefits the understanding and application of foggy image scenes, and offers both high detection quality and high efficiency.
Description
Technical Field
The invention relates to the technical field of pattern recognition, and in particular to an end-to-end multi-target detection model for foggy-day images based on knowledge embedding.
Background
In complex foggy imaging scenes, the quality of images obtained by an image acquisition system is severely degraded, which hurts the performance of target detection algorithms and causes missed and false detections, in turn impairing the environment-perception capability of aerial, ground, or maritime unmanned platforms. Target detection methods for foggy scenes fall into two categories. The first is the two-stage approach: a non-associated "defog-then-detect" pipeline that first sharpens the foggy image with an image enhancement and restoration method and then runs a target detector. The first-stage defogging may introduce artifacts and color distortion, so it cannot improve detection accuracy for all images, and it is generally unsuitable for scenarios with strict real-time requirements. The second is the end-to-end approach: a defogging network and a target detection network are jointly optimized so that defogging and detection are performed simultaneously; shared feature extraction reduces the influence of image degradation and improves detection accuracy on foggy images.
End-to-end foggy-day target detection models are represented mainly by the KODNet and DONet networks. KODNet designs anchor-box aspect ratios inside a deep detection model to guide detection in real foggy scenes; DONet cascades a defogging model with a target detection model and learns them jointly, which effectively alleviates the difficulty and low accuracy of detection in fog while avoiding the artifacts, detail loss, and color distortion caused by standalone enhancement and restoration. However, datasets for end-to-end foggy-day detection are hard to collect and label; especially when many heterogeneous targets coexist in one scene, the annotations contain missed and wrong labels, which degrades detection performance. How to reach high detection accuracy with only a limited training set is therefore an urgent problem for multi-target detection in foggy images.
Disclosure of Invention
Technical problem to be solved
To address the deficiencies of the prior art, the invention provides a knowledge-embedding-based end-to-end multi-target detection model for foggy images, solving the problem of degraded target detection performance in foggy scenes.
(II) technical scheme
In order to realize the above purpose, the invention adopts the following technical scheme: a knowledge-embedding-based end-to-end multi-target detection model for foggy images comprises an image defogging sub-network, a target detection sub-network, and knowledge-embedding-based semantic association feature learning. The image defogging sub-network comprises a common module and a feature recovery module; the feature recovery module comprises an up-sampling sub-module, a multi-scale mapping sub-module, and an image generation sub-module.
Preferably, the fog-image defogging sub-network generates features that are shared with the detection sub-network during joint training to improve target detection accuracy under foggy conditions; the defogging sub-network is built on the atmospheric light-scattering model.
Preferably, the common module extracts features of the input image that carry important information for simultaneously learning visual enhancement, object recognition, and localization.
Preferably, the up-sampling sub-module accounts for the fact that, in the feature recovery sub-network, the output image must have the same size as the input image while the feature map extracted by the common module is one quarter the size of the input image.
Preferably, after the resolution of the feature f_C2 is increased by the up-sampling sub-module, the obtained feature map is transmitted to the multi-scale mapping sub-module for multi-scale feature extraction.
Preferably, the image generation sub-module is the last stage of the image restoration sub-network, and the scene restoration is completed through the image generation sub-module.
Preferably, the foggy-image target detection sub-network adopts RetinaNet as its backbone network. RetinaNet uses a feature pyramid network to provide a top-down pathway, and lateral connections allow higher-resolution layers to be built from semantically rich layers, greatly improving the detection accuracy of small targets in foggy scenes.
Preferably, the knowledge-embedding-based semantic association feature learning is a knowledge-guided method: prior knowledge of category-attribute association and category-category association is structurally expressed and embedded into a deep network model, so that features covering more comprehensive discriminative information are learned.
The working principle is as follows: first, the common module of the end-to-end multi-target detection model extracts features of the input image carrying important information for learning visual enhancement, target recognition, and localization, and the feature recovery module then repairs the fog-degraded image. Second, a RetinaNet structure extracts a feature map of the whole image, and a feature pyramid network (FPN) attached to the top of the RetinaNet structure builds multi-scale features, solving the problem of constructing features for targets at different scales. Finally, the knowledge-embedded feature learning and expression method addresses the few-sample situation, where the labeled samples cover only a few appearance patterns and the learned model has poor expressive and generalization ability, and it reduces the influence of missed and wrong labels in the dataset on multi-class target detection in foggy scenes.
(III) advantageous effects
The invention provides an end-to-end foggy day image multi-target detection model based on knowledge embedding. The method has the following beneficial effects:
the invention provides a knowledge-embedding-based end-to-end multi-target detection model for foggy images in which the defogging network and the detection network are jointly optimized: the image recovery result is reconstructed under the guidance of target detection information, and the detection network learns the structural detail and color features of targets recovered by defogging, thereby improving target detection accuracy.
Drawings
FIG. 1 is a flow chart of a network structure for detecting foggy day image targets according to the present invention;
FIG. 2 is a flow diagram of a knowledge-guided semantic feature learning framework of the present invention;
FIG. 3 is a flow chart of the target detection result of the knowledge-embedding-based end-to-end foggy day image multi-target detection model.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment is as follows:
as shown in fig. 1 to 3, an embodiment of the present invention provides a knowledge-embedding-based end-to-end multi-target detection model for foggy images, which includes an image defogging sub-network, a target detection sub-network, and knowledge-embedding-based semantic association feature learning. The image defogging sub-network includes a common module and a feature recovery module. The feature recovery module exists because the input-image features extracted by the common module may be degraded by fog, reducing target detection performance; to recover the features output by the common module, the feature recovery sub-network employs an FR module. The feature recovery module includes an up-sampling sub-module, a multi-scale mapping sub-module, and an image generation sub-module.
The fog-image defogging sub-network generates the feature f_C2, which is shared with the detection sub-network during joint training to improve target detection accuracy in fog; the sub-network is built on the atmospheric light-scattering model.
Image defogging is realized through the atmospheric scattering model, whose formula is:
I(x) = J(x)t(x) + α(1 − t(x)) (formula 1)
where I(x) is the observed foggy image, J(x) is the clear scene radiance, t(x) is the transmittance, and α is the global atmospheric light intensity value. To facilitate the estimation of the transmittance t(x) and the global atmospheric light intensity value α, the formula can be rewritten as:
J(x) = G(x)I(x) − G(x) + 1 (formula 2)
Here, the image defogging sub-network folds the transmittance t(x) and the atmospheric light intensity value α into the single variable G(x), which is estimated by the network model during visual enhancement.
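The equivalence of the two forms above can be checked numerically. The sketch below (all values hypothetical) inverts the scattering model directly and via the unified variable, assuming G(x) = [(I(x) − α)/t(x) + α − 1]/(I(x) − 1), the closed form that makes formula 2 algebraically identical to the direct inversion:

```python
import numpy as np

def dehaze_direct(I, t, alpha):
    """Invert the atmospheric scattering model I = J*t + alpha*(1 - t)."""
    return (I - alpha) / t + alpha

def dehaze_unified(I, t, alpha):
    """Same recovery via the unified variable G(x): J = G*I - G + 1."""
    G = ((I - alpha) / t + alpha - 1.0) / (I - 1.0)
    return G * I - G + 1.0

I = np.array([0.6, 0.4, 0.7])   # hazy pixel intensities (hypothetical)
t = np.array([0.5, 0.8, 0.3])   # per-pixel transmittance
alpha = 0.9                     # global atmospheric light intensity

J1 = dehaze_direct(I, t, alpha)
J2 = dehaze_unified(I, t, alpha)
assert np.allclose(J1, J2)      # both forms recover the same clear image
```

In the network itself G(x) is of course predicted by convolutional layers rather than computed in closed form; the point is only that estimating the single quantity G(x) suffices to recover J(x).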
The common module extracts features of the input image carrying important information for simultaneously learning visual enhancement, target identification, and localization. It is not designed independently but shares residual modules with the target detection sub-network to keep the structure simple. Specifically, 16 residual modules in the detection sub-network are divided into four residual stages (denoted Conv_2, Conv_3, Conv_4, and Conv_5, as shown in fig. 1). Features obtained from shallow layers contain more spatial information, which benefits visual enhancement, whereas the spatial information of deeper layers is lost during pooling; therefore the first 10 convolutional layers of the detection sub-network form the common module, with Conv_2 as its output. The feature map from the common module is passed both to the feature recovery module for visual enhancement and to Conv_3 for target detection.
In the feature recovery sub-network, the output image must have the same size as the input image, but the feature map extracted by the common module is one quarter of the input size; the sub-network therefore uses the up-sampling sub-module to match the input resolution. In deep-learning-based defogging research, bilinear interpolation has been applied to convolutional-neural-network defogging, where pooled feature maps are bilinearly up-sampled to produce the defogged output. Accordingly, the up-sampling sub-module first uses a convolutional layer to reduce the feature dimension (the number of channels is reduced by a factor of 7), and then uses bilinear interpolation to enlarge the feature map to the size of the input image.
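As a rough illustration of the up-sampling step, the following pure-NumPy sketch implements bilinear interpolation on a single-channel feature map; the channel-reducing convolution that precedes it in the real sub-module is omitted here:

```python
import numpy as np

def bilinear_upsample(f, scale):
    """Bilinearly interpolate a (H, W) feature map by an integer scale factor."""
    H, W = f.shape
    out_h, out_w = H * scale, W * scale
    # sample positions in source coordinates (pixel-center convention)
    ys = np.clip((np.arange(out_h) + 0.5) / scale - 0.5, 0, H - 1)
    xs = np.clip((np.arange(out_w) + 0.5) / scale - 0.5, 0, W - 1)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = f[y0][:, x0] * (1 - wx) + f[y0][:, x1] * wx
    bot = f[y1][:, x0] * (1 - wx) + f[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

f = np.ones((2, 2))
up = bilinear_upsample(f, 2)
assert up.shape == (4, 4)
assert np.allclose(up, 1.0)   # constant maps stay constant under interpolation
```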
After the resolution of the feature f_C2 is increased by the up-sampling sub-module, the obtained feature map is transmitted to the multi-scale mapping sub-module for multi-scale feature extraction, which is widely used in image defogging methods and effective for visibility enhancement. The multi-scale sub-module consists of four parallel convolutions — 1×1, 3×3, 5×5, and 7×7 — each with 4 output channels; a final 3×3 convolution estimates G(x), which encodes the transmittance t(x) and the global atmospheric light intensity value α.
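A minimal single-channel sketch of the parallel-convolution idea follows; the actual sub-module uses learned kernels with 4 channels per branch, while simple averaging kernels stand in here:

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same'-padded 2D convolution for a single-channel map."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    H, W = x.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def multiscale_map(x, kernels):
    """Run parallel convolutions of different kernel sizes and stack the maps."""
    return np.stack([conv2d_same(x, k) for k in kernels])

x = np.random.rand(8, 8)
# one averaging kernel per branch size: 1x1, 3x3, 5x5, 7x7
kernels = [np.ones((s, s)) / (s * s) for s in (1, 3, 5, 7)]
feats = multiscale_map(x, kernels)
assert feats.shape == (4, 8, 8)   # four scale-specific maps, same spatial size
```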
The image generation sub-module is the last stage of the image recovery sub-network and completes the scene restoration. It takes G(x) as input and uses an element-wise multiplication layer, a feature extraction layer, and an element-wise addition layer to compute the transformation of formula 2.
To train visibility enhancement, the image recovery sub-network uses the mean-squared-error (MSE) loss, which can be described as:
L_MSE = (1/n) Σ_{i=1}^{n} ||Ŷ_i − Y_i||² (formula 3)
where n is the image patch size, Y_i is the ground-truth recovered image, and Ŷ_i is the estimated recovered image. It is further emphasized that although the recovery sub-network can generate the defogged image directly, its goal is not to generate the input of the detection sub-network, but to use the fog-independent feature f_C2 from the common module to learn the visibility-enhancement task.
The foggy-image target detection sub-network adopts RetinaNet as its backbone network. RetinaNet uses a feature pyramid network (FPN) to provide a top-down pathway, and lateral connections allow higher-resolution layers to be built from semantically rich layers, greatly improving the detection accuracy of small targets in foggy scenes: deep layers contain rich semantic information but, after pooling, lack position information, and the lateral connections to the corresponding shallow layers enrich the position information and improve accuracy.
In order to detect targets efficiently, the detection sub-network follows a specific strategy: first, the RetinaNet structure extracts a feature map of the whole image; then, at the top of the structure, a feature pyramid network (FPN) builds multi-scale features, solving the problem of constructing features for targets at different scales; finally, a simplified fully convolutional network (FCN) attached to the FPN layers performs the multi-target recognition and localization task, completing target detection and bounding-box regression.
In order to train the object classification, the detection network adopts the focal loss, with λ_c as a balancing variable, described as
L_cls(p_c) = −λ_c (1 − p_c)^γ log(p_c) (formula 4)
Here λ_c ∈ [0, 1] applies to the target class +1 and 1 − λ_c to the target class −1; γ is a tunable focusing parameter (γ ≥ 0); and p_c is defined as
p_c = p if y = 1, otherwise p_c = 1 − p (formula 5)
where y ∈ {±1} is the determined reference class and p is the model-estimated probability of the class label y = 1.
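The focal loss of formulas 4 and 5 can be sketched directly; λ_c = 0.25 and γ = 2 are typical values assumed here, not values fixed by the source:

```python
import numpy as np

def focal_loss(p, y, lam=0.25, gamma=2.0):
    """Focal loss: L = -lambda_c * (1 - p_c)^gamma * log(p_c)."""
    p_c = np.where(y == 1, p, 1.0 - p)         # p_c per formula 5
    lam_c = np.where(y == 1, lam, 1.0 - lam)   # class-balancing weight
    return -lam_c * (1.0 - p_c) ** gamma * np.log(p_c)

p = np.array([0.9, 0.2, 0.6])   # estimated probability of class y = +1
y = np.array([1, -1, 1])        # ground-truth classes
losses = focal_loss(p, y)
assert np.all(losses >= 0)
# the (1 - p_c)^gamma factor down-weights easy examples:
# p=0.9 on a positive contributes far less than p=0.6 on a positive
assert losses[0] < losses[2]
```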
To localize targets, the detection network adopts the smooth-L1 loss between the predicted box k and the reference box g. The matched pairs between anchor boxes l and reference boxes g are denoted (l_m, g_m), m = 1, 2, …, q, where q is the number of matched pairs. For each matched anchor box, the reference-box regression target is defined as t_m, and the corresponding prediction as t̂_m, where x, y, ω, and h denote the center coordinates, width, and height of a box. The localization loss is then expressed as:
L_loc = Σ_{m=1}^{q} smooth_L1(t̂_m − t_m) (formula 6)
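A sketch of the localization loss follows, assuming a RetinaNet-style box encoding (offsets normalized by the anchor size, log-scale width and height); the source does not spell out the exact regression parameterization, so this encoding is an assumption:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber-style) loss used for bounding-box regression."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)

def box_to_regression(box, anchor):
    """Encode a box (x, y, w, h) relative to an anchor box."""
    ax, ay, aw, ah = anchor
    x, y, w, h = box
    return np.array([(x - ax) / aw, (y - ay) / ah,
                     np.log(w / aw), np.log(h / ah)])

anchor = (10.0, 10.0, 4.0, 4.0)   # hypothetical anchor box
gt     = (11.0, 9.0, 5.0, 3.0)    # matched reference (ground-truth) box
pred   = (11.5, 9.2, 4.8, 3.1)    # predicted box

loss = smooth_l1(box_to_regression(pred, anchor),
                 box_to_regression(gt, anchor)).sum()
assert loss >= 0.0
# a perfect prediction incurs zero loss
assert smooth_l1(np.zeros(4), np.zeros(4)).sum() == 0.0
```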
the semantic association feature learning based on knowledge embedding is a semantic association feature learning method adopting knowledge guidance, the prior knowledge of category-attribute association and category association is expressed in a structured mode, a deep network model is embedded, the feature covering more comprehensive judgment information is learned, in a foggy image target detection model, a semantic association feature learning model is connected in a main network RetinaNet, firstly, a knowledge graph of category-attribute association is built for each type of target, the attribute knowledge feature of the image is learned by using a graph propagation network, then, an attention mechanism of semantic association is introduced, the feature of attribute association for each type of target is guided and learned by using the attribute knowledge feature, the coexistence probability of different types of targets of the image is learned based on the attribute association feature, a K knowledge graph of category association is built according to the coexistence probability, and the semantic feature of context association is learned through graph propagation and interaction network.
Assume a foggy-day image scene has C object classes and K object attributes. For each class c, construct a category-attribute knowledge graph G_c = {V_c, A_c}, where V_c = {v_{c,0}, v_{c,1}, v_{c,2}, …, v_{c,K}} is the node set, v_{c,0} is the class-c node and v_{c,k} is the node of attribute k; A_c is the node association matrix, whose entry a_{c,i,j} gives the association probability of nodes i and j. Then construct a category-category knowledge graph G = {V, A}, V = {v_1, v_2, …, v_C}, where v_c is the class-c node and A is the node association matrix whose entry a_{ij} gives the coexistence probability of classes i and j.
Given a foggy image, the target detection sub-network first extracts a multi-scale global feature f. For the multiple target classes, a GloVe model extracts embeddings of class c and its K attributes, which initialize the corresponding category and attribute nodes of graph G_c. A graph convolutional network (GCN) is then introduced to explore information propagation and interaction among the nodes and to update the node features.
The adjacency matrix is initialized with the prior information A_c and then jointly optimized during training to learn the category-attribute relations. After L_c graph-convolution operations that deeply exchange and explore information among the graph nodes, H_c = {h_{c,0}, h_{c,1}, …, h_{c,K}} is obtained; the node features are concatenated and mapped to the attribute-knowledge expression of category c.
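One graph-convolution propagation step can be sketched as follows. The symmetric normalization D^{-1/2}(A + I)D^{-1/2} is the standard GCN propagation rule and is an assumption here, since the source does not specify the exact update:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN step: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)                    # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

K = 4                      # number of attribute nodes (plus one category node)
d_in, d_out = 8, 8
rng = np.random.default_rng(0)
H = rng.normal(size=(K + 1, d_in))                 # category + attribute nodes
A = rng.random((K + 1, K + 1)); A = (A + A.T) / 2  # symmetric association matrix
W = rng.normal(size=(d_in, d_out))                 # learned weight (hypothetical)

H_new = gcn_layer(H, A, W)
assert H_new.shape == (K + 1, d_out)   # node features updated, same node count
```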
A knowledge-guided attention mechanism is then introduced: the attribute-knowledge expression x_c of each category c guides the learning of semantically attribute-associated features. Specifically, for each position (w, h) of the image feature f, the position feature is first fused with the corresponding knowledge expression to learn an importance factor for that position. Repeating this operation for every position yields an importance factor per position, which is normalized with a softmax function to obtain the final normalized importance factors. Finally, a weighted-average pooling operation produces the semantic attribute-associated feature of category c.
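The knowledge-guided attention can be sketched as below. The projection W_a and the dot-product fusion are assumptions, since the source only states that position features are fused with the knowledge expression to produce importance factors:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attribute_attention(f, x_c, W_a):
    """Score each spatial position of f against the category's
    attribute-knowledge vector, softmax-normalize, then pool."""
    Hh, Ww, D = f.shape
    pos = f.reshape(-1, D)                     # (H*W, D) position features
    scores = (pos * (x_c @ W_a)).sum(axis=1)   # one importance factor per position
    a = softmax(scores)                        # normalized importance factors
    return (a[:, None] * pos).sum(axis=0)      # weighted-average pooled feature

rng = np.random.default_rng(1)
f = rng.normal(size=(4, 4, 16))    # toy multi-scale global feature map
x_c = rng.normal(size=16)          # attribute-knowledge expression of class c
W_a = rng.normal(size=(16, 16))    # learned projection (hypothetical)

f_c = attribute_attention(f, x_c, W_a)
assert f_c.shape == (16,)   # one attribute-associated feature vector per class
```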
Performing this operation for all categories yields the features {f_1, f_2, …, f_C} associated with every category and its corresponding attributes, where the feature vector f_c mainly covers the regions associated with the attributes of category c.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (8)
1. An end-to-end foggy-day image multi-target detection model based on knowledge embedding comprises an image defogging sub-network, a target detection sub-network and semantic association feature learning based on knowledge embedding, and is characterized in that: the image defogging subnetwork comprises a public module and a feature recovery module, wherein the feature recovery module comprises an up-sampling submodule, a multi-scale mapping submodule and an image generation submodule.
2. The knowledge-embedding-based end-to-end foggy day image multi-target detection model as claimed in claim 1, wherein: the fog image defogging subnetwork is used for generating characteristics and sharing the characteristics with the detection subnetwork in the joint training process to improve the target detection precision under the fog condition, and the fog image defogging subnetwork is completed on the basis of an atmospheric light intensity scattering model.
3. The knowledge-embedding-based end-to-end foggy day image multi-target detection model as claimed in claim 1, wherein: the common module extracts features in the input image including important feature information for simultaneous learning of visual enhancement, target recognition and localization.
4. The knowledge-embedding-based end-to-end foggy day image multi-target detection model as claimed in claim 1, characterized in that: in the feature recovery sub-network, the output image has the same size as the input image, but the feature map extracted by the common module is one quarter the size of the input image; the up-sampling sub-module restores the resolution accordingly.
5. The knowledge-embedding-based end-to-end foggy day image multi-target detection model as claimed in claim 1, wherein: after the resolution of the feature f_C2 is increased by the up-sampling sub-module, the obtained feature map is transmitted to the multi-scale mapping sub-module for multi-scale feature extraction.
6. The knowledge-embedding-based end-to-end foggy day image multi-target detection model as claimed in claim 1, wherein: the image generation sub-module is the last stage of the image recovery sub-network, and scene restoration is completed through the image generation sub-module.
7. The knowledge-embedding-based end-to-end foggy day image multi-target detection model as claimed in claim 1, wherein: the image target detection sub-network model adopts RetinaNet as a backbone network for the foggy image target detection sub-network model, the RetinaNet provides a top-down line by utilizing a characteristic pyramid network, and the transverse connection enables a network layer with higher resolution to be constructed from an abundant semantic layer, so that the detection precision of small targets in foggy scenes is greatly improved.
8. The knowledge-embedding-based end-to-end foggy day image multi-target detection model as claimed in claim 1, characterized in that: the knowledge-embedding-based semantic association feature learning is a knowledge-guided semantic association feature learning method in which prior knowledge of category-attribute association and category-category association is structurally expressed and embedded into a deep network model, so that features covering more comprehensive discriminative information are learned.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210960202.5A CN115424026A (en) | 2022-08-11 | 2022-08-11 | End-to-end foggy day image multi-target detection model based on knowledge embedding |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210960202.5A CN115424026A (en) | 2022-08-11 | 2022-08-11 | End-to-end foggy day image multi-target detection model based on knowledge embedding |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115424026A true CN115424026A (en) | 2022-12-02 |
Family
ID=84199103
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210960202.5A Pending CN115424026A (en) | 2022-08-11 | 2022-08-11 | End-to-end foggy day image multi-target detection model based on knowledge embedding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115424026A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117253184A (en) * | 2023-08-25 | 2023-12-19 | 燕山大学 | Foggy day image crowd counting method guided by foggy priori frequency domain attention characterization |
CN117253184B (en) * | 2023-08-25 | 2024-05-17 | 燕山大学 | Foggy day image crowd counting method guided by foggy priori frequency domain attention characterization |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109934200B (en) | RGB color remote sensing image cloud detection method and system based on improved M-Net | |
CN109584248B (en) | Infrared target instance segmentation method based on feature fusion and dense connection network | |
WO2022111219A1 (en) | Domain adaptation device operation and maintenance system and method | |
CN110197505B (en) | Remote sensing image binocular stereo matching method based on depth network and semantic information | |
CN113610905B (en) | Deep learning remote sensing image registration method based on sub-image matching and application | |
CN113609896A (en) | Object-level remote sensing change detection method and system based on dual-correlation attention | |
CN112561876A (en) | Image-based pond and reservoir water quality detection method and system | |
CN115527123B (en) | Land cover remote sensing monitoring method based on multisource feature fusion | |
Zhou et al. | FSAD-Net: Feedback spatial attention dehazing network | |
CN116311254B (en) | Image target detection method, system and equipment under severe weather condition | |
CN113505726A (en) | Photovoltaic group string identification and positioning method in map | |
Su et al. | DLA-Net: Learning dual local attention features for semantic segmentation of large-scale building facade point clouds | |
CN114283137A (en) | Photovoltaic module hot spot defect detection method based on multi-scale characteristic diagram inference network | |
CN114972748A (en) | Infrared semantic segmentation method capable of explaining edge attention and gray level quantization network | |
CN114708313A (en) | Optical and SAR image registration method based on double-branch neural network | |
CN115410081A (en) | Multi-scale aggregated cloud and cloud shadow identification method, system, equipment and storage medium | |
Sun et al. | IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes | |
Babu et al. | An efficient image dahazing using Googlenet based convolution neural networks | |
CN114463624A (en) | Method and device for detecting illegal buildings applied to city management supervision | |
Li et al. | Feature guide network with context aggregation pyramid for remote sensing image segmentation | |
Goncalves et al. | Guidednet: Single image dehazing using an end-to-end convolutional neural network | |
CN111079807A (en) | Ground object classification method and device | |
CN113506230B (en) | Photovoltaic power station aerial image dodging processing method based on machine vision | |
CN115661451A (en) | Deep learning single-frame infrared small target high-resolution segmentation method | |
CN115424026A (en) | End-to-end foggy day image multi-target detection model based on knowledge embedding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||