CN112132156B - Image saliency target detection method and system based on multi-depth feature fusion - Google Patents

Image saliency target detection method and system based on multi-depth feature fusion

Info

Publication number
CN112132156B
Authority
CN
China
Prior art keywords
image
feature
convolution
information
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010832414.6A
Other languages
Chinese (zh)
Other versions
CN112132156A (en)
Inventor
陈振学
闫星合
刘成云
孙露娜
段树超
朱凯
陆梦旭
李明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202010832414.6A priority Critical patent/CN112132156B/en
Publication of CN112132156A publication Critical patent/CN112132156A/en
Application granted granted Critical
Publication of CN112132156B publication Critical patent/CN112132156B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The application discloses a method and a system for detecting image saliency targets by fusing multiple depth features. The method comprises the following steps: acquiring image information to be detected in a set scene; inputting the image information into a trained multi-depth feature fusion neural network model, wherein the model performs feature extraction by convolution in the encoding stage and restores the information of the input image in the decoding stage by combining convolution with bilinear-interpolation up-sampling, so as to output a feature map carrying saliency information; learning feature maps at different levels with a multi-level network and fusing the feature maps of the different levels; and outputting the final saliency target detection result. The application uses the multi-depth feature fusion neural network to perform saliency target detection on images of the scene, which guarantees detection accuracy while accelerating the subsequent processing flow; a contour detection branch is further added, and the contour features are used to refine the boundary details of the target to be detected.

Description

Image saliency target detection method and system based on multi-depth feature fusion
Technical Field
The application relates to the technical field of image saliency target detection, and in particular to a method and a system for detecting image saliency targets by fusing multiple depth features.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Salient object detection uses a computer to simulate the human visual attention mechanism and separate from the background the person or object in an image that most strongly attracts human visual attention. An image is composed of many pixels whose brightness, color and other attributes differ, and whose corresponding saliency values therefore also differ. Unlike conventional object detection and semantic segmentation tasks, salient object detection focuses only on the most visually noticeable part and does not classify it, and the detection result is generally at the pixel level; salient object detection is therefore often used as a preliminary step of other image processing methods to improve the accuracy of the subsequent processing flow.
At present, saliency target detection is applied in fields such as medical image segmentation, intelligent photography, image retrieval, virtual backgrounds and intelligent unmanned systems. Saliency target detection is a basic task in intelligent unmanned systems and lays the foundation for subsequent target recognition and decision making. In recent years the artificial intelligence industry has developed rapidly, unmanned operation is pursued in both intelligent living and industrial settings, and intelligent unmanned systems have become a research hotspot.
Taking an unmanned driving system as an example, autonomous driving is a complex computing task: the visual attention mechanism of a driver must be simulated to perceive changing scenes quickly and accurately, and the back-end computer must perceive the entire surrounding environment and different scenes well. Conventional target detection can only detect specific objects, its result is an imprecise bounding box, and it cannot respond accurately and quickly to unexpected scenes, so salient target detection is a key technology in unmanned driving. The vehicle-mounted camera or laser radar provides real-time road images, a binarized saliency feature map is output by the saliency target detection algorithm, and the important parts of the scene are then segmented to obtain images with semantic information, which control the advance and obstacle avoidance of the vehicle, achieving fast and accurate operation while saving computing resources.
Early saliency detection relied on features such as color, brightness, orientation and center-surround contrast, and could only detect local regions. Methods such as Markov chains and frequency-domain tuning were later developed to bring global features into the detection range from a mathematical perspective, but they still have difficulty reaching high accuracy, whereas unmanned systems require very high precision and extremely fast response to ensure safety and real-time operation. Meanwhile, problems such as very small targets, complex backgrounds and unclear target contours are encountered during unmanned driving, which affect the detection result and the accuracy of subsequent processing operations.
Disclosure of Invention
In order to solve the above problems, the application provides a multi-depth feature fusion image saliency target detection method and system, which use a multi-depth feature fusion neural network to perform saliency detection on images of an application scene and improve the speed of subsequent processing steps such as segmentation while guaranteeing detection accuracy.
In some embodiments, the following technical scheme is adopted:
A multi-depth feature fusion image saliency target detection method comprises the following steps:
acquiring image information to be detected in a set scene;
inputting the image information into a trained multi-depth feature fusion neural network model;
the multi-depth feature fusion neural network model adopts convolution to perform feature extraction in the encoding stage, and combines the convolution and a bilinear interpolation up-sampling method to restore the information of an input image in the decoding stage, so as to output a feature map with significance information;
adopting a multi-level network to learn feature maps at different levels, and fusing the feature maps of different levels;
and outputting a final saliency target detection result.
Further, a contour detection branch is added; the contour feature information of the salient target is extracted through the multi-depth feature fusion neural network model, and the contour features are used to refine the boundary details of the target to be detected; the salient feature information of the image to be detected is then fused with the salient target contour feature information.
In other embodiments, the following technical solutions are adopted:
an image saliency target detection system for multi-depth feature fusion, comprising:
means for acquiring image information to be detected in a set scene;
means for inputting the image information into a trained multi-depth feature fusion neural network model;
means for performing feature extraction by convolution in the encoding stage of the multi-depth feature fusion neural network model, restoring the information of the input image in the decoding stage by an up-sampling method combining convolution and bilinear interpolation, and outputting a feature map with saliency information;
means for learning feature maps of different levels using a multi-level network, the feature maps of different levels being fused;
means for outputting a final salient object detection result.
In other embodiments, the following technical solutions are adopted:
a terminal device comprising a processor and a computer-readable storage medium, the processor configured to implement instructions; the computer readable storage medium is for storing a plurality of instructions adapted to be loaded by a processor and to perform the image saliency object detection method of multi-depth feature fusion described above.
In other embodiments, the following technical solutions are adopted:
a computer readable storage medium having stored therein a plurality of instructions adapted to be loaded by a processor of a terminal device and to perform the above-described image saliency object detection method of multi-depth feature fusion.
Compared with the prior art, the application has the beneficial effects that:
The application uses the multi-depth feature fusion neural network to perform saliency target detection on images of the scene, guaranteeing detection accuracy while accelerating the subsequent processing flow;
The encoder-decoder structure in the multi-depth feature fusion neural network satisfies the accuracy requirement of saliency detection, and the multi-level, multi-task, multi-channel feature map fusion makes full use of both shallow and deep information;
The application adds contour feature detection, which can refine the detail information of salient target edges and obtain detection results with higher accuracy and clearer contours, and which clearly helps other processing tasks such as subsequent scene segmentation.
The saliency target detection algorithm provided by the application can effectively assist intelligent unmanned systems, such as unmanned driving systems, meets the requirements of accuracy and real-time performance at the same time, and can overcome the problems of very small targets, complex backgrounds, unclear target contours, large memory consumption and long training time.
Drawings
FIG. 1 is a flow chart of a salient object detection method in an embodiment of the present application;
FIG. 2 is a schematic diagram of an image preprocessing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a multi-depth feature fusion neural network framework in an embodiment of the application;
fig. 4 is a schematic diagram of a re-weighting module for network important components in an embodiment of the present application.
Detailed Description
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
The saliency target detection method can be applied to the fields of medical image segmentation, intelligent photography, image retrieval, virtual background, intelligent unmanned systems and the like.
In this embodiment, the multi-depth feature fusion neural network refers to a saliency detection neural network fused with multi-level, multi-task, multi-channel depth features.
In this embodiment, the method of the present application is described in detail by taking an unmanned scene as an example:
a method for detecting an image saliency target by multi-depth feature fusion, referring to fig. 1, includes:
acquiring image information to be detected in a set scene;
inputting the image information into a trained multi-depth feature fusion neural network model;
the multi-depth feature fusion neural network model adopts convolution to perform feature extraction in the encoding stage, and combines convolution with a bilinear interpolation up-sampling method to restore the information of the input image in the decoding stage, so as to output a feature map with saliency information; a multi-level network is adopted to learn feature maps at different levels, and the feature maps of different levels are fused;
and outputting a final saliency target detection result.
Adding a contour detection branch to assist a significance detection task, and refining boundary details of a target to be detected by using contour features;
introducing a self-adaptive channel re-weighting branch to re-calibrate the characteristic channel weight of the convolution layer;
the neural network is optimized by a cross entropy loss function.
Specifically, S1: collecting images of real-world driving, performing binarized saliency labeling on the images, and determining the labels, thereby forming a training set and a test set.
The specific process of step S1 is as follows:
s1.1: the images can be obtained by video shooting and separation, the video is extracted every 10 frames, the images are obtained, and the images are input into a neural network.
S1.2: and labeling each pixel point, wherein one class corresponds to one number, and the number of the class is 2 to distinguish the foreground from the background, so that a gray level image is obtained and is used as a true value of an output image.
S2: referring to fig. 2, on the basis of the training set, the input and the labeling images are randomly scaled, cut, boundary filled and turned, so that the training set is expanded, and the precision is improved more along with the expansion of the training set.
Because each image contains a large number of pixels, labeling every pixel is time-consuming and labor-intensive and prone to omissions or mislabeling, yet a large number of images helps improve accuracy; the application therefore preprocesses the images so that a better effect can be achieved with fewer images.
The specific process of step S2 is as follows:
s2.1: the input and labeled truth images are randomly scaled down or up in each training.
S2.2: if the image is bigger than the original image, the image is cut from random points, if the image is smaller than the original image, the boundary is filled, and finally the image is turned over randomly horizontally or vertically.
S2.3: the images of each training are different, and the training set is expanded.
S3: and establishing a background model by calculating the mean value and the variance of each pixel point in the image, normalizing the pixel points and extracting the characteristics.
The specific process of step S3 is as follows:
s3.1: and calculating the average value and variance of all the image pixel points to obtain a background model.
S3.2: the average value is subtracted from the image and divided by the variance to obtain data meeting normal distribution, the average brightness value of the image is removed, and the calculation accuracy of the network can be improved through data normalization.
S4: the training set of the preprocessed scene images is input into a convolution network shown in fig. 3 for training, and in the training process, image features in different aspects are learned by using a multi-level, multi-task and multi-channel structure, and a plurality of feature images are fused, so that the speed is kept, and meanwhile, the precision is improved.
Here, multi-task refers to a detection mode in which the saliency target detection task is primary and the saliency contour detection task is auxiliary; multi-level refers to fusing and exploiting the feature maps produced by different convolution layers of the network, so as to combine multi-scale features; and the channels are re-weighted according to how much each feature channel contributes to the saliency stimulus, so that channels with large contributions receive larger weights in the feature computation. Channels refer to the channels of an image, such as the red, green and blue channels of an RGB image, each of which contributes differently to the saliency stimulus. The application introduces an adaptive channel re-weighting branch to recalibrate the feature channel weights of the convolution layers; the contributions of the feature channels are re-weighted and the weights are continuously optimized, improving the accuracy of target detection.
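A minimal PyTorch sketch of such an adaptive channel re-weighting convolution unit follows; the squeeze-and-excitation-style design (global average pooling followed by two 1×1 convolutions and a sigmoid gate) is an assumption, and the exact module of fig. 4 may differ in its details.

```python
# Sketch of a re-weighted convolution unit: a side branch learns one weight per
# feature channel and rescales the convolution output by it.
import torch
import torch.nn as nn

class ReweightedConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, reduction=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.reweight = nn.Sequential(            # adaptive channel-weight branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // reduction, out_ch, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        feat = self.conv(x)
        return feat * self.reweight(feat)          # per-channel re-weighting
```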
The specific process of step S4 is:
s4.1: the encoding section samples the image to 1/2 of the original image 2048×1024 by a convolution layer with a step size of 2 and a convolution kernel of 3×3, thereby reducing the burden of computation. Two convolution filters with a step size of 1 and a kernel of 3 x 3 do not change the image size, but can capture shallow features. The size of the feature map obtained after these 3 convolution operations is 1024×512×32 pixels.
S4.2: four convolutions are performed on the input image by using the re-weighted convolution structure in fig. 4, w, h and c in fig. 4 represent the width, the height and the channel number of the feature map respectively, so as to obtain four groups of saliency feature maps with different scales.
S4.3: the high-order feature maps are fused according to the re-weighted fusion unit in fig. 4, and the up-sampling operation in 4.4 is performed in the fusion process.
S4.4: the image is up-sampled twice by bilinear interpolation. The feature size obtained was 512×256×the number of categories. The pixels of the four pixel points (i, j), (i, j+1), (i+1, j), (i+1, j+1) are known, and the pixels of the (i+u, j+v) point are obtained by a bilinear difference method:
f(i+u,j+v)=(1-u)*(1-v)*f(i,j)+(1-u)*v*f(i,j+1)+u*(1-v)*f(i+1,j)+u*v*f(i+1,j+1)
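The following is a direct NumPy transcription of this formula; in a framework such as PyTorch, the whole-map ×2 upsampling of S4.4 is usually done with a single interpolate call, as noted in the trailing comment.

```python
# Bilinear interpolation of a single point: the value at (i+u, j+v) is a
# weighted average of the four surrounding integer pixels, with weights given
# by the fractional offsets u and v.
import numpy as np

def bilinear_sample(f, i, j, u, v):
    """f: 2-D array; (i, j): integer pixel indices; (u, v): offsets in [0, 1)."""
    return ((1 - u) * (1 - v) * f[i, j]
            + (1 - u) * v * f[i, j + 1]
            + u * (1 - v) * f[i + 1, j]
            + u * v * f[i + 1, j + 1])

# For a whole feature map, e.g.:
# torch.nn.functional.interpolate(x, scale_factor=2, mode="bilinear")
```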
s4.5: and merging the upsampled feature map with the shallow features extracted from the encoder to form a multi-level feature map. The number of channels increases after fusion, so that the step length is 1 again, the convolution kernel is convolution of 1×1, and the number of channels is maintained as the category number. The feature size obtained is still 512×256×class number.
S4.6: finally, up-sampling a plurality of groups of images with different scales by corresponding times through bilinear interpolation, so that the obtained predicted output image has the same size as the original image; the feature size is 2048×1024×class number.
S4.7: each convolution network in the multi-depth feature fusion neural network optimizes the network through cross entropy loss, and the formula of the cross entropy function is as follows:
loss(x, class) = weight[class] * (-x[class] + log(Σ_j exp(x[j])))
where x denotes the vector of predicted outputs of a pixel, class denotes the true class of the pixel (foreground or background), weight[class] denotes the weighting coefficient applied to that class, x[class] denotes the predicted output of the pixel for its true class, and x[j] denotes the predicted output of the pixel for class j.
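This per-pixel weighted cross entropy corresponds to the standard weighted cross-entropy loss of deep learning frameworks applied to pixel-wise logits; a minimal PyTorch sketch follows, in which the class weights and tensor sizes are placeholder assumptions.

```python
# Weighted cross entropy over pixel-wise logits, matching the formula above.
import torch
import torch.nn as nn

num_classes = 2                                            # foreground / background
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 1.0]))  # assumed weights

logits = torch.randn(1, num_classes, 512, 1024)            # network output (N, C, H, W)
target = torch.randint(0, num_classes, (1, 512, 1024))     # ground-truth labels (N, H, W)
loss = criterion(logits, target)
```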
In addition, a second auxiliary branch for salient target contour detection is introduced; its re-weighting module is the same as that in fig. 4, and the branch only attends to low-level features, forming two groups of multi-scale feature maps.
Feature extraction and fusion are then performed in the same way as above to obtain the fused contour feature information.
The fused target contour feature information and the target saliency feature information are finally fused to obtain the final fused prediction map.
Finally, each test set image undergoes the same preprocessing as the training set, but without the random scaling, cropping, boundary padding and flipping, and the detection accuracy is calculated using quantitative indexes such as the P-R curve, the ROC curve and the MAE.
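A minimal sketch of the MAE index mentioned above follows, assuming the predicted saliency map has been scaled to [0, 1] and the ground truth is binary.

```python
# Mean absolute error between a predicted saliency map and the binary ground truth.
import numpy as np

def mae(pred, gt):
    """pred: saliency map in [0, 1]; gt: binary ground truth in {0, 1}."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()
```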
This embodiment solves the problem of salient target detection in intelligent unmanned systems using a multi-depth feature fusion neural network. Images are extracted from actual application scenes (such as road scenes) and randomly scaled, cropped, boundary-padded and flipped to expand the training set; the pixels of the images are normalized so that pixel values lie between 0 and 1, eliminating the influence of other transformation functions on the image transformation; feature maps of different levels are fused through an encoding-decoding structure, features are extracted by convolution in the encoding stage, and the information of the input image is restored in the decoding stage by combining convolution with bilinear interpolation, yielding an image with salient feature information. The encoder-decoder structure satisfies the detection accuracy requirement, the combined multi-level, multi-task, multi-channel feature map fusion fully mines depth information from multiple aspects and further improves accuracy, and the small 1×1 and 3×3 convolution kernels used for feature extraction improve the running speed of the network.
The saliency target detection algorithm provided by this embodiment can effectively assist unmanned driving, meets the requirements of accuracy and real-time performance, and can address the problems of very small targets, complex backgrounds, unclear target contours, large memory consumption and long training time.
Example two
In one or more embodiments, an image saliency target detection system for multi-depth feature fusion is disclosed, comprising:
means for acquiring image information to be detected in a set scene;
means for inputting the image information into a trained multi-depth feature fusion neural network model;
means for performing feature extraction by convolution in the encoding stage of the multi-depth feature fusion neural network model, restoring the information of the input image in the decoding stage by an up-sampling method combining convolution and bilinear interpolation, and outputting a feature map with saliency information;
means for learning feature maps of different levels using a multi-level network, the feature maps of different levels being fused;
means for outputting a final salient object detection result.
It should be noted that, the specific working mode of the above device is implemented by using the method disclosed in the first embodiment, which is not described herein.
Example III
In one or more embodiments, a terminal device is disclosed that includes a server, the server including a memory, a processor, and a computer program stored on the memory and executable on the processor; when executing the program, the processor implements the multi-depth feature fusion image saliency target detection method of embodiment one. For brevity, the description is not repeated here.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software.
The multi-depth feature fusion image saliency target detection method of embodiment one may be executed directly by a hardware processor, or by a combination of hardware in the processor and software modules. The software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media well known in the art. The storage medium is located in the memory; the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method. To avoid repetition, a detailed description is not provided here.
While the foregoing description of the embodiments of the present application has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the application, but rather, it is intended to cover all modifications or variations within the scope of the application as defined by the claims of the present application.

Claims (7)

1. The image saliency target detection method for multi-depth feature fusion is characterized by comprising the following steps of:
acquiring image information to be detected in a set scene;
inputting the image information into a trained multi-depth feature fusion neural network model;
the multi-depth feature fusion neural network model adopts convolution to perform feature extraction in the encoding stage, and combines the convolution and a bilinear interpolation up-sampling method to restore the information of an input image in the decoding stage, so as to output a feature map with significance information;
adopting a multi-level network to learn feature maps at different levels, and fusing the feature maps of different levels;
outputting a final salient target detection result;
the multi-depth feature fusion neural network model adopts convolution to extract features in the encoding stage, and specifically comprises the following steps:
sampling the image to 1/2 of the original image by a convolution layer;
capturing shallow detail features of the image through contour branches of the two convolution layers;
the up-sampling method combining convolution and bilinear interpolation in the decoding stage restores the information of the input image and outputs a feature map with significance information, which is specifically as follows:
performing four convolutions on the input image by using a re-weighted convolution structure to obtain four groups of saliency feature images with different scales;
upsampling the image by a factor of two through bilinear interpolation;
the feature map after up-sampling is fused with the corresponding scale feature map obtained in the encoding stage to form a multi-level feature map;
performing convolution operation on the multi-level feature map, and maintaining the number of channels as the number of categories;
up-sampling a plurality of groups of images with different scales by corresponding multiples through bilinear interpolation, so that the obtained predicted output image has the same size as the original image;
the re-weighted convolution structure specifically comprises: a branch of channel weight storage is introduced into the basic convolution unit, and the weight of each channel for significance contribution is adjusted through training.
2. The method for detecting the image saliency target by multi-depth feature fusion as claimed in claim 1, wherein a contour detection branch is added, the saliency target contour feature information is extracted through a multi-depth feature fusion neural network model, and the boundary details of the target to be detected are refined by the contour features; and then fusing the salient feature information of the image to be detected and the salient target contour feature information.
3. The method for detecting an image saliency target of multi-depth feature fusion as claimed in claim 1, wherein the training process for the multi-depth feature fusion neural network model comprises:
different image information under a set scene is obtained, binarization significance labeling is carried out on pixel points in each image, and a label is determined, so that a training set and a testing set are formed;
randomly scaling, cutting, filling boundaries and turning over the training set image to expand the data set;
establishing a background model by calculating the mean value and variance of each pixel point in the image, normalizing the pixel points, and extracting features;
inputting the extracted features into a multi-depth feature fusion neural network model for training;
and verifying the trained neural network model by using the test set image.
4. The method for detecting the image saliency target by multi-depth feature fusion as claimed in claim 3, wherein a background model is built by calculating the mean value and the variance of each pixel point in the image, the pixel points are normalized, and the feature extraction is carried out, and the specific process comprises the following steps:
calculating the average value and variance of all the image pixel points to obtain a background model;
the mean is subtracted from the pixel value of each pixel of the image, and the result is divided by the variance, so as to obtain data following a normal distribution.
5. An image saliency target detection system for multi-depth feature fusion, comprising:
means for acquiring image information to be detected in a set scene;
means for inputting the image information into a trained multi-depth feature fusion neural network model;
means for performing feature extraction by convolution in the encoding stage of the multi-depth feature fusion neural network model, restoring the information of the input image in the decoding stage by an up-sampling method combining convolution and bilinear interpolation, and outputting a feature map with saliency information;
means for learning feature maps of different levels using a multi-level network, the feature maps of different levels being fused;
means for outputting a final salient object detection result;
the multi-depth feature fusion neural network model adopts convolution to extract features in the encoding stage, and specifically comprises the following steps:
sampling the image to 1/2 of the original image by a convolution layer;
capturing shallow detail features of the image through contour branches of the two convolution layers;
the up-sampling method combining convolution and bilinear interpolation in the decoding stage restores the information of the input image and outputs a feature map with significance information, which is specifically as follows:
performing four convolutions on the input image by using a re-weighted convolution structure to obtain four groups of saliency feature images with different scales;
upsampling the image by a factor of two through bilinear interpolation;
the feature map after up-sampling is fused with the corresponding scale feature map obtained in the encoding stage to form a multi-level feature map;
performing convolution operation on the multi-level feature map, and maintaining the number of channels as the number of categories;
up-sampling a plurality of groups of images with different scales by corresponding multiples through bilinear interpolation, so that the obtained predicted output image has the same size as the original image;
the re-weighted convolution structure specifically comprises: a branch of channel weight storage is introduced into the basic convolution unit, and the weight of each channel for significance contribution is adjusted through training.
6. A terminal device comprising a processor and a computer-readable storage medium, the processor configured to implement instructions; a computer readable storage medium for storing a plurality of instructions adapted to be loaded by a processor and to perform the multi-depth feature fused image salient object detection method of any one of claims 1-4.
7. A computer readable storage medium having stored therein a plurality of instructions adapted to be loaded by a processor of a terminal device and to perform the multi-depth feature fused image salient object detection method of any one of claims 1-4.
CN202010832414.6A 2020-08-18 2020-08-18 Image saliency target detection method and system based on multi-depth feature fusion Active CN112132156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010832414.6A CN112132156B (en) 2020-08-18 2020-08-18 Image saliency target detection method and system based on multi-depth feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010832414.6A CN112132156B (en) 2020-08-18 2020-08-18 Image saliency target detection method and system based on multi-depth feature fusion

Publications (2)

Publication Number Publication Date
CN112132156A CN112132156A (en) 2020-12-25
CN112132156B true CN112132156B (en) 2023-08-22

Family

ID=73850349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010832414.6A Active CN112132156B (en) 2020-08-18 2020-08-18 Image saliency target detection method and system based on multi-depth feature fusion

Country Status (1)

Country Link
CN (1) CN112132156B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112837360B (en) * 2021-01-07 2023-08-11 北京百度网讯科技有限公司 Depth information processing method, apparatus, device, storage medium, and program product
CN114414578A (en) * 2021-01-18 2022-04-29 无锡金元启信息技术科技有限公司 Industrial hole wall defect detection system and identification algorithm based on AI
CN112766285B (en) * 2021-01-26 2024-03-19 北京有竹居网络技术有限公司 Image sample generation method and device and electronic equipment
CN113052188A (en) * 2021-03-26 2021-06-29 大连理工大学人工智能大连研究院 Method, system, equipment and storage medium for detecting remote sensing image target
CN112967322B (en) * 2021-04-07 2023-04-18 深圳创维-Rgb电子有限公司 Moving object detection model establishing method and moving object detection method
CN113313129B (en) * 2021-06-22 2024-04-05 中国平安财产保险股份有限公司 Training method, device, equipment and storage medium for disaster damage recognition model
CN113538379B (en) * 2021-07-16 2022-11-22 河南科技学院 Double-stream coding fusion significance detection method based on RGB and gray level images
CN113641845B (en) * 2021-07-16 2022-09-23 广西师范大学 Depth feature contrast weighted image retrieval method based on vector contrast strategy
CN113515660B (en) * 2021-07-16 2022-03-18 广西师范大学 Depth feature contrast weighted image retrieval method based on three-dimensional tensor contrast strategy
CN113567436A (en) * 2021-07-22 2021-10-29 上海交通大学 Saliency target detection device and method based on deep convolutional neural network
CN114332633B (en) * 2022-03-01 2022-06-10 北京化工大学 Radar image target detection and identification method and equipment and storage medium
CN115527027A (en) * 2022-03-04 2022-12-27 西南民族大学 Remote sensing image ground object segmentation method based on multi-feature fusion mechanism
CN115331082B (en) * 2022-10-13 2023-02-03 天津大学 Path generation method of tracking sound source, training method of model and electronic equipment
CN115908982A (en) * 2022-12-01 2023-04-04 北京百度网讯科技有限公司 Image processing method, model training method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816100A (en) * 2019-01-30 2019-05-28 中科人工智能创新技术研究院(青岛)有限公司 A kind of conspicuousness object detecting method and device based on two-way fusion network
CN110598610A (en) * 2019-09-02 2019-12-20 北京航空航天大学 Target significance detection method based on neural selection attention
CN110909594A (en) * 2019-10-12 2020-03-24 杭州电子科技大学 Video significance detection method based on depth fusion
CN111428805A (en) * 2020-04-01 2020-07-17 南开大学 Method and device for detecting salient object, storage medium and electronic equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816100A (en) * 2019-01-30 2019-05-28 中科人工智能创新技术研究院(青岛)有限公司 A kind of conspicuousness object detecting method and device based on two-way fusion network
CN110598610A (en) * 2019-09-02 2019-12-20 北京航空航天大学 Target significance detection method based on neural selection attention
CN110909594A (en) * 2019-10-12 2020-03-24 杭州电子科技大学 Video significance detection method based on depth fusion
CN111428805A (en) * 2020-04-01 2020-07-17 南开大学 Method and device for detecting salient object, storage medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MDFN: Multi-scale deep feature learning network for object detection; Wenchi Ma et al.; Elsevier; full text *

Also Published As

Publication number Publication date
CN112132156A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
CN112132156B (en) Image saliency target detection method and system based on multi-depth feature fusion
CN108647585B (en) Traffic identifier detection method based on multi-scale circulation attention network
CN107274445B (en) Image depth estimation method and system
CN111666921B (en) Vehicle control method, apparatus, computer device, and computer-readable storage medium
US10614574B2 (en) Generating image segmentation data using a multi-branch neural network
CN109753913B (en) Multi-mode video semantic segmentation method with high calculation efficiency
CN112528878A (en) Method and device for detecting lane line, terminal device and readable storage medium
CN111696110B (en) Scene segmentation method and system
CN110163188B (en) Video processing and method, device and equipment for embedding target object in video
CN111915627A (en) Semantic segmentation method, network, device and computer storage medium
CN113284054A (en) Image enhancement method and image enhancement device
CN111369581A (en) Image processing method, device, equipment and storage medium
CN110781744A (en) Small-scale pedestrian detection method based on multi-level feature fusion
CN114359851A (en) Unmanned target detection method, device, equipment and medium
CN113591872A (en) Data processing system, object detection method and device
CN112651979A (en) Lung X-ray image segmentation method, system, computer equipment and storage medium
CN109871792B (en) Pedestrian detection method and device
CN112581379A (en) Image enhancement method and device
CN110807384A (en) Small target detection method and system under low visibility
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN113065551A (en) Method for performing image segmentation using a deep neural network model
CN111325766A (en) Three-dimensional edge detection method and device, storage medium and computer equipment
CN112541394A (en) Black eye and rhinitis identification method, system and computer medium
CN113706562A (en) Image segmentation method, device and system and cell segmentation method
CN113673562A (en) Feature enhancement method, target segmentation method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant