CN112132156A - Multi-depth feature fusion image saliency target detection method and system - Google Patents
Multi-depth feature fusion image saliency target detection method and system
- Publication number
- CN112132156A (application number CN202010832414.6A)
- Authority
- CN
- China
- Prior art keywords
- image
- information
- feature
- convolution
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Multimedia (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a multi-depth feature fusion image saliency target detection method and system. The method comprises the following steps: acquiring image information to be detected in a set scene; inputting the image information into a trained multi-depth feature fusion neural network model; the model extracts features by convolution in an encoding stage, restores the information of the input image in a decoding stage through an up-sampling method combining convolution and bilinear interpolation, and outputs a feature map carrying saliency information; feature maps of different levels are learned by a multi-level network and then fused; and a final salient target detection result is output. The method performs salient target detection on scene images with the multi-depth feature fusion neural network, which guarantees detection precision while increasing the speed of the subsequent processing flow; a contour detection branch is further added, and contour features are used to refine the boundary details of the target to be detected.
Description
Technical Field
The invention relates to the technical field of image saliency target detection, in particular to a method and a system for detecting an image saliency target by multi-depth feature fusion.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Salient object detection uses a computer to simulate the human visual attention mechanism and separate the people or objects that most attract human visual attention from the background. An image is composed of many pixels whose attributes such as brightness and color differ, so their corresponding saliency feature values also differ. Unlike conventional object detection and semantic segmentation tasks, salient object detection only focuses on the part that draws the most visual attention without classifying it, and the detection result is at pixel level; it is therefore often used as a preliminary step of other image processing methods to improve the accuracy of the subsequent processing flow.
At present, salient object detection is applied to fields such as medical image segmentation, intelligent photography, image retrieval, virtual backgrounds and intelligent unmanned systems. Salient target detection is a basic task in an intelligent unmanned system and lays the foundation for subsequent target recognition and decision making. In recent years the artificial intelligence industry has developed rapidly, unmanned operation is pursued in intelligent life and industry, and intelligent unmanned systems have become a research hotspot.
Taking an unmanned driving system as an example, unmanned driving is a complex computing task: the visual attention mechanism of a driver must be simulated in changing scenes for fast and accurate perception, and the back-end computer must perceive the whole surrounding environment and different scenes well. Conventional target detection can only detect specific objects, its results are imprecise bounding boxes, and it cannot respond accurately and quickly to unknown sudden scenes, so salient target detection is a key technology in unmanned driving. A vehicle-mounted camera or laser radar inputs real-time road pictures, a salient target detection algorithm outputs a binary saliency feature map, and scene segmentation is then performed with emphasis to obtain a picture with semantic information, which is used to control the advance and obstacle avoidance of the vehicle; this is fast, accurate and saves computing resources.
Early saliency detection features such as color, brightness, direction and center-surround contrast can only be detected locally. Later, methods such as Markov chains and frequency-domain tuning brought global features into the detection range from a mathematical perspective, but high accuracy is still difficult to achieve. An unmanned driving system needs very high precision and extremely fast response to guarantee safety and real-time performance. Meanwhile, problems such as an overly small target to be detected, a complex background and an unclear target contour are encountered during unmanned driving, which affect the detection result and the precision of subsequent processing operations.
Disclosure of Invention
In order to solve the problems, the invention provides a method and a system for detecting an image saliency target by multi-depth feature fusion.
In some embodiments, the following technical scheme is adopted:
a method for detecting a multi-depth feature fused image saliency target comprises the following steps:
acquiring to-be-detected image information under a set scene;
inputting the image information into a trained multi-depth feature fusion neural network model;
the multi-depth feature fusion neural network model adopts convolution to extract features in an encoding stage, restores information of an input image in a decoding stage by combining an up-sampling method of convolution and bilinear interpolation, and outputs a feature map with significance information;
learning feature maps of different levels by adopting a multi-level network, and fusing the feature maps of the different levels;
and outputting a final significant target detection result.
Further, adding a contour detection branch, extracting the contour characteristic information of the salient target through a multi-depth characteristic fusion neural network model, and refining the boundary details of the target to be detected by using the contour characteristics; and then, fusing the salient feature information of the image to be detected and the salient target contour feature information.
In other embodiments, the following technical solutions are adopted:
a multi-depth feature fused image saliency target detection system comprising:
the device is used for acquiring the information of the image to be detected in a set scene;
means for inputting the image information to a trained multi-depth feature fusion neural network model;
the device is used for extracting the features of the multi-depth feature fusion neural network model by convolution in the encoding stage, restoring the information of the input image by combining an up-sampling method of convolution and bilinear interpolation in the decoding stage and outputting a feature map with significance information;
the device is used for learning the feature maps of different levels by adopting a multi-level network and fusing the feature maps of different levels;
and a means for outputting a final saliency target detection result.
In other embodiments, the following technical solutions are adopted:
a terminal device comprising a processor and a computer-readable storage medium, the processor being configured to implement instructions; the computer-readable storage medium is used for storing a plurality of instructions, and the instructions are adapted to be loaded by the processor to execute the multi-depth feature fusion image saliency target detection method.
In other embodiments, the following technical solutions are adopted:
a computer-readable storage medium, wherein a plurality of instructions are stored, and the instructions are suitable for being loaded by a processor of a terminal device and executing the multi-depth feature fusion image saliency target detection method.
Compared with the prior art, the invention has the beneficial effects that:
according to the method, the multi-depth feature fusion neural network is used to perform salient target detection on images of the scene, which guarantees detection precision and increases the speed of the subsequent processing flow;
the encoder-decoder structure in the multi-depth feature fusion neural network can satisfy the precision requirement of saliency detection, and the multi-level, multi-task and multi-channel feature map fusion makes full use of shallow and deep information;
the method adds contour feature detection, which can optimize the detail information of the salient object's edge and obtain a detection result with higher accuracy and a clearer contour, which undoubtedly helps subsequent processing tasks such as scene segmentation.
The saliency target detection algorithm provided by the invention can effectively assist intelligent unmanned systems such as unmanned driving, meets the requirements for accuracy and real-time performance, and can solve the problems of an overly small target to be detected, a complex background, an unclear target contour, large memory occupation and long training time.
Drawings
FIG. 1 is a flow chart of a salient object detection method in an embodiment of the present invention;
FIG. 2 is a schematic diagram of an image preprocessing method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a multi-depth feature fusion neural network framework in an embodiment of the present invention;
fig. 4 is a schematic diagram of a network important component re-weighting module in an embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Example one
The method for detecting the salient object can be applied to the fields of medical image segmentation, intelligent photography, image retrieval, virtual backgrounds, intelligent unmanned systems and the like.
In this embodiment, the multi-depth feature fusion neural network refers to a significance detection neural network that fuses multi-level, multi-task, and multi-channel depth features.
In this embodiment, the method of the present invention is described in detail by taking an unmanned driving scene as an example:
a method for detecting a salient object of an image by multi-depth feature fusion refers to FIG. 1, and includes:
acquiring to-be-detected image information under a set scene;
inputting the image information into a trained multi-depth feature fusion neural network model;
the multi-depth feature fusion neural network model adopts convolution to extract features in an encoding stage, restores information of an input image in a decoding stage by combining an up-sampling method of convolution and bilinear interpolation, and outputs a feature map with significance information; learning feature maps of different levels by adopting a multi-level network, and fusing the feature maps of the different levels;
and outputting a final significant target detection result.
Adding a contour detection branch to assist a saliency detection task, and refining the boundary details of the target to be detected by using contour features;
introducing an adaptive channel re-weighting branch to re-calibrate the weights of the convolutional layer feature channels;
the neural network is optimized by a cross entropy loss function.
Specifically, S1: collect images of actual on-road driving, carry out binarized saliency labeling on the images, determine the labels, and form a training set and a test set.
The specific process of step S1 is:
S1.1: The images can be obtained by shooting videos and splitting them into frames, extracting one image every 10 frames; the extracted images are the input of the neural network.
S1.2: Each pixel is labeled with a number corresponding to its category; there are 2 categories, distinguishing foreground from background, which yields a gray-scale image used as the ground truth of the output image.
S2: referring to fig. 2, on the basis of the training set, the input and labeled images are randomly scaled, cropped, boundary filled and flipped, so that the training set is expanded, and the precision is improved more with the expansion of the training set.
The method has the advantages that the number of the pixel points in each image is large, each pixel point is labeled, time and labor are wasted, omission or wrong labeling is caused, but a large number of images are helpful for improving the precision, so that the method can be used for preprocessing the images, and a good effect can be achieved by using fewer images.
The specific process of step S2 is:
S2.1: In each training iteration, the input images and their ground-truth labels are randomly reduced or enlarged.
S2.2: If the scaled image is larger than the original size, cropping starts from a random point; if it is smaller than the original size, the boundary is filled; finally a random horizontal or vertical flip is applied.
S2.3: The images used in each training iteration are therefore different, which expands the training set.
S3: and establishing a background model by calculating the mean value and the variance of each pixel point in the image, normalizing the pixel points and extracting the characteristics.
The specific process of step S3 is:
S3.1: Calculate the mean and variance of all image pixels to obtain the background model.
S3.2: Subtract the mean from the image and divide by the standard deviation to obtain data that follows a normal distribution; this removes the average brightness of the image, and data normalization improves the calculation accuracy of the network.
S4: inputting the training set of the preprocessed scene images into a convolution network shown in fig. 3 for training, learning image features in different aspects by using a multi-level, multi-task and multi-channel structure in the training process, and fusing a plurality of feature maps to improve the precision while maintaining the speed.
The multi-task refers to a detection mode in which the salient target detection task is primary and the salient contour detection task is auxiliary; the multi-level combines the feature maps formed by different convolutional layers in the network structure so as to combine multi-scale features; the multi-channel re-weights the channels according to the contribution of each feature channel to the saliency stimulus, so that channels with large contributions are given larger weights in the feature calculation. The channel refers to the number of channels of an image; for example, an RGB image has the three color channels red, green and blue, but each channel contributes differently to the saliency stimulus. The invention introduces an adaptive channel re-weighting branch to re-calibrate the weights of the convolutional feature channels; that is, the contribution of each feature channel is re-weighted and the weights are continuously optimized, which improves the accuracy of target detection.
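The exact internal structure of the re-weighting module of fig. 4 is not reproduced here; the following is a squeeze-and-excitation style sketch of one plausible realization, in which the global average pooling, the bottleneck ratio and the sigmoid gating are assumptions:

```python
import torch
import torch.nn as nn

class ChannelReweight(nn.Module):
    """Sketch of an adaptive channel re-weighting branch: each feature channel is assigned a
    learned weight according to its contribution to the saliency stimulus."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # per-channel global statistic
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                              # channel weight in (0, 1)
        )

    def forward(self, x):                              # x: (N, C, H, W)
        n, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * w                                   # re-calibrated feature map
```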
The specific process of step S4 is:
S4.1: The encoding section down-samples the 2048 × 1024 original image to 1/2 of its size through a convolutional layer with a stride of 2 and a 3 × 3 convolution kernel, which reduces the computational burden. Two further convolutional layers with a stride of 1 and 3 × 3 kernels do not change the image size but capture shallow features. The feature map obtained after these 3 convolution operations is 1024 × 512 × 32 pixels.
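A minimal sketch of the stem described in S4.1, assuming a 3-channel input and padding of 1 so that the stride-1 convolutions keep the spatial size; the batch normalization and ReLU layers are assumptions:

```python
import torch
import torch.nn as nn

# One stride-2 3x3 convolution halves the 2048x1024 input, then two stride-1 3x3 convolutions
# capture shallow features without changing the size, giving the 1024x512x32 feature map above.
stem = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
    nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
    nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 1024, 2048)   # N x C x H x W for a 2048x1024 input image
print(stem(x).shape)                # torch.Size([1, 32, 512, 1024]), i.e. a 1024x512x32 feature map
```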
S4.2: and performing four times of convolution on the input image by using the re-weighted convolution structure in fig. 4, wherein w, h and c in fig. 4 respectively represent the width, height and channel number of the feature map, and four groups of significant feature maps with different scales are obtained.
S4.3: and fusing the high-order characteristic graphs according to the re-weighting fusion unit in fig. 4, wherein the up-sampling operation in 4.4 is performed in the fusion process.
S4.4: the image is up-sampled by a bilinear interpolation method to double. The resulting feature size was 512 × 256 × the number of classes. Knowing the pixels of the four pixel points (i, j), (i, j +1), (i +1, j), (i +1, j +1), the pixel of the point (i + u, j + v) is obtained by a bilinear difference method as follows:
f(i+u,j+v)=(1-u)*(1-v)*f(i,j)+(1-u)*v*f(i,j+1)+u*(1-v)*f(i+1,j)+u*v*f(i+1,j+1)
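A direct, minimal implementation of this sampling rule is shown below; the mapping of each output pixel (r, c) to the input coordinate (r/2, c/2) for the two-fold up-sampling is one common convention and is an assumption (deep-learning frameworks provide an equivalent built-in bilinear up-sampling operator).

```python
import numpy as np

def bilinear_sample(img, y, x):
    """Evaluate the formula above at the fractional position (y, x) = (i+u, j+v) of a 2-D map."""
    i, j = int(np.floor(y)), int(np.floor(x))
    u, v = y - i, x - j
    i1, j1 = min(i + 1, img.shape[0] - 1), min(j + 1, img.shape[1] - 1)   # clamp at the border
    return ((1 - u) * (1 - v) * img[i, j] + (1 - u) * v * img[i, j1]
            + u * (1 - v) * img[i1, j] + u * v * img[i1, j1])

def upsample_2x(img):
    """S4.4: double the resolution of a 2-D feature map with the sampling rule above."""
    h, w = img.shape
    out = np.empty((2 * h, 2 * w), dtype=np.float64)
    for r in range(2 * h):
        for c in range(2 * w):
            out[r, c] = bilinear_sample(img, r / 2.0, c / 2.0)
    return out
```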
S4.5: The up-sampled feature map is fused with the shallow features extracted by the encoder to form a multi-level feature map. Since fusion increases the number of channels, a convolution with a stride of 1 and a 1 × 1 kernel is applied again to keep the number of channels equal to the number of categories. The resulting feature size is still 512 × 256 × the number of classes.
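A minimal sketch of this up-sample, concatenate and 1 × 1 convolution step of S4.4 and S4.5; the channel counts and the align_corners flag are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseWithShallow(nn.Module):
    """Up-sample the deep feature map two-fold, concatenate it with the shallow encoder feature,
    and restore the channel count to the number of categories with a stride-1 1x1 convolution."""
    def __init__(self, deep_ch, shallow_ch, num_classes=2):
        super().__init__()
        self.reduce = nn.Conv2d(deep_ch + shallow_ch, num_classes, kernel_size=1, stride=1)

    def forward(self, deep, shallow):
        up = F.interpolate(deep, scale_factor=2, mode='bilinear', align_corners=False)
        return self.reduce(torch.cat([up, shallow], dim=1))    # multi-level fused feature map
```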
S4.6: finally, sampling corresponding multiples on a plurality of groups of images with different scales through bilinear interpolation to enable the size of the obtained prediction output image to be the same as that of the original image; the feature size is 2048 × 1024 × the number of categories.
S4.7: each convolution network in the multi-depth feature fusion neural network optimizes the network through cross entropy loss, and the formula of a cross entropy function is as follows:
loss(x, class) = weight[class] * (-x[class] + log(∑_j exp(x[j])))
where x denotes the prediction output at a pixel, class denotes the true category of the pixel (foreground or background), weight[class] denotes the weighting coefficient of each category, x[class] denotes the prediction output for the true label class, and x[j] denotes the prediction output for label j.
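This per-pixel loss is the standard class-weighted cross-entropy; a minimal sketch for the two-class (background/foreground) case follows, where the class weights are illustrative assumptions:

```python
import torch
import torch.nn as nn

weights = torch.tensor([1.0, 2.0])            # weight[class] for background and foreground (assumed values)
criterion = nn.CrossEntropyLoss(weight=weights)

pred = torch.randn(1, 2, 512, 1024)           # per-pixel prediction x[j]: N x num_classes x H x W
target = torch.randint(0, 2, (1, 512, 1024))  # per-pixel true class
loss = criterion(pred, target)                # averaged weighted cross-entropy over all pixels
```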
In addition, a second auxiliary branch for salient target contour detection is introduced; its re-weighting module is the same as that of fig. 4, and the branch only focuses on low-level features, forming two groups of multi-scale feature maps.
The features are extracted and fused in the same way as above to obtain the fused contour feature information.
The fused target contour feature information and the target saliency feature information are then fused to obtain the final fused prediction image.
Finally, each test set is preprocessed in the same way as the training set, except that random scaling, cropping, boundary filling and flipping are not applied, and the detection precision is calculated with quantitative indexes such as the P-R curve, the ROC curve and the MAE.
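Of these quantitative indexes, the MAE is the simplest; a minimal sketch, assuming the predicted saliency map and the binary ground truth are both scaled to [0, 1]:

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a predicted saliency map and the binary ground truth."""
    return float(np.mean(np.abs(pred.astype(np.float64) - gt.astype(np.float64))))
```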
This embodiment solves the salient target detection problem in intelligent unmanned systems with the multi-depth feature fusion neural network. Images are extracted from practical application scenes (such as road scenes), randomly scaled, cropped, boundary-filled and flipped, and the training set is expanded; pixels in the images are normalized so that the pixel values lie between 0 and 1, eliminating the influence of other transformation functions on image transformation; in the decoding stage the information of the input image is restored by combining convolution and bilinear interpolation, obtaining an output image with salient feature information. The encoder-decoder structure satisfies the detection precision requirement; the combined multi-level, multi-task and multi-channel feature map fusion fully mines multi-aspect depth information and further improves accuracy; and the small 1 × 1 and 3 × 3 convolution kernels used in feature extraction improve the network operation speed.
The saliency target detection algorithm provided by this embodiment can effectively assist unmanned driving, meets the requirements for accuracy and real-time performance, and can solve the problems of an overly small target to be detected, a complex background, an unclear target contour, large memory occupation and long training time.
Example two
In one or more embodiments, disclosed is a multi-depth feature fused image saliency target detection system, comprising:
the device is used for acquiring the information of the image to be detected in a set scene;
means for inputting the image information to a trained multi-depth feature fusion neural network model;
the device is used for extracting the features of the multi-depth feature fusion neural network model by convolution in the encoding stage, restoring the information of the input image by combining an up-sampling method of convolution and bilinear interpolation in the decoding stage and outputting a feature map with significance information;
the device is used for learning the feature maps of different levels by adopting a multi-level network and fusing the feature maps of different levels;
and a means for outputting a final saliency target detection result.
It should be noted that the specific working manner of the apparatus is implemented by using the method disclosed in the first embodiment, and is not described again.
EXAMPLE III
In one or more embodiments, a terminal device is disclosed, which includes a server comprising a memory, a processor and a computer program stored on the memory and executable on the processor; when the processor executes the program, it implements the multi-depth feature fusion image saliency target detection method of the first embodiment. For brevity, details are not repeated here.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The multi-depth feature fusion image saliency target detection method of the first embodiment may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, details are not described here.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, the scope of the present invention is not limited thereto; those skilled in the art will understand that various modifications and variations can be made on the basis of the technical solution of the present invention without inventive effort.
Claims (10)
1. A method for detecting a multi-depth feature fused image saliency target is characterized by comprising the following steps:
acquiring to-be-detected image information under a set scene;
inputting the image information into a trained multi-depth feature fusion neural network model;
the multi-depth feature fusion neural network model adopts convolution to extract features in an encoding stage, restores information of an input image in a decoding stage by combining an up-sampling method of convolution and bilinear interpolation, and outputs a feature map with significance information;
learning feature maps of different levels by adopting a multi-level network, and fusing the feature maps of the different levels;
and outputting a final significant target detection result.
2. The multi-depth feature fusion image saliency target detection method of claim 1, characterized by adding a contour detection branch, extracting the contour feature information of the salient target through the multi-depth feature fusion neural network model, and refining the boundary details of the target to be detected by using the contour features; and then fusing the salient feature information of the image to be detected and the salient target contour feature information.
3. The multi-depth feature fusion image saliency target detection method of claim 1, characterized in that the training process of the multi-depth feature fusion neural network model comprises:
acquiring different image information under a set scene, performing binarization significance labeling on pixel points in each image, determining labels, and further forming a training set and a test set;
randomly zooming, cutting, filling a boundary and turning the training set image to expand the data set;
establishing a background model by calculating the mean value and variance of each pixel point in the image, normalizing the pixel points and extracting the characteristics;
inputting the extracted features into a multi-depth feature fusion neural network model for training;
and verifying the trained neural network model by using the test set image.
4. The multi-depth feature fusion image saliency target detection method as claimed in claim 3, wherein a background model is established by calculating the mean and variance of each pixel point in the image, the pixel points are normalized, and feature extraction is performed, the specific process comprising:
calculating the average value and variance of all image pixel points to obtain a background model;
and subtracting the mean from the pixel value of each pixel point of the image and dividing by the standard deviation to obtain data that follows a normal distribution.
5. The method for detecting the image saliency target by multi-depth feature fusion according to claim 1, wherein the multi-depth feature fusion neural network model performs feature extraction by convolution in a coding stage, specifically:
down-sampling the image to 1/2 of the original size through a convolutional layer;
capturing shallow detail features of the image through two convolutional layers.
6. The multi-depth feature fusion image saliency target detection method as claimed in claim 1, wherein an up-sampling method combining convolution and bilinear interpolation is used in a decoding stage to restore the information of the input image and output a feature map with saliency information, specifically:
carrying out four times of convolution on the input image by utilizing a re-weighted convolution structure to obtain four groups of significant feature maps with different scales;
up-sampling the image by a factor of two through bilinear interpolation;
fusing the up-sampled feature map with the corresponding scale feature map obtained in the encoding stage to form a multi-level feature map;
performing a convolution operation on the multi-level feature map, and maintaining the number of channels as the number of categories;
and (4) sampling a plurality of groups of images with different scales by corresponding multiples through bilinear interpolation, so that the obtained prediction output image has the same size as the original image.
7. The multi-depth feature fusion image saliency target detection method as claimed in claim 6, wherein the re-weighted convolution structure is specifically: a channel weight branch is introduced into the basic convolution unit, and the weight of each channel's contribution to saliency is adjusted through training.
8. A multi-depth feature fused image saliency target detection system, comprising:
the device is used for acquiring the information of the image to be detected in a set scene;
means for inputting the image information to a trained multi-depth feature fusion neural network model;
the device is used for extracting the features of the multi-depth feature fusion neural network model by convolution in the encoding stage, restoring the information of the input image by combining an up-sampling method of convolution and bilinear interpolation in the decoding stage and outputting a feature map with significance information;
the device is used for learning the feature maps of different levels by adopting a multi-level network and fusing the feature maps of different levels;
and a means for outputting a final saliency target detection result.
9. A terminal device comprising a processor and a computer-readable storage medium, the processor being configured to implement instructions; a computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method for multi-depth feature fused image saliency target detection of any one of claims 1-7.
10. A computer-readable storage medium having stored therein a plurality of instructions, wherein the instructions are adapted to be loaded by a processor of a terminal device and to perform the multi-depth feature fused image saliency target detection method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010832414.6A CN112132156B (en) | 2020-08-18 | 2020-08-18 | Image saliency target detection method and system based on multi-depth feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010832414.6A CN112132156B (en) | 2020-08-18 | 2020-08-18 | Image saliency target detection method and system based on multi-depth feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112132156A true CN112132156A (en) | 2020-12-25 |
CN112132156B CN112132156B (en) | 2023-08-22 |
Family
ID=73850349
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010832414.6A Active CN112132156B (en) | 2020-08-18 | 2020-08-18 | Image saliency target detection method and system based on multi-depth feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112132156B (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112766285A (en) * | 2021-01-26 | 2021-05-07 | 北京有竹居网络技术有限公司 | Image sample generation method and device and electronic equipment |
CN112837360A (en) * | 2021-01-07 | 2021-05-25 | 北京百度网讯科技有限公司 | Depth information processing method, apparatus, device, storage medium, and program product |
CN112903692A (en) * | 2021-01-18 | 2021-06-04 | 无锡金元启信息技术科技有限公司 | Industrial hole wall defect detection system and identification algorithm based on AI |
CN112967322A (en) * | 2021-04-07 | 2021-06-15 | 深圳创维-Rgb电子有限公司 | Moving object detection model establishing method and moving object detection method |
CN113052188A (en) * | 2021-03-26 | 2021-06-29 | 大连理工大学人工智能大连研究院 | Method, system, equipment and storage medium for detecting remote sensing image target |
CN113313129A (en) * | 2021-06-22 | 2021-08-27 | 中国平安财产保险股份有限公司 | Method, device and equipment for training disaster recognition model and storage medium |
CN113515660A (en) * | 2021-07-16 | 2021-10-19 | 广西师范大学 | Depth feature contrast weighted image retrieval method based on three-dimensional tensor contrast strategy |
CN113538379A (en) * | 2021-07-16 | 2021-10-22 | 河南科技学院 | Double-stream coding fusion significance detection method based on RGB and gray level image |
CN113567436A (en) * | 2021-07-22 | 2021-10-29 | 上海交通大学 | Saliency target detection device and method based on deep convolutional neural network |
CN113641845A (en) * | 2021-07-16 | 2021-11-12 | 广西师范大学 | Depth feature contrast weighted image retrieval method based on vector contrast strategy |
CN113724286A (en) * | 2021-08-09 | 2021-11-30 | 浙江大华技术股份有限公司 | Method and device for detecting saliency target and computer-readable storage medium |
CN114332633A (en) * | 2022-03-01 | 2022-04-12 | 北京化工大学 | Radar image target detection and identification method, equipment and storage medium |
CN115035377A (en) * | 2022-06-15 | 2022-09-09 | 天津大学 | Significance detection network system based on double-stream coding and interactive decoding |
CN115331082A (en) * | 2022-10-13 | 2022-11-11 | 天津大学 | Path generation method of tracking sound source, training method of model and electronic equipment |
CN115527027A (en) * | 2022-03-04 | 2022-12-27 | 西南民族大学 | Remote sensing image ground object segmentation method based on multi-feature fusion mechanism |
CN115908982A (en) * | 2022-12-01 | 2023-04-04 | 北京百度网讯科技有限公司 | Image processing method, model training method, device, equipment and storage medium |
CN116703817A (en) * | 2023-03-23 | 2023-09-05 | 国网山东省电力公司莱芜供电公司 | Transmission line detection method and system based on saliency target detection |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109816100A (en) * | 2019-01-30 | 2019-05-28 | 中科人工智能创新技术研究院(青岛)有限公司 | A kind of conspicuousness object detecting method and device based on two-way fusion network |
CN110598610A (en) * | 2019-09-02 | 2019-12-20 | 北京航空航天大学 | Target significance detection method based on neural selection attention |
CN110909594A (en) * | 2019-10-12 | 2020-03-24 | 杭州电子科技大学 | Video significance detection method based on depth fusion |
CN111428805A (en) * | 2020-04-01 | 2020-07-17 | 南开大学 | Method and device for detecting salient object, storage medium and electronic equipment |
- 2020-08-18 CN CN202010832414.6A patent/CN112132156B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109816100A (en) * | 2019-01-30 | 2019-05-28 | 中科人工智能创新技术研究院(青岛)有限公司 | A kind of conspicuousness object detecting method and device based on two-way fusion network |
CN110598610A (en) * | 2019-09-02 | 2019-12-20 | 北京航空航天大学 | Target significance detection method based on neural selection attention |
CN110909594A (en) * | 2019-10-12 | 2020-03-24 | 杭州电子科技大学 | Video significance detection method based on depth fusion |
CN111428805A (en) * | 2020-04-01 | 2020-07-17 | 南开大学 | Method and device for detecting salient object, storage medium and electronic equipment |
Non-Patent Citations (2)
Title |
---|
WENCHI MA et al.: "MDFN: Multi-scale deep feature learning network for object detection", ELSEVIER *
PEI Xiaokang et al.: "Automatic matting algorithm based on fuzzy-estimation fused saliency detection", Application Research of Computers (计算机应用研究), vol. 29, no. 10 *
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112837360A (en) * | 2021-01-07 | 2021-05-25 | 北京百度网讯科技有限公司 | Depth information processing method, apparatus, device, storage medium, and program product |
CN112837360B (en) * | 2021-01-07 | 2023-08-11 | 北京百度网讯科技有限公司 | Depth information processing method, apparatus, device, storage medium, and program product |
CN112903692A (en) * | 2021-01-18 | 2021-06-04 | 无锡金元启信息技术科技有限公司 | Industrial hole wall defect detection system and identification algorithm based on AI |
CN112766285A (en) * | 2021-01-26 | 2021-05-07 | 北京有竹居网络技术有限公司 | Image sample generation method and device and electronic equipment |
CN112766285B (en) * | 2021-01-26 | 2024-03-19 | 北京有竹居网络技术有限公司 | Image sample generation method and device and electronic equipment |
CN113052188A (en) * | 2021-03-26 | 2021-06-29 | 大连理工大学人工智能大连研究院 | Method, system, equipment and storage medium for detecting remote sensing image target |
CN112967322A (en) * | 2021-04-07 | 2021-06-15 | 深圳创维-Rgb电子有限公司 | Moving object detection model establishing method and moving object detection method |
CN113313129A (en) * | 2021-06-22 | 2021-08-27 | 中国平安财产保险股份有限公司 | Method, device and equipment for training disaster recognition model and storage medium |
CN113313129B (en) * | 2021-06-22 | 2024-04-05 | 中国平安财产保险股份有限公司 | Training method, device, equipment and storage medium for disaster damage recognition model |
CN113515660A (en) * | 2021-07-16 | 2021-10-19 | 广西师范大学 | Depth feature contrast weighted image retrieval method based on three-dimensional tensor contrast strategy |
CN113641845A (en) * | 2021-07-16 | 2021-11-12 | 广西师范大学 | Depth feature contrast weighted image retrieval method based on vector contrast strategy |
CN113538379A (en) * | 2021-07-16 | 2021-10-22 | 河南科技学院 | Double-stream coding fusion significance detection method based on RGB and gray level image |
CN113567436A (en) * | 2021-07-22 | 2021-10-29 | 上海交通大学 | Saliency target detection device and method based on deep convolutional neural network |
CN113724286A (en) * | 2021-08-09 | 2021-11-30 | 浙江大华技术股份有限公司 | Method and device for detecting saliency target and computer-readable storage medium |
CN114332633A (en) * | 2022-03-01 | 2022-04-12 | 北京化工大学 | Radar image target detection and identification method, equipment and storage medium |
CN115527027A (en) * | 2022-03-04 | 2022-12-27 | 西南民族大学 | Remote sensing image ground object segmentation method based on multi-feature fusion mechanism |
CN115035377A (en) * | 2022-06-15 | 2022-09-09 | 天津大学 | Significance detection network system based on double-stream coding and interactive decoding |
CN115035377B (en) * | 2022-06-15 | 2024-08-23 | 天津大学 | Significance detection network system based on double-flow coding and interactive decoding |
CN115331082B (en) * | 2022-10-13 | 2023-02-03 | 天津大学 | Path generation method of tracking sound source, training method of model and electronic equipment |
CN115331082A (en) * | 2022-10-13 | 2022-11-11 | 天津大学 | Path generation method of tracking sound source, training method of model and electronic equipment |
CN115908982A (en) * | 2022-12-01 | 2023-04-04 | 北京百度网讯科技有限公司 | Image processing method, model training method, device, equipment and storage medium |
CN115908982B (en) * | 2022-12-01 | 2024-07-02 | 北京百度网讯科技有限公司 | Image processing method, model training method, device, equipment and storage medium |
CN116703817A (en) * | 2023-03-23 | 2023-09-05 | 国网山东省电力公司莱芜供电公司 | Transmission line detection method and system based on saliency target detection |
Also Published As
Publication number | Publication date |
---|---|
CN112132156B (en) | 2023-08-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112132156B (en) | Image saliency target detection method and system based on multi-depth feature fusion | |
CN111666921B (en) | Vehicle control method, apparatus, computer device, and computer-readable storage medium | |
CN107274445B (en) | Image depth estimation method and system | |
CN109753913B (en) | Multi-mode video semantic segmentation method with high calculation efficiency | |
CN111696110B (en) | Scene segmentation method and system | |
CN109960742B (en) | Local information searching method and device | |
WO2021218786A1 (en) | Data processing system, object detection method and apparatus thereof | |
CN110298262A (en) | Object identification method and device | |
CN111260666B (en) | Image processing method and device, electronic equipment and computer readable storage medium | |
CN111814902A (en) | Target detection model training method, target identification method, device and medium | |
CN110781744A (en) | Small-scale pedestrian detection method based on multi-level feature fusion | |
CN109871792B (en) | Pedestrian detection method and device | |
CN112446292B (en) | 2D image salient object detection method and system | |
CN113673562B (en) | Feature enhancement method, object segmentation method, device and storage medium | |
CN107563290A (en) | A kind of pedestrian detection method and device based on image | |
CN111553414A (en) | In-vehicle lost object detection method based on improved Faster R-CNN | |
CN112597918A (en) | Text detection method and device, electronic equipment and storage medium | |
CN112149526B (en) | Lane line detection method and system based on long-distance information fusion | |
CN114220126A (en) | Target detection system and acquisition method | |
CN116129291A (en) | Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device | |
CN112597996A (en) | Task-driven natural scene-based traffic sign significance detection method | |
CN114743045B (en) | Small sample target detection method based on double-branch area suggestion network | |
CN116309050A (en) | Image super-resolution method, program product, storage medium and electronic device | |
CN111539420B (en) | Panoramic image saliency prediction method and system based on attention perception features | |
CN115731530A (en) | Model training method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |