CN112528899B - Image salient object detection method and system based on implicit depth information recovery - Google Patents

Image salient object detection method and system based on implicit depth information recovery

Info

Publication number
CN112528899B
CN112528899B (application number CN202011500709.XA)
Authority
CN
China
Prior art keywords
depth information
module
network
implicit
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011500709.XA
Other languages
Chinese (zh)
Other versions
CN112528899A (en)
Inventor
程明明
吴宇寰
刘云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University filed Critical Nankai University
Priority to CN202011500709.XA priority Critical patent/CN112528899B/en
Publication of CN112528899A publication Critical patent/CN112528899A/en
Application granted granted Critical
Publication of CN112528899B publication Critical patent/CN112528899B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image salient object detection method and system based on implicit depth information recovery. The method comprises: obtaining a target image and the image depth information corresponding to the target image; simultaneously inputting the target image and its corresponding image depth information into a trained neural network model based on implicit depth information recovery; performing, with the model, feature extraction on the target image and feature extraction on the image depth information respectively; performing cross-modal feature fusion on the features obtained by the two feature extractions; fusing the cross-modally fused features with the features extracted from the target image to obtain final fusion features; and predicting from the final fusion features to obtain a predicted salient object image.

Description

Image salient object detection method and system based on implicit depth information recovery
Technical Field
The application relates to the technical field of computer vision, in particular to an image saliency object detection method and system based on implicit depth information recovery.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
SOD (Salient Object Detection) is a basic task in the field of computer vision, whose goal is to identify and segment the most eye-catching object in a given natural image. This basic task is also often used as a preprocessing step by many other computer-vision tasks, such as object tracking, image editing and weakly supervised learning. Although salient object detection based on deep learning has developed rapidly and achieved great success, current salient object detection methods usually take only the natural scene as input, so these methods often fail when the foreground and background textures are difficult to distinguish.
Based on the above recognition, researchers have begun to investigate how depth information can help salient object detection, because depth information provides coarse spatial information, even though it is sometimes unreliable owing to the limitations of depth sensors. This line of research has been largely successful and has produced many highly accurate depth image salient object detection (RGB-D SOD) methods. However, these methods are usually built on heavy network structures such as VGG, ResNet and DenseNet, which have large model sizes and require a large amount of computational power. This limitation makes the newly developed depth image salient object detection methods difficult to apply in real scenes, especially on mobile devices that carry depth sensors but have low computational power and tight energy budgets. Based on the above observations, it is not practical to deploy the previously developed depth image salient object detection methods on mobile devices.
Disclosure of Invention
In order to solve the problem that the basic convolutional neural network structures used by existing depth image salient object detection methods are usually heavy networks such as VGG, ResNet and DenseNet, the application provides an image salient object detection method and system based on implicit depth information recovery. The method is, for the first time, designed on top of mobile convolutional neural networks such as MobileNet and ShuffleNet, and its low computational cost and high speed allow it to be applied on mobile devices. Because the features extracted by mobile convolutional neural networks have limited expressive power, an implicit depth information recovery module is designed based on the implicit depth information recovery technique newly proposed in this application, which greatly enhances the feature expression capability of the bottom-up network built on the mobile convolutional neural network.
In a first aspect, the application provides an image salient object detection method based on implicit depth information recovery;
the image salient object detection method based on implicit depth information recovery comprises the following steps:
acquiring a target image and image depth information corresponding to the target image;
simultaneously inputting the target image and the image depth information corresponding to the target image into a trained neural network model based on implicit depth information recovery; performing, with the neural network model based on implicit depth information recovery, feature extraction on the target image and feature extraction on the image depth information respectively; performing cross-modal feature fusion on the features obtained by the two feature extractions; performing feature fusion on the features obtained by the cross-modal feature fusion and the features obtained by feature extraction on the target image to obtain final fusion features; and predicting from the final fusion features to obtain a predicted salient object image.
In a second aspect, the present application provides an image salient object detection system based on implicit depth information recovery;
an image salient object detection system based on implicit depth information recovery, comprising:
an acquisition module configured to: acquiring a target image and image depth information corresponding to the target image;
a detection module configured to: simultaneously inputting the target image and the image depth information corresponding to the target image into a trained neural network model based on implicit depth information recovery; performing, with the neural network model based on implicit depth information recovery, feature extraction on the target image and feature extraction on the image depth information respectively; performing cross-modal feature fusion on the features obtained by the two feature extractions; performing feature fusion on the features obtained by the cross-modal feature fusion and the features obtained by feature extraction on the target image to obtain final fusion features; and predicting from the final fusion features to obtain a predicted salient object image.
In a third aspect, the present application further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present application also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
In a fifth aspect, the present application also provides a computer program (product) comprising a computer program for implementing the method of any of the preceding first aspects when run on one or more processors.
Compared with the prior art, the beneficial effects of this application are:
based on a novel implicit depth information recovery technology, a novel convolutional neural network is designed for rapidly detecting a significant object. The network uses a depth image, namely a natural image and corresponding depth information as input and comprises two bottom-up networks, a cross-modal feature fusion module, an implicit depth information recovery module and a top-down network. And the natural images and the corresponding depth information maps are respectively selected by the input of the two bottom-up networks. And the input selection of the cross-modal feature fusion module performs feature fusion on the last feature output by the two bottom-up networks to obtain the feature after the cross-modal feature fusion.
The implicit depth information recovery module takes as input all features, except the last one, output by the bottom-up network whose input is the image information, together with the feature output by the cross-modal feature fusion module, and it greatly improves the accuracy of the salient objects output at inference time without increasing the inference computation of the method.
The method uses a mobile network for the first time to solve the salient object detection task for depth images. In addition, to address the limited expressive power of the features extracted by the mobile network, an implicit depth information recovery module is designed based on the implicit depth information recovery technique newly proposed in this application, which greatly enhances the feature expression capability of the bottom-up network built on the mobile convolutional neural network and thereby greatly improves the salient object detection accuracy of the method.
The method solves the salient object detection task for depth images with a mobile network for the first time, overcomes the excessive computation of other methods, and can be applied to mobile devices in real scenes. Meanwhile, the implicit depth information recovery module designed in this application enhances the expressive power of the features extracted by the mobile convolutional neural network without increasing the inference computation of the network model, so that the salient object detection performance of the network model is greatly improved.
Advantages of additional aspects of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 illustrates the steps of the present application;
FIG. 2 is a block diagram of an implicit depth information recovery module used in the present application;
FIG. 3 is a block diagram of a cross-modal feature fusion module of the present application;
FIGS. 4(a)-4(k) are comparison diagrams of the present application and other methods.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and "comprising", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Example one
The embodiment provides an image salient object detection method based on implicit depth information recovery;
the image salient object detection method based on implicit depth information recovery comprises the following steps:
s101: acquiring a target image and image depth information corresponding to the target image;
s102: simultaneously inputting the target image and the image depth information corresponding to the target image into a trained neural network model based on implicit depth information recovery;
performing, with the neural network model based on implicit depth information recovery, feature extraction on the target image and feature extraction on the image depth information respectively; performing cross-modal feature fusion on the features obtained by the two feature extractions;
performing feature fusion on the features obtained by the cross-modal feature fusion and the features obtained by feature extraction on the target image to obtain final fusion features;
and predicting from the final fusion features to obtain a predicted salient object image.
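For illustration only, the following PyTorch-style sketch shows how the components described in the remainder of this embodiment fit together to realize the inference flow of S101-S102. The sub-module names (rgb_backbone, depth_backbone, fusion, decoder) are placeholders introduced here, not the original implementation; the implicit depth information recovery module does not appear because it is used only during training and is removed afterwards.

```python
# Illustrative wiring only: the four sub-modules are assumed to be nn.Modules
# built as described later in this embodiment; the names are placeholders.
import torch.nn as nn


class SalientObjectDetector(nn.Module):
    def __init__(self, rgb_backbone, depth_backbone, fusion, decoder):
        super().__init__()
        self.rgb_backbone = rgb_backbone      # first bottom-up network (target image)
        self.depth_backbone = depth_backbone  # second bottom-up network (depth map)
        self.fusion = fusion                  # cross-modal feature fusion module
        self.decoder = decoder                # top-down network + prediction head

    def forward(self, image, depth):
        # Feature extraction on the target image: stage features c1..c5.
        c1, c2, c3, c4, c5 = self.rgb_backbone(image)
        # Feature extraction on the image depth information: only the last stage feature d5.
        d5 = self.depth_backbone(depth)[-1]
        # Cross-modal feature fusion of the two top-level features.
        c5d = self.fusion(c5, d5)
        # Top-down fusion with the image features, then saliency prediction.
        return self.decoder(c5d, [c4, c3, c2, c1])
```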
As one or more embodiments, the network structure of the neural network model based on implicit depth information recovery comprises:
a first bottom-up network and a second bottom-up network connected in parallel;
the output end of the first bottom-up network and the output end of the second bottom-up network are both connected with the input end of the cross-modal characteristic fusion module;
the output end of the cross-modal characteristic fusion module is respectively connected with the input end of the implicit depth recovery module and the input end of the top-down network;
the output end of each basic module of the first bottom-up network is connected with the implicit depth recovery module; and,
the output end of each basic module of the first bottom-up network is connected with the corresponding feature fusion unit in the top-down network.
Further, the first bottom-up network and the second bottom-up network are both implemented by using MobileNet V1.
Wherein the first bottom-up network and the second bottom-up network are both MobileNetV1 networks with the last 3 layers removed, the removed last 3 layers being: a pooling layer, a fully connected layer and a SoftMax layer.
The MobileNetV1 network processed as described above has 5 convolutional layers with stride 2. Taking these stride-2 convolutional layers as markers, each of the two bottom-up networks is divided, in connection order, into 5 stages, the first layer of each stage being a stride-2 convolutional layer. The operations of each stage can be regarded as one basic module, so the processed MobileNetV1 network has 5 basic modules, referred to in turn as the first to fifth basic modules.
The output ends of the first four stages of the first bottom-up network and the output end of the cross-modal characteristic fusion module are connected with the input end of the implicit depth recovery module;
the internal structure of the first bottom-up network and the second bottom-up network is the same.
The first bottom-up network comprising: the first base module, the second base module, the third base module, the fourth base module and the fifth base module are connected in sequence; wherein, first basic module includes: a first convolution layer, a first depth separable convolution layer and a second convolution layer connected in sequence;
a second base module comprising: a second, third and fourth depth-separable convolutional layer connected in sequence;
a third base module comprising: a fourth, fifth and sixth depth-separable convolutional layer connected in sequence;
a fourth base module comprising: a sixth, seventh and eighth depth-separable convolutional layer connected in sequence;
a fifth base module comprising: an eighth depth-separable convolutional layer, a ninth depth-separable convolutional layer, and a tenth convolutional layer connected in this order.
In the fifth basic module, only one convolution has stride 2; the stride of the third convolution of the fifth basic module is actually 1.
Further, the first bottom-up network is used for realizing feature extraction of the target image.
And further, the second bottom-up network is used for realizing the extraction of the features in the image depth information.
Further, the cross-modal feature fusion module includes:
the first multiplier, the depth separable convolution layer DwConv, the global average pooling layer GAP, the first full-connection layer, the second full-connection layer and the second multiplier are sequentially connected in series;
the input end of the first multiplier is respectively connected with the output end of the first bottom-up network and the output end of the first bottom-up network;
wherein the output terminal of the depth separable convolutional layer DwConv is connected with the input terminal of the second multiplier;
and the output end of the second multiplier is used as the output end of the cross-modal characteristic fusion module.
Further, the cross-modal feature fusion module is configured to implement fusion of features extracted by the first bottom-up network and features extracted by the second bottom-up network, and input the fused features to an input end of the top-down network.
Further, the cross-modal feature fusion module is further configured to implement fusion of features extracted by the first bottom-up network and features extracted by the second bottom-up network, input the fused features into the implicit depth recovery module, and assist the implicit depth recovery module in generating the predicted image depth information.
Further, the top-down network comprises:
the first feature fusion unit, the second feature fusion unit, the third feature fusion unit, the fourth feature fusion unit and the fifth feature fusion unit are connected in sequence;
the input end of the first feature fusion unit is connected with the output end of the cross-modal feature fusion module;
the input end of the second feature fusion unit is also connected with the output end of a fourth basic module of the first bottom-up network;
the input end of the third feature fusion unit is also connected with the output end of a third basic module of the first bottom-up network;
the input end of the fourth feature fusion unit is also connected with the output end of the second basic module of the first bottom-up network;
the input end of the fifth feature fusion unit is also connected with the output end of the first basic module of the first bottom-up network;
the output end of the fifth feature fusion unit is connected with the input end of the convolution layer, the output end of the convolution layer is connected with the input end of the sigmoid function layer, and the output end of the sigmoid function layer is used for outputting the predicted significance map.
Further, the top-down network is used for realizing the fusion of the features fused by the cross-modal feature fusion module and the extracted features of each basic module of the first bottom-up network.
Further, the implicit depth recovery module includes:
a convolutional layer C1, a convolutional layer C2 and a convolutional layer C3 arranged in parallel;
the input values of convolutional layer C1 are the output value of a first basic module of the first bottom-up network and the output value of a second basic module of the first bottom-up network;
the input value of convolutional layer C2 is the output value of the third basic module of the first bottom-up network;
the input value of the convolutional layer C3 is the output value of the fourth basic module of the first bottom-up network and the output value of the cross-modal feature fusion module;
stacking the output characteristics of the convolutional layer C1, the convolutional layer C2 and the convolutional layer C3 according to the channels to obtain stacking characteristics;
inputting the stacking feature into the input end of the convolutional layer C4;
the output terminal of convolutional layer C4 is connected to the first depth-separable convolutional layer DwConv,
the second depth-separable convolutional layer DwConv is connected with the third depth-separable convolutional layer DwConv;
the third depth-separable convolutional layer DwConv is connected with the fourth depth-separable convolutional layer DwConv;
the fourth depth-separable convolutional layer DwConv is connected to an input terminal of convolutional layer C5;
the output of convolutional layer C5 outputs the recovered depth map.
Further, the implicit depth recovery module is used, during training, to assist in measuring whether the training meets the target requirement.
As one or more embodiments, the training of the trained neural network model based on implicit depth information recovery comprises:
constructing a neural network model based on implicit depth information recovery;
constructing a training set, wherein the training set is an original image of a known salient object and depth information of the original image;
inputting the training set into a neural network model based on implicit depth information recovery, and training the neural network model based on implicit depth information recovery;
the implicit depth recovery module outputs a recovered depth map;
calculating a loss function value based on the restored depth map and the depth information of the original image;
and when the loss function value reaches the minimum value, stopping training to obtain the trained neural network model based on implicit depth information recovery.
The image depth information is a single-channel image including information on the distance from the viewpoint to the surface of the scene object.
The two bottom-up networks, which take the image and the depth information as their inputs, both use mobile-side networks such as MobileNet and ShuffleNet instead of heavier networks such as VGG, ResNet and DenseNet.
The implicit depth recovery module takes as input all features, except the last-level one, output by the bottom-up network whose input is the image information, together with the feature output by the cross-modal feature fusion module; the supervision for the output of the implicit depth recovery module is the depth information input to the application. The module is removed after network training is finished, so no additional feature generated by the implicit depth information recovery module is used during actual use of the network.
The input of the cross-modal feature fusion module contains only the top-most features output by the two bottom-up networks, and no other features.
The fast salient object detection method for depth images based on implicit depth information recovery comprises the following steps:
a. Design a new convolutional neural network model comprising the following modules: two parallel bottom-up networks, a cross-modal feature fusion module, an implicit depth information recovery module and a top-down network.
b. Input a depth image, namely an image together with its depth information, and extract features with the two bottom-up networks, which are built on mobile convolutional neural networks. The bottom-up network taking the image information as input extracts the bottom-up features c1, c2, c3, c4 and c5 in turn, and the bottom-up network taking the depth information as input extracts the feature d5.
For the two parallel bottom-up networks, a variety of mobile convolutional neural networks can be selected, such as the well-known MobileNetV1, MobileNetV2, MobileNetV3, ShuffleNetV1, ShuffleNetV2 and MNASNet. The present application takes MobileNetV1 as an example. The last three layers of MobileNetV1, namely the pooling layer, the fully connected layer and the softmax classification layer, are removed, and the two parallel bottom-up networks are two parallel MobileNetV1 networks processed in this way. The MobileNetV1 network processed as described above has 5 convolutional layers with stride 2. Taking these stride-2 convolutional layers as markers, each of the two bottom-up networks is divided, in connection order, into 5 stages, the first layer of each stage being a stride-2 convolutional layer, as sketched below. The bottom-up network taking the image information Im as input extracts the c1, c2, c3, c4 and c5 features from the 5 stages in sequence. The bottom-up network taking the depth information Dg as input extracts only the d5 feature from its last stage.
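For illustration, a minimal PyTorch sketch of such a five-stage MobileNetV1-style bottom-up network is given below. It follows the standard MobileNetV1 layer configuration, truncated before the pooling, fully connected and softmax layers and grouped into stages at the stride-2 convolutions; the class and function names, and the exact per-stage layer counts, are assumptions made for illustration rather than the patent's reference implementation.

```python
import torch.nn as nn


def conv_bn(inp, oup, stride):
    # Standard 3x3 convolution + BN + ReLU (MobileNetV1 stem).
    return nn.Sequential(
        nn.Conv2d(inp, oup, 3, stride, 1, bias=False),
        nn.BatchNorm2d(oup), nn.ReLU(inplace=True))


def dw_sep(inp, oup, stride):
    # Depthwise separable convolution: 3x3 depthwise + 1x1 pointwise.
    return nn.Sequential(
        nn.Conv2d(inp, inp, 3, stride, 1, groups=inp, bias=False),
        nn.BatchNorm2d(inp), nn.ReLU(inplace=True),
        nn.Conv2d(inp, oup, 1, 1, 0, bias=False),
        nn.BatchNorm2d(oup), nn.ReLU(inplace=True))


class MobileNetV1Backbone(nn.Module):
    """MobileNetV1 with pooling/FC/softmax removed, split into 5 stages,
    each stage starting at a stride-2 convolution."""

    def __init__(self, in_channels=3):
        super().__init__()
        self.stage1 = nn.Sequential(conv_bn(in_channels, 32, 2), dw_sep(32, 64, 1))
        self.stage2 = nn.Sequential(dw_sep(64, 128, 2), dw_sep(128, 128, 1))
        self.stage3 = nn.Sequential(dw_sep(128, 256, 2), dw_sep(256, 256, 1))
        self.stage4 = nn.Sequential(dw_sep(256, 512, 2),
                                    *[dw_sep(512, 512, 1) for _ in range(5)])
        self.stage5 = nn.Sequential(dw_sep(512, 1024, 2), dw_sep(1024, 1024, 1))

    def forward(self, x):
        c1 = self.stage1(x)
        c2 = self.stage2(c1)
        c3 = self.stage3(c2)
        c4 = self.stage4(c3)
        c5 = self.stage5(c4)
        return c1, c2, c3, c4, c5
```

Under this sketch, the image branch would be constructed as MobileNetV1Backbone(in_channels=3) and the depth branch as MobileNetV1Backbone(in_channels=1), since the depth information is a single-channel image.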
c. The c5 and d5 features obtained in step b are input into the cross-modal feature fusion module to obtain the fused feature c5d.
The c5 feature and the d5 feature are input into the cross-modal feature fusion module. The module may select the configuration of fig. 2. The c5 and d5 features are multiplied element by element and passed through a depth separable convolution with a convolution kernel size of 3x3, stride 1 and padding 1, which can be expressed as follows:

X = DwConv(C5 ⊗ D5),

wherein C5 and D5 represent the input c5 and d5 features respectively, DwConv is a depth separable convolution, ⊗ denotes element-wise multiplication of the two input feature matrices, and X represents the intermediate feature obtained by this operation. The present application then applies global average pooling to X, followed by two fully connected layers (whose number of output neurons equals their number of input neurons) for further global feature extraction, obtaining the attention feature vector of the X feature. The process can be represented by the following formula:

Y = fc2(fc1(GAP(X))),

wherein fc1 and fc2 represent the two fully connected layers used, whose number of output neurons equals their number of input neurons, GAP denotes global average pooling, which averages the input feature over its last two (spatial) dimensions to obtain a spatially averaged feature vector, and Y represents the output attention feature vector. After obtaining the intermediate feature X and the attention feature vector Y, the present application first expands the vector Y to the same size as the intermediate feature X, and then combines them using the following operation:

K = X ⊗ σ(Y),

where σ represents a sigmoid operation that normalizes the input elements to the range (0, 1), and K represents the output feature filtered by the attention feature vector. In the application, K is the output of the cross-modal feature fusion module, i.e. the c5d feature.
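A minimal PyTorch sketch of the cross-modal feature fusion formulas above is given below. The channel width of 1024 follows the earlier MobileNetV1 sketch, and realizing DwConv as a 3x3 depthwise convolution followed by a 1x1 pointwise convolution is an assumption; both are illustrative choices, not details taken from the patent.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuses the top-level image feature c5 with the depth feature d5."""

    def __init__(self, channels=1024):
        super().__init__()
        # DwConv: 3x3 depthwise convolution followed by a 1x1 pointwise convolution
        # (one way to realize the depth separable convolution in the formula).
        self.dwconv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, 1, 1, groups=channels),
            nn.Conv2d(channels, channels, 1))
        # Two fully connected layers whose output width equals their input width.
        self.fc1 = nn.Linear(channels, channels)
        self.fc2 = nn.Linear(channels, channels)

    def forward(self, c5, d5):
        x = self.dwconv(c5 * d5)                     # X = DwConv(C5 (x) D5)
        y = self.fc2(self.fc1(x.mean(dim=(2, 3))))   # Y = fc2(fc1(GAP(X)))
        y = torch.sigmoid(y)[:, :, None, None]       # sigma(Y), expanded to X's size
        return x * y                                 # K = X (x) sigma(Y), i.e. the c5d feature
```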
d. The c1, c2, c3 and c4 features obtained in step b and the c5d feature obtained in step c are input together into the implicit depth information recovery module to obtain a recovered depth information map, and the recovered depth information loss is calculated with respect to the input depth information.
The c1, c2, c3, c4 and c5d features are input into the implicit depth information recovery module, which outputs a recovered depth map; the loss of the recovered depth map is then calculated with respect to the depth information map input to the network. The implicit depth information recovery module may use the structure in fig. 3. In this module, all input features are first up-sampled or down-sampled to a common resolution and passed through a convolution with kernel size 1x1 and 256 output channels; all features are then up-sampled or down-sampled to the size of c3 so that they can be stacked into a new feature along the channel dimension. Subsequently, the present application uses another convolution with kernel size 1x1 and 256 output channels to reduce the number of channels of the new feature, which saves a large amount of unnecessary computation and thus speeds up training. The feature then passes through 4 convolutions with kernel size 3x3 whose number of output channels equals their number of input channels, and finally a convolution with kernel size 1x1 and a single output channel, which is convenient for recovering the depth map since the depth map is also single-channel. Applying a sigmoid operation to the output feature, so that its elements are normalized to the range (0, 1), yields the recovered depth map Dr, and the recovered depth map loss is calculated with respect to the depth information Dg input to the network:
L_D = 1 − SSIM(Dr, Dg),

wherein L_D is the recovered depth map loss and SSIM represents the structural similarity index, which can be calculated by:

SSIM(x, y) = ((2μ_x μ_y + C1)(2σ_xy + C2)) / ((μ_x^2 + μ_y^2 + C1)(σ_x^2 + σ_y^2 + C2)),

wherein μ_x and μ_y represent the means of x and y, σ_x^2 and σ_y^2 represent the variances of x and y, σ_xy represents the covariance of x and y, and the constant terms C1 and C2 are used to stabilize the result.
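Purely for illustration, a sketch of the implicit depth information recovery module and the SSIM-based recovery loss is given below in PyTorch. The channel widths follow the earlier backbone sketch, the ordering of the resampling and 1x1 convolutions is slightly simplified, and the loss is written in the common 1 − SSIM form with local average-pooled statistics and default window size and constants; all of these details are assumptions rather than values specified in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImplicitDepthRecovery(nn.Module):
    """Training-only head recovering a depth map from [c1, c2, c3, c4, c5d]."""

    def __init__(self, in_channels=(64, 128, 256, 512, 1024), mid=256):
        super().__init__()
        # One 1x1 convolution per input feature, reducing it to `mid` channels.
        self.reduce = nn.ModuleList([nn.Conv2d(c, mid, 1) for c in in_channels])
        # 1x1 convolution that squeezes the stacked feature back to `mid` channels.
        self.squeeze = nn.Conv2d(mid * len(in_channels), mid, 1)
        # Four 3x3 depthwise separable convolutions keeping the channel count.
        self.refine = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(mid, mid, 3, 1, 1, groups=mid),
                          nn.Conv2d(mid, mid, 1))
            for _ in range(4)])
        self.predict = nn.Conv2d(mid, 1, 1)  # depth maps are single-channel

    def forward(self, feats, ref_size):
        # `feats` is [c1, c2, c3, c4, c5d]; `ref_size` is the spatial size of c3.
        resized = [F.interpolate(conv(f), size=ref_size, mode='bilinear',
                                 align_corners=False)
                   for conv, f in zip(self.reduce, feats)]
        x = self.squeeze(torch.cat(resized, dim=1))  # stack along channels, then 1x1 conv
        x = self.refine(x)
        return torch.sigmoid(self.predict(x))        # recovered depth map Dr in (0, 1)


def ssim_loss(dr, dg, window=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """Recovered-depth loss 1 - SSIM(Dr, Dg) with local average-pooled statistics."""
    pad = window // 2
    mu_x = F.avg_pool2d(dr, window, 1, pad)
    mu_y = F.avg_pool2d(dg, window, 1, pad)
    sigma_x = F.avg_pool2d(dr * dr, window, 1, pad) - mu_x ** 2
    sigma_y = F.avg_pool2d(dg * dg, window, 1, pad) - mu_y ** 2
    sigma_xy = F.avg_pool2d(dr * dg, window, 1, pad) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    return (1 - ssim).mean()
```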
e. Fuse the c1, c2, c3, c4 and c5d features obtained in steps b and c pairwise to construct the top-down network, predict the salient object from the last fused feature, compare the predicted salient object with the human-annotated saliency map to calculate the predicted saliency loss, perform gradient back-propagation together with the loss obtained in step d, and update the parameters of the whole network model.
The c1, c2, c3, c4 and c5d features are input into the top-down network. The structure of this network may be the structure in fig. 1, but other similar structures may also be chosen. Note that the order of feature input is c5d, c4, c3, c2, c1 rather than c1, c2, c3, c4, c5d, because the top-down network must fuse the higher-level features first and the lower-level features afterwards.
The top-down network has 5 phases:
the first stage takes only the c5d feature as input and outputs an intermediate feature;
the second stage takes as input the intermediate feature output by the first stage, up-sampled and superimposed with the c4 feature along the channels;
the third stage takes as input the intermediate feature output by the second stage, up-sampled and superimposed with the c3 feature along the channels;
the fourth stage takes as input the intermediate feature output by the third stage, up-sampled and superimposed with the c2 feature along the channels;
the fifth stage takes as input the intermediate feature output by the fourth stage, up-sampled and superimposed with the c1 feature along the channels, and outputs the final feature.
Apart from its input and output parts, each stage has a feature fusion module, which may be a convolution with kernel size 3x3, stride 1, padding 1 and a number of output channels equal to the number of input channels, or another similar module.
The final feature is input into a prediction convolution with kernel size 1x1, and the saliency map predicted by the network is obtained through a sigmoid operation; this map is compared with the human-annotated saliency map, and a commonly used loss function such as binary cross-entropy is used to calculate the saliency loss.
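The five-stage top-down network and the prediction convolution described above might be sketched as follows. The channel widths again follow the earlier MobileNetV1 sketch, and reducing each fused feature to the width of the next skip feature is a simplification made for readability rather than something specified in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopDownDecoder(nn.Module):
    """Five-stage top-down network followed by the 1x1 prediction convolution."""

    def __init__(self, channels=(1024, 512, 256, 128, 64)):
        super().__init__()
        def fuse(cin, cout):
            # 3x3 convolution fusion block; reducing to `cout` channels is a
            # simplification of "output channels equal to input channels".
            return nn.Sequential(nn.Conv2d(cin, cout, 3, 1, 1), nn.ReLU(inplace=True))
        self.stage1 = fuse(channels[0], channels[1])      # input: c5d only
        self.stage2 = fuse(channels[1] * 2, channels[2])  # up(previous) + c4
        self.stage3 = fuse(channels[2] * 2, channels[3])  # up(previous) + c3
        self.stage4 = fuse(channels[3] * 2, channels[4])  # up(previous) + c2
        self.stage5 = fuse(channels[4] * 2, channels[4])  # up(previous) + c1
        self.predict = nn.Conv2d(channels[4], 1, 1)       # 1x1 prediction convolution

    @staticmethod
    def _merge(top, skip):
        # Upsample the higher-level feature and superimpose it with the skip
        # feature along the channel dimension.
        top = F.interpolate(top, size=skip.shape[2:], mode='bilinear',
                            align_corners=False)
        return torch.cat([top, skip], dim=1)

    def forward(self, c5d, skips):
        c4, c3, c2, c1 = skips
        x = self.stage1(c5d)
        x = self.stage2(self._merge(x, c4))
        x = self.stage3(self._merge(x, c3))
        x = self.stage4(self._merge(x, c2))
        x = self.stage5(self._merge(x, c1))
        return torch.sigmoid(self.predict(x))  # predicted saliency map
```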
f. Input different depth images and repeat steps b to e, thereby iteratively training the network model and updating its parameters. After the network model has been trained, the implicit depth information recovery module is removed.
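One training iteration of steps b to e can be summarized as in the sketch below, under the assumptions of the previous sketches: the model wiring, the ImplicitDepthRecovery head and the ssim_loss function are the hypothetical ones defined above, and the unweighted sum of the two losses is also an assumption.

```python
import torch.nn.functional as F


def train_step(model, idr, optimizer, image, depth, gt_saliency):
    # Bottom-up feature extraction (step b) and cross-modal fusion (step c).
    c1, c2, c3, c4, c5 = model.rgb_backbone(image)
    d5 = model.depth_backbone(depth)[-1]
    c5d = model.fusion(c5, d5)

    # Top-down prediction and saliency loss (step e).
    pred = model.decoder(c5d, [c4, c3, c2, c1])
    pred = F.interpolate(pred, size=gt_saliency.shape[2:], mode='bilinear',
                         align_corners=False)
    saliency_loss = F.binary_cross_entropy(pred, gt_saliency)

    # Implicit depth recovery branch and its loss (step d); training only,
    # removed once training is finished.
    dr = idr([c1, c2, c3, c4, c5d], ref_size=c3.shape[2:])
    dg = F.interpolate(depth, size=dr.shape[2:], mode='bilinear',
                       align_corners=False)
    depth_loss = ssim_loss(dr, dg)

    loss = saliency_loss + depth_loss
    optimizer.zero_grad()
    loss.backward()          # gradient back-propagation over the whole model
    optimizer.step()
    return loss.item()
```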
It should be noted that the above steps are only preferred embodiments of the present application, and it should be noted that, for those skilled in the art, many modifications and substitutions can be made without departing from the technical principle of the present application, and these modifications and substitutions should also be regarded as the protection scope of the present application.
Figs. 4(a)-4(k) compare the method of the present application with other well-known methods. Fig. 4(a) is the original natural image; fig. 4(b) is the human-annotated image; fig. 4(c) is the result obtained by the present application. It can be seen that the method of the application detects the salient object well, whereas other well-known methods may fail to detect the whole salient object because of unreliable depth information or complex foreground and background textures, which reflects the accuracy advantage of the present method relative to other methods. In addition, because the method is built on a mobile network, it avoids the difficulty that other methods, built on heavy networks, have in being applied to mobile devices.
Example two
The embodiment provides an image salient object detection system based on implicit depth information recovery;
an image salient object detection system based on implicit depth information recovery, comprising:
an acquisition module configured to: acquiring a target image and image depth information corresponding to the target image;
a detection module configured to: simultaneously inputting the target image and the image depth information corresponding to the target image into a trained neural network model based on implicit depth information recovery; performing, with the neural network model based on implicit depth information recovery, feature extraction on the target image and feature extraction on the image depth information respectively; performing cross-modal feature fusion on the features obtained by the two feature extractions; performing feature fusion on the features obtained by the cross-modal feature fusion and the features obtained by feature extraction on the target image to obtain final fusion features; and predicting from the final fusion features to obtain a predicted salient object image.
It should be noted here that the above-mentioned acquisition module and detection module correspond to steps S101 to S102 of the first embodiment, and these modules are the same as the corresponding steps in terms of implemented examples and application scenarios, but are not limited to the disclosure of the first embodiment. It should be noted that the modules described above, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
EXAMPLE III
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment, the processor may be a central processing unit CPU, and the processor may also be other general purpose processors, digital signal processors DSP, application specific integrated circuits ASIC, off-the-shelf programmable gate arrays FPGA or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method in the first embodiment may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (9)

1. The image salient object detection method based on implicit depth information recovery is characterized by comprising the following steps:
acquiring a target image and image depth information corresponding to the target image;
simultaneously inputting the target image and the image depth information corresponding to the target image into a trained neural network model based on implicit depth information recovery; respectively realizing feature extraction on a target image and feature extraction on image depth information based on a neural network model for recovering implicit depth information; performing cross-modal feature fusion on features obtained by extracting the two features respectively; performing feature fusion on the features obtained by the cross-modal feature fusion and the features obtained by performing feature extraction on the target image based on the neural network model recovered by the implicit depth information to obtain final fusion features; predicting the final fusion characteristics to obtain a predicted salient object image;
the neural network model based on implicit depth information recovery, the network structure includes:
a first bottom-up network and a second bottom-up network connected in parallel;
the internal structures of the first bottom-up network and the second bottom-up network are the same;
the first bottom-up network comprising: the first base module, the second base module, the third base module, the fourth base module and the fifth base module are connected in sequence;
the output end of the first bottom-up network and the output end of the second bottom-up network are both connected with the input end of the cross-modal characteristic fusion module;
the output end of the cross-modal characteristic fusion module is respectively connected with the input end of the implicit depth recovery module and the input end of the top-down network;
the output ends of a first basic module, a second basic module, a third basic module and a fourth basic module of the first bottom-up network are all connected with the implicit depth recovery module; and the number of the first and second electrodes,
the output ends of the first basic module, the second basic module, the third basic module and the fourth basic module of the first bottom-up network are connected with the corresponding feature fusion units in the top-down network.
2. The image salient object detection method based on implicit depth information recovery as claimed in claim 1,
a first base module comprising: a first convolution layer, a first depth separable convolution layer and a second convolution layer connected in sequence;
a second base module comprising: a second, third and fourth depth-separable convolutional layer connected in sequence;
a third base module comprising: a fourth, fifth and sixth depth-separable convolutional layer connected in sequence;
a fourth base module comprising: a sixth, seventh and eighth depth-separable convolutional layer connected in sequence;
a fifth base module comprising: an eighth depth-separable convolutional layer, a ninth depth-separable convolutional layer, and a tenth convolutional layer connected in this order.
3. The image salient object detection method based on implicit depth information recovery as claimed in claim 1, wherein the cross-modal feature fusion module comprises:
the first multiplier, the depth separable convolution layer DwConv, the global average pooling layer GAP, the first full-connection layer, the second full-connection layer and the second multiplier are sequentially connected in series;
the input end of the first multiplier is respectively connected with the output end of the first bottom-up network and the output end of the second bottom-up network;
wherein the output terminal of the depth separable convolutional layer DwConv is connected with the input terminal of the second multiplier;
and the output end of the second multiplier is used as the output end of the cross-modal characteristic fusion module.
4. The image salient object detection method based on implicit depth information recovery as claimed in claim 1, wherein the top-down network comprises:
the first feature fusion unit, the second feature fusion unit, the third feature fusion unit, the fourth feature fusion unit and the fifth feature fusion unit are connected in sequence;
the input end of the first feature fusion unit is connected with the output end of the cross-modal feature fusion module;
the input end of the second feature fusion unit is also connected with the output end of a fourth basic module of the first bottom-up network;
the input end of the third feature fusion unit is also connected with the output end of a third basic module of the first bottom-up network;
the input end of the fourth feature fusion unit is also connected with the output end of the second basic module of the first bottom-up network;
the input end of the fifth feature fusion unit is also connected with the output end of the first basic module of the first bottom-up network;
the output end of the fifth feature fusion unit is connected with the input end of the convolution layer, the output end of the convolution layer is connected with the input end of the sigmoid function layer, and the output end of the sigmoid function layer is used for outputting the predicted significance map.
5. The image salient object detection method based on implicit depth information recovery as claimed in claim 1, wherein the implicit depth recovery module comprises:
a convolutional layer C1, a convolutional layer C2 and a convolutional layer C3 arranged in parallel;
the input values of convolutional layer C1 are the output value of a first basic module of the first bottom-up network and the output value of a second basic module of the first bottom-up network;
the input value of convolutional layer C2 is the output value of the third basic module of the first bottom-up network;
the input value of the convolutional layer C3 is the output value of the fourth basic module of the first bottom-up network and the output value of the cross-modal feature fusion module;
stacking the output characteristics of the convolutional layer C1, the convolutional layer C2 and the convolutional layer C3 according to the channels to obtain stacking characteristics;
inputting the stacking feature into the input end of the convolutional layer C4;
the output terminal of convolutional layer C4 is connected to the first depth-separable convolutional layer DwConv,
the second depth-separable convolutional layer DwConv is connected with the third depth-separable convolutional layer DwConv;
the third depth-separable convolutional layer DwConv is connected with the fourth depth-separable convolutional layer DwConv;
the fourth depth-separable convolutional layer DwConv is connected to an input terminal of convolutional layer C5;
the output of convolutional layer C5 outputs the recovered depth map.
6. The method for detecting the image salient object based on the implicit depth information recovery as claimed in claim 1, wherein the training step of the trained neural network model based on the implicit depth information recovery comprises:
constructing a neural network model based on implicit depth information recovery;
constructing a training set, wherein the training set is an original image of a known salient object and depth information of the original image;
inputting the training set into a neural network model based on implicit depth information recovery, and training the neural network model based on implicit depth information recovery;
the implicit depth recovery module outputs a recovered depth map;
calculating a loss function value based on the restored depth map and the depth information of the original image;
and when the loss function value reaches the minimum value, stopping training to obtain the trained neural network model based on implicit depth information recovery.
7. The image salient object detection system based on implicit depth information recovery is characterized by comprising the following steps:
an acquisition module configured to: acquiring a target image and image depth information corresponding to the target image;
a detection module configured to: simultaneously inputting the target image and the image depth information corresponding to the target image into a trained neural network model based on implicit depth information recovery; respectively realizing feature extraction on a target image and feature extraction on image depth information based on a neural network model for recovering implicit depth information; performing cross-modal feature fusion on features obtained by extracting the two features respectively; performing feature fusion on the features obtained by the cross-modal feature fusion and the features obtained by performing feature extraction on the target image based on the neural network model recovered by the implicit depth information to obtain final fusion features; predicting the final fusion characteristics to obtain a predicted salient object image;
the neural network model based on implicit depth information recovery, the network structure includes:
a first bottom-up network and a second bottom-up network connected in parallel;
the internal structures of the first bottom-up network and the second bottom-up network are the same;
the first bottom-up network comprising: the first base module, the second base module, the third base module, the fourth base module and the fifth base module are connected in sequence;
the output end of the first bottom-up network and the output end of the second bottom-up network are both connected with the input end of the cross-modal characteristic fusion module;
the output end of the cross-modal characteristic fusion module is respectively connected with the input end of the implicit depth recovery module and the input end of the top-down network;
the output ends of a first basic module, a second basic module, a third basic module and a fourth basic module of the first bottom-up network are all connected with the implicit depth recovery module; and the number of the first and second electrodes,
the output ends of the first basic module, the second basic module, the third basic module and the fourth basic module of the first bottom-up network are connected with the corresponding feature fusion units in the top-down network.
8. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of the preceding claims 1-6.
9. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 6.
CN202011500709.XA 2020-12-17 2020-12-17 Image salient object detection method and system based on implicit depth information recovery Active CN112528899B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011500709.XA CN112528899B (en) 2020-12-17 2020-12-17 Image salient object detection method and system based on implicit depth information recovery

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011500709.XA CN112528899B (en) 2020-12-17 2020-12-17 Image salient object detection method and system based on implicit depth information recovery

Publications (2)

Publication Number Publication Date
CN112528899A CN112528899A (en) 2021-03-19
CN112528899B true CN112528899B (en) 2022-04-12

Family

ID=75001393

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011500709.XA Active CN112528899B (en) 2020-12-17 2020-12-17 Image salient object detection method and system based on implicit depth information recovery

Country Status (1)

Country Link
CN (1) CN112528899B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822212B (en) * 2021-09-27 2024-01-05 东莞理工学院 Embedded object recognition method and device

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108898145A (en) * 2018-06-15 2018-11-27 西南交通大学 A kind of image well-marked target detection method of combination deep learning
CN109766918A (en) * 2018-12-18 2019-05-17 南开大学 Conspicuousness object detecting method based on the fusion of multi-level contextual information
CN110175986A (en) * 2019-04-23 2019-08-27 浙江科技学院 A kind of stereo-picture vision significance detection method based on convolutional neural networks
CN110555434A (en) * 2019-09-03 2019-12-10 浙江科技学院 method for detecting visual saliency of three-dimensional image through local contrast and global guidance
CN110705566A (en) * 2019-09-11 2020-01-17 浙江科技学院 Multi-mode fusion significance detection method based on spatial pyramid pool
CN110929736A (en) * 2019-11-12 2020-03-27 浙江科技学院 Multi-feature cascade RGB-D significance target detection method
CN111242003A (en) * 2020-01-10 2020-06-05 南开大学 Video salient object detection method based on multi-scale constrained self-attention mechanism
CN111445432A (en) * 2019-10-14 2020-07-24 浙江科技学院 Image significance detection method based on information fusion convolutional neural network
CN111598108A (en) * 2020-04-22 2020-08-28 南开大学 Rapid salient object detection method of multi-scale neural network based on three-dimensional attention control
CN111723822A (en) * 2020-06-20 2020-09-29 福州大学 RGBD image significance detection method and system based on multi-level fusion
CN111768375A (en) * 2020-06-24 2020-10-13 海南大学 Asymmetric GM multi-mode fusion significance detection method and system based on CWAM
CN111832592A (en) * 2019-04-20 2020-10-27 南开大学 RGBD significance detection method and related device
CN111931850A (en) * 2020-08-11 2020-11-13 浙江科技学院 Cross-modal fusion significance detection method based on feature refinement

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108898145A (en) * 2018-06-15 2018-11-27 西南交通大学 A kind of image well-marked target detection method of combination deep learning
CN109766918A (en) * 2018-12-18 2019-05-17 南开大学 Conspicuousness object detecting method based on the fusion of multi-level contextual information
CN111832592A (en) * 2019-04-20 2020-10-27 南开大学 RGBD significance detection method and related device
CN110175986A (en) * 2019-04-23 2019-08-27 浙江科技学院 A kind of stereo-picture vision significance detection method based on convolutional neural networks
CN110555434A (en) * 2019-09-03 2019-12-10 浙江科技学院 method for detecting visual saliency of three-dimensional image through local contrast and global guidance
CN110705566A (en) * 2019-09-11 2020-01-17 浙江科技学院 Multi-mode fusion significance detection method based on spatial pyramid pool
CN111445432A (en) * 2019-10-14 2020-07-24 浙江科技学院 Image significance detection method based on information fusion convolutional neural network
CN110929736A (en) * 2019-11-12 2020-03-27 浙江科技学院 Multi-feature cascade RGB-D significance target detection method
CN111242003A (en) * 2020-01-10 2020-06-05 南开大学 Video salient object detection method based on multi-scale constrained self-attention mechanism
CN111598108A (en) * 2020-04-22 2020-08-28 南开大学 Rapid salient object detection method of multi-scale neural network based on three-dimensional attention control
CN111723822A (en) * 2020-06-20 2020-09-29 福州大学 RGBD image significance detection method and system based on multi-level fusion
CN111768375A (en) * 2020-06-24 2020-10-13 海南大学 Asymmetric GM multi-mode fusion significance detection method and system based on CWAM
CN111931850A (en) * 2020-08-11 2020-11-13 浙江科技学院 Cross-modal fusion significance detection method based on feature refinement

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Convolutional Neural Network for Saliency Detection in Images; Hooman Misaghi et al.; 2018 6th Iranian Joint Congress on Fuzzy and Intelligent Systems (CFIS); 2018-12-31; pp. 17-19 *
Salient Object Detection: A Survey; Ali Borji et al.; arXiv:1411.5878v6 [cs.CV]; 2019-06-30; Vol. 5 (No. 2); pp. 117-150 *
Research on Image Saliency Detection Algorithms Based on Multi-Feature Fusion and Depth Prior Information; 付江涛; China Masters' Theses Full-text Database, Information Science and Technology; 2020-02-15 (No. 2); I138-1535 *
Design and Implementation of a Saliency Detection System Based on Deep Learning and Image Segmentation; 胥杏培; China Masters' Theses Full-text Database, Information Science and Technology; 2019-02-15 (No. 2); I138-1518 *
Superpixel Segmentation and Saliency Detection Based on Depth Modulation; 熊艳 et al.; Information Technology; 2014-12-31 (No. 5); pp. 15-17 *
Saliency Detection Based on Category Priors and Deep Neural Network Features; 邓凝旖 et al.; Computer Engineering; 2017-06-30; Vol. 43 (No. 6); pp. 225-229 *

Also Published As

Publication number Publication date
CN112528899A (en) 2021-03-19

Similar Documents

Publication Publication Date Title
CN110929736B (en) Multi-feature cascading RGB-D significance target detection method
CN111480169B (en) Method, system and device for pattern recognition
CN112541904B (en) Unsupervised remote sensing image change detection method, storage medium and computing device
CN112541503A (en) Real-time semantic segmentation method based on context attention mechanism and information fusion
WO2017079521A1 (en) Cascaded neural network with scale dependent pooling for object detection
CN110097050B (en) Pedestrian detection method, device, computer equipment and storage medium
EP2593907B1 (en) Method for detecting a target in stereoscopic images by learning and statistical classification on the basis of a probability law
US11615612B2 (en) Systems and methods for image feature extraction
CN113901900A (en) Unsupervised change detection method and system for homologous or heterologous remote sensing image
CN111160225B (en) Human body analysis method and device based on deep learning
CN113822209A (en) Hyperspectral image recognition method and device, electronic equipment and readable storage medium
Bhattacharya et al. Interleaved deep artifacts-aware attention mechanism for concrete structural defect classification
CN112348116A (en) Target detection method and device using spatial context and computer equipment
CN114266894A (en) Image segmentation method and device, electronic equipment and storage medium
CN112529862A (en) Significance image detection method for interactive cycle characteristic remodeling
CN112528899B (en) Image salient object detection method and system based on implicit depth information recovery
CN117636298A (en) Vehicle re-identification method, system and storage medium based on multi-scale feature learning
CN116229406B (en) Lane line detection method, system, electronic equipment and storage medium
CN102800092B (en) Point-to-surface image significance detection
Elashry et al. Feature matching enhancement using the graph neural network (gnn-ransac)
CN111414823A (en) Human body feature point detection method and device, electronic equipment and storage medium
CN116977265A (en) Training method and device for defect detection model, computer equipment and storage medium
CN111062275A (en) Multi-level supervision crowd counting method, device, medium and electronic equipment
CN112801045B (en) Text region detection method, electronic equipment and computer storage medium
İnik et al. Gender classification with a novel convolutional neural network (CNN) model and comparison with other machine learning and deep learning CNN models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant