CN111967477A - RGB-D image saliency target detection method, device, equipment and storage medium

RGB-D image saliency target detection method, device, equipment and storage medium

Info

Publication number
CN111967477A
Authority
CN
China
Prior art keywords
modal
rgb
fusion
features
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010637797.1A
Other languages
Chinese (zh)
Inventor
高伟
廖桂标
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN202010637797.1A
Publication of CN111967477A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/90 Determination of colour characteristics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an RGB-D image saliency target detection method, device and equipment and a computer-readable storage medium. Instead of directly using and fusing the layered RGB modal features and the layered depth modal features, the method applies an attention mechanism, which avoids introducing useless or redundant information from the modal features and improves salient object detection performance. A cross-modal guidance strategy is designed to perform multi-stage cross-modal feature fusion, so that the effectiveness and complementarity among cross-modal features can be fully exploited, reducing the influence of poor-quality depth images while forming a more accurate cross-modal salient feature expression. A bidirectional fusion structure is designed to perform multi-scale fusion of the multi-level cross-modal fusion features, so that high-level and low-level features across levels can be effectively aggregated, further improving salient object detection performance and the robustness of the detection algorithm.

Description

RGB-D image saliency target detection method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of image processing, in particular to a method, a device and equipment for detecting a salient object of an RGB-D image and a computer-readable storage medium.
Background
With the rapid development of computer image processing technology, two main approaches to salient object detection in RGB-D images have emerged. The first is based on hand-crafted feature extraction and prior knowledge. Such methods generally compute priors in the RGB image and the depth image, such as background priors, center priors and depth priors, compare features such as color, brightness, texture and depth across regions, and fuse them by multiplication or addition together with certain post-processing techniques to detect the salient object. The second is based on deep learning. These approaches can generally be divided into early, middle and late fusion. In early fusion, features of the RGB image and the depth image are extracted by a single shallow network, fused, and then fed into a subsequent single-stream network to learn high-level features. Middle fusion usually realizes cross-modal fusion by extracting hierarchical features with a dual-stream network. Late fusion combines the deepest feature information to generate a fused saliency map. With the wide application of depth cameras and portable intelligent devices, saliency detection on RGB-D images is also increasingly used in scenes such as robot navigation. However, because depth map quality is easily affected by noise from sensor temperature, background illumination, and the distance and reflectivity of the observed object, the depth map may contain erroneous or missing regions even when existing RGB-D image salient object detection methods are used. Such unreliable depth information can mislead the salient object detection process for the whole RGB-D image, so the detection capability of existing RGB-D image salient object detection methods is limited.
Disclosure of Invention
The invention mainly aims to provide a method for detecting a salient object of an RGB-D image, aiming at solving the technical problem that the detection capability of the existing method for detecting the salient object of the RGB-D image is limited.
In order to achieve the above object, the present invention provides a method for detecting a RGB-D image saliency target, wherein the RGB-D image includes an RGB image to be detected and a depth image registered with the RGB image, and the method for detecting a RGB-D image saliency target includes:
acquiring layered RGB modal characteristics of the RGB image and layered depth modal characteristics of the depth image;
performing multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance strategy to obtain multi-stage cross-modal fusion features of the layered RGB modal features and the layered depth modal features;
and carrying out two-way fusion on the multi-level cross-modal fusion features based on a two-way fusion structure to obtain layered significant fusion features with different scales, and carrying out multi-scale fusion on the layered significant fusion features to obtain significant target images corresponding to the RGB images and the depth images.
Optionally, the step of performing multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance policy to obtain multi-stage cross-modal fusion features of the layered RGB modal features and the layered depth modal features includes:
in the first stage, respectively using a spatial attention mechanism to perform feature screening on the layered RGB modal features and the layered depth modal features to obtain RGB significant feature responses of the layered RGB modal features and depth significant feature responses of the layered depth modal features;
based on a cross-modal guidance strategy, guiding the layered depth modal characteristics to perform characteristic re-screening by using the RGB significant characteristic response so as to obtain RGB depth significant characteristic response;
in the second stage, the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response are subjected to confrontation combination to obtain a plurality of confrontation features, and the plurality of confrontation features are fused into a multi-level cross-modal fusion feature of the layered RGB modal feature and the layered depth modal feature.
Optionally, the step of guiding the layered depth modal feature to perform feature re-screening by using the RGB significant feature response based on the cross-modal guidance policy to obtain an RGB depth significant feature response includes:
performing parallel asymmetric convolution on the hierarchical depth modal features to roughly locate a preliminary significant region in the hierarchical depth modal features;
performing fusion guidance on the preliminary salient region by using the RGB feature responses based on the cross-modal guidance strategy so as to locate a target salient region in the preliminary salient region;
and acquiring a feature weight corresponding to the layered depth modal feature of the positioned target significant region, and performing weighted fusion output on the layered depth modal feature of the positioned target significant region and the layered depth modal feature according to the feature weight to serve as the RGB depth significant feature response.
Optionally, in the second stage, the confrontation combining the RGB significant feature response, the depth significant feature response, and the RGB depth significant feature response to obtain a plurality of confrontation features, and fusing the plurality of confrontation features into a multi-level cross-modal fusion feature of the layered RGB modal feature and the layered depth modal feature includes:
in the second stage, performing countermeasure combination on the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response by using multiplication to obtain a plurality of countermeasure features;
and respectively distributing different weights to the plurality of antagonistic features by using parallel pooling operation, and performing convolution and addition fusion operation on the plurality of antagonistic features distributed with different weights to obtain the multilevel trans-modal fusion feature.
Optionally, the step of performing bidirectional fusion on the multi-level cross-modal fusion features based on a bidirectional fusion structure to obtain hierarchical significant fusion features of different scales, and performing multi-scale fusion on the hierarchical significant fusion features to obtain significant target images corresponding to the RGB image and the depth image includes:
performing high-low level fusion on the cross-modal fusion features of each level in the multi-level cross-modal fusion features by a top-down path and a bottom-up path to obtain the layered significant fusion features;
performing upsampling according to the size of the RGB image to obtain a plurality of target sub-images corresponding to the layered significant fusion features;
and splicing the multiple target sub-images and performing convolution channel number conversion to obtain the salient target image.
Optionally, the step of performing high-low level fusion on the cross-modal fusion features of each level in the multi-level cross-modal fusion features in a top-down and bottom-up bidirectional path to obtain the hierarchical significant fusion feature includes:
in a feature fusion stage, performing first convolution operation on cross-modal fusion features of each level in the multi-level cross-modal fusion features to obtain a multi-level first convolution result, and fusing the first convolution result of each level in the multi-level first convolution result with a first convolution result of a corresponding high level respectively in a top-down path to obtain forward fusion results corresponding to each level and different scales;
in the multi-scale fusion stage, performing second convolution operation on the cross-modal fusion features of each level to obtain a multi-level second convolution result, and fusing the second convolution result of each level in the multi-level second convolution result with a second convolution result corresponding to a low level and a forward fusion result corresponding to the same level by a path from bottom to top to obtain reverse fusion results corresponding to each level and different scales;
and passing the reverse fusion results corresponding to each level and different scales through a preset module connected with a channel attention mechanism and a space attention mechanism in parallel to obtain the layered significant fusion features.
Optionally, the step of acquiring layered RGB modal characteristics of the RGB image and layered depth modal characteristics of the depth image includes:
the method comprises the steps of obtaining an RGB image to be detected and a depth image in registration with the RGB image, inputting the RGB image and the depth image into a preset double-current deep convolutional network, and extracting layered RGB modal characteristics of the RGB image and layered depth modal characteristics of the depth image in a layered mode.
In order to achieve the above object, the present invention also provides an RGB-D image saliency target detection apparatus comprising:
the modal characteristic acquisition module is used for acquiring the layered RGB modal characteristics of the RGB image and the layered depth modal characteristics of the depth image;
a fusion feature obtaining module, configured to perform multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance policy, to obtain multi-stage cross-modal fusion features of the layered RGB modal features and the layered depth modal features;
and the target image acquisition module is used for carrying out two-way fusion on the multi-level cross-modal fusion features based on a two-way fusion structure to obtain different scales of layered significant fusion features, and carrying out multi-scale fusion on the layered significant fusion features to obtain significant target images corresponding to the RGB images and the depth images.
Optionally, the fused feature obtaining module includes:
a depth response obtaining unit, configured to, in a first stage, perform feature screening on the layered RGB modal features and the layered depth modal features respectively using a spatial attention mechanism, so as to obtain RGB significant feature responses of the layered RGB modal features and depth significant feature responses of the layered depth modal features;
the screening guidance unit is used for guiding the layered depth modal characteristics to perform characteristic re-screening by utilizing the RGB significant characteristic response based on a cross-modal guidance strategy so as to obtain RGB depth significant characteristic response;
and the confrontation combination unit is used for carrying out confrontation combination on the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response at the second stage to obtain a plurality of confrontation features, and fusing the plurality of confrontation features into a multi-level cross-modal fusion feature of the layered RGB modal feature and the layered depth modal feature.
Optionally, the screening guidance unit is further configured to:
performing parallel asymmetric convolution on the hierarchical depth modal features to roughly locate a preliminary significant region in the hierarchical depth modal features;
performing fusion guidance on the preliminary salient region by using the RGB feature responses based on the cross-modal guidance strategy so as to locate a target salient region in the preliminary salient region;
and acquiring a feature weight corresponding to the layered depth modal feature of the positioned target significant region, and performing weighted fusion output on the layered depth modal feature of the positioned target significant region and the layered depth modal feature according to the feature weight to serve as the RGB depth significant feature response.
Optionally, the confrontation combination unit is further configured to:
in the second stage, performing countermeasure combination on the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response by using multiplication to obtain a plurality of countermeasure features;
and respectively distributing different weights to the plurality of antagonistic features by using parallel pooling operation, and performing convolution and addition fusion operation on the plurality of antagonistic features distributed with different weights to obtain the multilevel trans-modal fusion feature.
Optionally, the target image acquiring module further includes:
the bidirectional fusion unit is used for performing high-low level fusion on the cross-modal fusion features of each level in the multi-level cross-modal fusion features through a top-down and bottom-up bidirectional path to obtain the hierarchical significant fusion features;
the target sampling unit is used for performing up-sampling according to the size of the RGB image to obtain a plurality of target sub-images corresponding to the layered significant fusion features;
and the splicing convolution unit is used for splicing the target sub-images and performing convolution channel number conversion to obtain the salient target image.
Optionally, the bidirectional fusion unit is further configured to:
in a feature fusion stage, performing first convolution operation on cross-modal fusion features of each level in the multi-level cross-modal fusion features to obtain a multi-level first convolution result, and fusing the first convolution result of each level in the multi-level first convolution result with a first convolution result of a corresponding high level respectively in a top-down path to obtain forward fusion results corresponding to each level and different scales;
in the multi-scale fusion stage, performing second convolution operation on the cross-modal fusion features of each level to obtain a multi-level second convolution result, and fusing the second convolution result of each level in the multi-level second convolution result with a second convolution result corresponding to a low level and a forward fusion result corresponding to the same level by a path from bottom to top to obtain reverse fusion results corresponding to each level and different scales;
and passing the reverse fusion results corresponding to each level and different scales through a preset module connected with a channel attention mechanism and a space attention mechanism in parallel to obtain the layered significant fusion features.
Optionally, the modal feature acquisition module further includes:
the device comprises a layering extraction unit, a depth image processing unit and a depth image processing unit, wherein the layering extraction unit is used for acquiring an RGB image to be detected and a depth image in which the RGB image is registered, inputting the RGB image and the depth image into a preset double-current deep convolutional network, and extracting layering RGB modal characteristics of the RGB image and layering depth modal characteristics of the depth image in a layering manner.
Further, to achieve the above object, the present invention also provides an RGB-D image saliency target detection apparatus comprising: a memory, a processor and an RGB-D image saliency object detection program stored on said memory and executable on said processor, said RGB-D image saliency object detection program when executed by said processor implementing the steps of the RGB-D image saliency object detection method as described above.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium having stored thereon an RGB-D image saliency object detection program that, when executed by a processor, implements the steps of the RGB-D image saliency object detection method as described above.
The invention provides a method, a device and equipment for detecting a salient object of an RGB-D image and a computer-readable storage medium. The RGB-D image saliency target detection method comprises: acquiring the layered RGB modal characteristics of an RGB image and the layered depth modal characteristics of a depth image; performing multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance strategy to obtain multi-level cross-modal fusion features; and performing bidirectional fusion of the multi-level cross-modal fusion features based on a bidirectional fusion structure to obtain layered salient fusion features of different scales, and performing multi-scale fusion of the layered salient fusion features to obtain the salient target image corresponding to the RGB image and the depth image. In this way, an attention mechanism is adopted instead of directly using and fusing the layered RGB modal features and the layered depth modal features, which avoids introducing useless or redundant information from the modal features and improves salient object detection performance; multi-stage cross-modal feature fusion with a designed cross-modal guidance strategy fully exploits the effectiveness and complementarity among cross-modal features, reducing the influence of poor-quality depth images while forming an accurate cross-modal salient feature expression; and a designed bidirectional fusion structure performs multi-scale fusion of the multi-level cross-modal fusion features, effectively aggregating high-level and low-level features across levels, further improving salient object detection performance and the robustness of the detection algorithm, and thereby solving the technical problem that the detection capability of existing RGB-D image salient object detection methods is limited.
Drawings
FIG. 1 is a schematic diagram of an RGB-D image saliency target detection device of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a RGB-D image saliency target detection method according to a first embodiment of the present invention;
FIG. 3 is a schematic diagram of an overall network structure in a first embodiment of the RGB-D image saliency target detection method of the present invention;
FIG. 4 is a diagram illustrating a multi-stage fusion strategy in a second embodiment of the RGB-D image saliency target detection method of the present invention;
FIG. 5 is a schematic diagram of a cross-modal guidance strategy in a second embodiment of the RGB-D image saliency target detection method of the present invention;
FIG. 6 is a schematic diagram illustrating details of a bi-directional multi-scale decoder according to a third embodiment of the RGB-D image saliency target detection method of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic structural diagram of an RGB-D image saliency target detection apparatus in a hardware operating environment according to an embodiment of the present invention.
The RGB-D image saliency target detection equipment provided by the embodiment of the invention can be a server or a PC (personal computer), or a terminal device such as a smartphone or a tablet computer.
As shown in fig. 1, the RGB-D image saliency target detection apparatus may include: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a memory device separate from the processor 1001 described above.
Those skilled in the art will appreciate that the terminal structure shown in FIG. 1 does not constitute a limitation of the RGB-D image saliency target detection device, which may include more or fewer components than shown, a combination of some components, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and an RGB-D image saliency object detection program.
In the terminal shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to call the RGB-D image saliency object detection program stored in the memory 1005 and perform the following operations:
acquiring layered RGB modal characteristics of the RGB image and layered depth modal characteristics of the depth image;
performing multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance strategy to obtain multi-stage cross-modal fusion features of the layered RGB modal features and the layered depth modal features;
and carrying out two-way fusion on the multi-level cross-modal fusion features based on a two-way fusion structure to obtain layered significant fusion features with different scales, and carrying out multi-scale fusion on the layered significant fusion features to obtain significant target images corresponding to the RGB images and the depth images.
Further, the step of performing multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance policy to obtain multi-stage cross-modal fusion features of the layered RGB modal features and the layered depth modal features includes:
in the first stage, respectively using a spatial attention mechanism to perform feature screening on the layered RGB modal features and the layered depth modal features to obtain RGB significant feature responses of the layered RGB modal features and depth significant feature responses of the layered depth modal features;
based on a cross-modal guidance strategy, guiding the layered depth modal characteristics to perform characteristic re-screening by using the RGB significant characteristic response so as to obtain RGB depth significant characteristic response;
in the second stage, the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response are subjected to confrontation combination to obtain a plurality of confrontation features, and the plurality of confrontation features are fused into a multi-level cross-modal fusion feature of the layered RGB modal feature and the layered depth modal feature.
Further, the step of guiding the layered depth modal feature to perform feature re-screening by using the RGB significant feature response based on the cross-modal guidance policy to obtain an RGB depth significant feature response includes:
performing parallel asymmetric convolution on the hierarchical depth modal features to roughly locate a preliminary significant region in the hierarchical depth modal features;
performing fusion guidance on the preliminary salient region by using the RGB feature responses based on the cross-modal guidance strategy so as to locate a target salient region in the preliminary salient region;
and acquiring a feature weight corresponding to the layered depth modal feature of the positioned target significant region, and performing weighted fusion output on the layered depth modal feature of the positioned target significant region and the layered depth modal feature according to the feature weight to serve as the RGB depth significant feature response.
Further, in the second stage, the confrontation combining the RGB significant feature response, the depth significant feature response, and the RGB depth significant feature response to obtain a plurality of confrontation features, and fusing the plurality of confrontation features into a multi-level cross-modal fusion feature of the layered RGB modal feature and the layered depth modal feature includes:
in the second stage, performing countermeasure combination on the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response by using multiplication to obtain a plurality of countermeasure features;
and respectively distributing different weights to the plurality of antagonistic features by using parallel pooling operation, and performing convolution and addition fusion operation on the plurality of antagonistic features distributed with different weights to obtain the multilevel trans-modal fusion feature.
Further, the step of performing bidirectional fusion on the multi-level cross-modal fusion features based on a bidirectional fusion structure to obtain hierarchical significant fusion features of different scales, and performing multi-scale fusion on the hierarchical significant fusion features to obtain significant target images corresponding to the RGB image and the depth image includes:
performing high-low level fusion on the cross-modal fusion features of each level in the multi-level cross-modal fusion features by a top-down path and a bottom-up path to obtain the layered significant fusion features;
performing upsampling according to the size of the RGB image to obtain a plurality of target sub-images corresponding to the layered significant fusion features;
and splicing the multiple target sub-images and performing convolution channel number conversion to obtain the salient target image.
Further, the step of performing high-low level fusion on the cross-modal fusion features of each level in the multi-level cross-modal fusion features through a top-down and bottom-up bidirectional path to obtain the hierarchical significant fusion feature includes:
in a feature fusion stage, performing first convolution operation on cross-modal fusion features of each level in the multi-level cross-modal fusion features to obtain a multi-level first convolution result, and fusing the first convolution result of each level in the multi-level first convolution result with a first convolution result of a corresponding high level respectively in a top-down path to obtain forward fusion results corresponding to each level and different scales;
in the multi-scale fusion stage, performing second convolution operation on the cross-modal fusion features of each level to obtain a multi-level second convolution result, and fusing the second convolution result of each level in the multi-level second convolution result with a second convolution result corresponding to a low level and a forward fusion result corresponding to the same level by a path from bottom to top to obtain reverse fusion results corresponding to each level and different scales;
and passing the reverse fusion results corresponding to each level and different scales through a preset module connected with a channel attention mechanism and a space attention mechanism in parallel to obtain the layered significant fusion features.
Further, the step of obtaining the layered RGB modal characteristics of the RGB image and the layered depth modal characteristics of the depth image includes:
the method comprises the steps of obtaining an RGB image to be detected and a depth image registered with the RGB image, inputting the RGB image and the depth image into a preset dual-stream deep convolutional network, and hierarchically extracting the layered RGB modal characteristics of the RGB image and the layered depth modal characteristics of the depth image.
Based on the hardware structure, the invention provides various embodiments of the RGB-D image saliency target detection method.
In order to solve the above problems, the invention provides an RGB-D image saliency target detection method that applies an attention mechanism instead of directly using and fusing the layered RGB modal features and the layered depth modal features, avoiding the introduction of useless or redundant information from the modal features and improving salient object detection performance; performs multi-stage cross-modal feature fusion with a designed cross-modal guidance strategy, fully exploiting the effectiveness and complementarity among cross-modal features and reducing the influence of poor-quality depth images while forming an accurate cross-modal salient feature expression; and performs multi-scale fusion of the multi-level cross-modal fusion features with a designed bidirectional fusion structure, effectively aggregating high-level and low-level features across levels and further improving the detection performance for salient objects and the robustness of the detection algorithm, thereby addressing the technical problem that the detection capability of existing RGB-D image salient object detection methods is limited.
Referring to fig. 2, fig. 2 is a schematic flowchart of a first embodiment of a RGB-D image saliency target detection method.
The first embodiment of the present invention provides a method for detecting a significant target of an RGB-D image, where the method for detecting a significant target of an RGB-D image is applied to an encoding end, and the method for detecting a significant target of an RGB-D image includes the following steps:
step S10, acquiring layered RGB modal characteristics of the RGB image and layered depth modal characteristics of the depth image;
In this embodiment, it should be noted that the RGB-D image saliency target detection method of the present invention is applicable to RGB-D images. An RGB-D image actually consists of two images: a normal RGB three-channel color image and a Depth image. The Depth image is similar to a grayscale image, except that each pixel value is the actual distance from the sensor to the object. The RGB image and the Depth image are usually registered, so their pixels correspond one to one. Hereinafter, depth images are uniformly referred to as Depth images.
For an RGB-D image whose salient object currently needs to be detected, the layered modal features of the RGB image and of the depth image are acquired separately. The layered modal features comprise modal features at a plurality of different levels: the layered RGB modal features are the layered modal features of the RGB image, and the layered depth modal features are the layered modal features of the depth image registered with the RGB image. The layered RGB modal features and layered depth modal features are generally extracted with a deep convolutional network. For example, the RGB image and the depth image may each be input into one stream of a dual-stream deep convolutional network such as Res2Net to obtain their respective layered modal features.
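The following PyTorch-style sketch illustrates this step under the assumption of a generic two-stream backbone; ResNet-50 stands in for the Res2Net backbone mentioned above, and the layer names, channel sizes and input resolution are placeholders rather than the patent's actual configuration:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamHierarchicalEncoder(nn.Module):
    """Illustrative two-stream encoder: one backbone per modality.
    ResNet-50 is used here only because it is readily available."""
    def __init__(self):
        super().__init__()
        self.rgb_backbone = models.resnet50()
        self.depth_backbone = models.resnet50()

    def _hierarchy(self, backbone, x):
        x = backbone.conv1(x); x = backbone.bn1(x); x = backbone.relu(x)
        x = backbone.maxpool(x)
        f2 = backbone.layer1(x)   # hierarchical side outputs f_2 .. f_5
        f3 = backbone.layer2(f2)
        f4 = backbone.layer3(f3)
        f5 = backbone.layer4(f4)
        return [f2, f3, f4, f5]

    def forward(self, rgb, depth):
        # the single-channel depth map is simply replicated to 3 channels here (an assumption)
        fr = self._hierarchy(self.rgb_backbone, rgb)                          # layered RGB modal features
        fd = self._hierarchy(self.depth_backbone, depth.repeat(1, 3, 1, 1))   # layered depth modal features
        return fr, fd

# usage sketch
enc = TwoStreamHierarchicalEncoder()
fr, fd = enc(torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224))
```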
Step S20, based on an attention mechanism and a cross-modal guidance strategy, performing multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features to obtain multi-stage cross-modal fusion features of the layered RGB modal features and the layered depth modal features;
In this embodiment, the invention designs a multi-stage fusion strategy, implemented as multi-stage fusion modules (MFM), in which an attention mechanism is combined with a cross-modal guidance strategy. First, an attention mechanism is adopted to highlight salient objects in the layered RGB modal features and the layered depth modal features and to suppress cluttered background information. At the same time, a cross-modal guidance strategy uses the enhanced RGB modal features to guide the poorer-quality depth modal features, so that useful features are screened out of the depth modal features rather than simply discarding a low-quality depth image or using it directly while ignoring its quality. Attention-based screening and cross-modal guided fusion are performed on the RGB modal features and depth modal features of each level to obtain the RGB salient feature responses, the depth salient feature responses, and the RGB-depth salient feature responses obtained after guidance, and these responses are combined adversarially to obtain a plurality of confrontation features. Finally, the plurality of confrontation features are fused to obtain the multi-level cross-modal fusion features.
Step S30, carrying out two-way fusion on the multi-level cross-modal fusion features based on a two-way fusion structure to obtain layered significant fusion features with different scales, and carrying out multi-scale fusion on the layered significant fusion features to obtain significant target images corresponding to the RGB images and the depth images.
In the present embodiment, "bidirectional" in the bidirectional fusion structure means a bidirectional path from top to bottom and from bottom to top. Different levels in the deep network contain different characteristic responses to salient objects. In particular, deep features of the network provide differentiated and integrated semantic information, and lower layers of the network contain more local detail information for salient objects. In order to effectively position and detect global and local information of a salient object, the invention designs a multi-scale fusion mode based on a bidirectional structure, and aggregates multi-scale and comprehensive feature information of each level by a top-down path and a bottom-up path to obtain a layered salient fusion feature. And then sampling, splicing and other operations are carried out on the layered significant fusion characteristics, so that a high-quality significant target image corresponding to the original RGB-D image can be obtained.
As shown in fig. 3, fig. 3 is a schematic diagram of the overall network structure of the present invention. RGB refers to the RGB image to be detected, and Depth refers to the Depth image registered with this RGB image. Conv1 to Conv5 each represent a convolution operation. fr2, fr3, fr4 and fr5 represent the layered RGB image features, and fd2, fd3, fd4 and fd5 represent the layered depth image features. The layered RGB image features fr2, fr3, fr4 and fr5 and the layered depth image features fd2, fd3, fd4 and fd5 are input, level by level, into the corresponding MFM modules to obtain the multi-level cross-modal fusion features. Then the cross-modal fusion features of each level are input into the feature fusion module (FF) and the multi-scale fusion module (MF) of the bi-directional multi-scale decoder (BMD), and high-level and low-level features are fused along the top-down and bottom-up paths to obtain the layered salient fusion features. Finally, the layered salient fusion features are upsampled (UP, bilinear interpolation) to obtain four sub-images P2, P3, P4 and P5, which are spliced and converted in channel number to obtain the final salient target image (saliency map).
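To make the data flow of FIG. 3 easier to follow, the sketch below wires the stages together with trivial stand-in modules; the stand-in MFM and decoder internals, channel counts and level count are illustrative assumptions only and do not reproduce the modules described later:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StandInMFM(nn.Module):
    """Placeholder for a multi-stage fusion module: concatenates the paired
    RGB / depth features of one level and reduces the channel count."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, fr_i, fd_i):
        return self.reduce(torch.cat([fr_i, fd_i], dim=1))   # cm_i

class StandInDecoder(nn.Module):
    """Placeholder for the bi-directional multi-scale decoder: upsamples every
    level, concatenates, and converts channels to a single saliency map."""
    def __init__(self, channels, levels=4):
        super().__init__()
        self.head = nn.Conv2d(levels * channels, 1, kernel_size=3, padding=1)

    def forward(self, cms, out_size):
        P = [F.interpolate(cm, size=out_size, mode='bilinear', align_corners=False) for cm in cms]
        return self.head(torch.cat(P, dim=1))                # saliency map

# wiring sketch: 4 paired feature levels -> 4 MFMs -> decoder -> saliency map
channels = 256
mfms = nn.ModuleList([StandInMFM(channels) for _ in range(4)])
decoder = StandInDecoder(channels)
fr = [torch.randn(1, channels, r, r) for r in (56, 28, 14, 7)]   # fr2..fr5
fd = [torch.randn(1, channels, r, r) for r in (56, 28, 14, 7)]   # fd2..fd5
cms = [mfm(fr_i, fd_i) for mfm, fr_i, fd_i in zip(mfms, fr, fd)]
saliency = decoder(cms, out_size=(224, 224))
```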
In this embodiment, the layered RGB modal features of the RGB image and the layered depth modal features of the depth image are acquired; multi-stage cross-modal feature fusion is performed on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance strategy to obtain the multi-level cross-modal fusion features; and the multi-level cross-modal fusion features are fused bidirectionally based on a bidirectional fusion structure to obtain layered salient fusion features of different scales, which are then fused across scales to obtain the salient target image corresponding to the RGB image and the depth image. In this way, an attention mechanism is adopted instead of directly using and fusing the layered RGB modal features and the layered depth modal features, which avoids introducing useless or redundant information from the modal features and improves salient object detection performance; cross-modal feature fusion with a designed cross-modal guidance strategy fully exploits the effectiveness and complementarity among cross-modal features, reducing the influence of poor-quality depth images while forming a more accurate cross-modal salient feature expression; and a designed bidirectional fusion structure performs multi-scale fusion of the multi-level cross-modal fusion features, effectively aggregating high-level and low-level features across levels, further improving salient object detection performance and the robustness of the detection algorithm, and thereby solving the technical problem that the detection capability of existing RGB-D image salient object detection methods is limited.
Further, not shown in the drawings, a second embodiment of the RGB-D image saliency target detection method according to the present invention is proposed based on the above-mentioned first embodiment shown in fig. 2. In the present embodiment, step S20 includes:
in the first stage, respectively using a spatial attention mechanism to perform feature screening on the layered RGB modal features and the layered depth modal features to obtain RGB significant feature responses of the layered RGB modal features and depth significant feature responses of the layered depth modal features;
based on a cross-modal guidance strategy, guiding the layered depth modal characteristics to perform characteristic re-screening by using the RGB significant characteristic response so as to obtain RGB depth significant characteristic response;
in the second stage, the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response are subjected to confrontation combination to obtain a plurality of confrontation features, and the plurality of confrontation features are fused into a multi-level cross-modal fusion feature of the layered RGB modal feature and the layered depth modal feature.
In this embodiment, the first stage screens the layered RGB modal features and the layered depth modal features using a spatial attention mechanism and a cross-modal guidance strategy, and the second stage is the adversarial combination stage. Unlike methods that directly use and fuse RGB and depth cross-modal feature information, which introduces useless or redundant information, an attention mechanism is used in the first stage to generate useful multi-feature responses; in the second stage, confrontation features are generated by further confronting and combining the generated responses; and finally the confrontation features are combined into accurate multi-level cross-modal fusion features.
As shown in fig. 4, fig. 4 is a schematic diagram of the multi-stage fusion strategy. In the figure, fdi is the layered depth modal feature and fri is the layered RGB modal feature. In Stage 1 (the first stage), SA represents the spatial attention module and CGA represents the cross-modal guided attention module; Fdi is the depth salient feature response, Fri is the RGB salient feature response, and Frdi is the RGB-depth salient feature response. In Stage 2 (the second stage), ADD denotes element-wise addition, and AD'D, ARD' and AL correspond to the plurality of confrontation features; cmi is the multi-level cross-modal fusion feature. In the first stage, fdi and fri are each input into an SA module, which highlights the salient objects in the RGB and depth modal features and suppresses cluttered background information, yielding Fdi and Fri. Then fdi and Fri are taken as inputs of the CGA module, and Fri is used to guide fdi so that effective features are screened out of fdi, yielding Frdi. In the second stage, Fri, Fdi and Frdi are combined adversarially using multiplication to obtain AD'D, ARD' and AL. Finally, parallel pooling is used to fuse AD'D, ARD' and AL into cmi.
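As a minimal sketch of the first-stage screening, the following assumes a standard spatial-attention formulation (channel-wise average/max maps, a convolution, and a Sigmoid); the SA module in the patent may differ in its exact design:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Illustrative spatial attention: highlights salient regions of one
    modality's feature map and suppresses cluttered background."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg_map = torch.mean(x, dim=1, keepdim=True)      # channel-wise average
        max_map, _ = torch.max(x, dim=1, keepdim=True)    # channel-wise max
        weight = self.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * weight                                  # re-weighted feature response

# first-stage screening of both modalities (shapes are placeholders)
sa = SpatialAttention()
fr_i = torch.randn(1, 256, 28, 28)   # layered RGB modal feature at one level
fd_i = torch.randn(1, 256, 28, 28)   # layered depth modal feature at the same level
Fr_i, Fd_i = sa(fr_i), sa(fd_i)      # RGB / depth salient feature responses
```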
Further, in this embodiment, the step of guiding the layered depth modal feature to perform feature re-screening by using the RGB significant feature response based on the cross-modal guidance policy to obtain an RGB depth significant feature response includes:
performing parallel asymmetric convolution on the hierarchical depth modal features to roughly locate a preliminary significant region in the hierarchical depth modal features;
performing fusion guidance on the preliminary salient region by using the RGB feature responses based on the cross-modal guidance strategy so as to locate a target salient region in the preliminary salient region;
and acquiring a feature weight corresponding to the layered depth modal feature of the positioned target significant region, and performing weighted fusion output on the layered depth modal feature of the positioned target significant region and the layered depth modal feature according to the feature weight to serve as the RGB depth significant feature response.
In this embodiment, in the first stage, in order to effectively select important information in the RGB and depth modal features, a spatial attention module is used to highlight salient objects in the RGB and depth modal features and suppress cluttered background information. More importantly, considering the influence of poor depth map quality, the invention designs a cross-modal guidance module whose core idea is that, when an unreliable depth map is encountered, the enhanced RGB modal feature is used to guide the poorer-quality depth modal feature so that useful features are screened out of it, rather than simply discarding the low-quality depth map or using it directly while ignoring its quality. As shown in fig. 5, fig. 5 is a schematic diagram of the cross-modal guidance strategy. For the depth modal feature fdi of each level, the preliminary salient region is first roughly located by parallel 1x3 and 3x1 asymmetric convolutions; fusion guidance by Fri then further locates the more likely salient regions and filters out erroneous salient regions that may appear in the single depth map; finally, the guided depth modal feature map weight is obtained through splicing and a Sigmoid function and is multiplied by the original input fdi to obtain a more accurate fused feature expression Frdi.
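A hedged sketch of a CGA-style block following the description above; the fusion head, channel counts and the way the RGB guidance is injected are assumptions:

```python
import torch
import torch.nn as nn

class CrossModalGuidedAttention(nn.Module):
    """Illustrative CGA block: parallel 1x3 / 3x1 convolutions coarsely locate
    salient regions in the depth feature, the enhanced RGB response guides the
    fusion, and a Sigmoid weight re-screens the original depth feature."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv_1x3 = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.conv_3x1 = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.fuse = nn.Conv2d(channels * 2, 1, kernel_size=1)  # assumed splicing + fusion head
        self.sigmoid = nn.Sigmoid()

    def forward(self, Fr_i, fd_i):
        coarse = self.conv_1x3(fd_i) + self.conv_3x1(fd_i)     # preliminary salient region
        guided = torch.cat([coarse, Fr_i], dim=1)               # guidance by the RGB response
        weight = self.sigmoid(self.fuse(guided))                # guided depth feature weight
        return fd_i * weight                                    # Frd_i: re-screened depth response

# usage sketch
cga = CrossModalGuidedAttention(256)
Frd_i = cga(torch.randn(1, 256, 28, 28), torch.randn(1, 256, 28, 28))
```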
Further, in this embodiment, in the second stage, the step of performing countermeasure combination on the RGB significant feature response, the depth significant feature response, and the RGB depth significant feature response to obtain a plurality of countermeasure features, so as to fuse the plurality of countermeasure features into a multi-level cross-modal fusion feature of the layered RGB modal feature and the layered depth modal feature includes:
in the second stage, performing countermeasure combination on the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response by using multiplication to obtain a plurality of countermeasure features;
and respectively distributing different weights to the plurality of antagonistic features by using parallel pooling operation, and performing convolution and addition fusion operation on the plurality of antagonistic features distributed with different weights to obtain the multilevel trans-modal fusion feature.
In this embodiment, the adversarial combination between cross-modal features is performed using multiplication; this combination enhances the commonly salient regions and suppresses the inconsistency between the cross-modal features. At the same time, parallel pooling operations assign different weights to the combination results, adaptively highlighting important salient feature expressions and suppressing irrelevant background features. The cross-modal fusion features generated in this way are influenced not only by a modality's own features but also by the opposing modality's features.
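The second stage could be sketched as follows; the exact pairing of the three responses in the multiplicative combinations and the weighting head are assumptions inferred from the description:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdversarialCombination(nn.Module):
    """Illustrative second stage: element-wise products act as the adversarial
    combinations of the cross-modal responses, parallel global average / max
    pooling assigns channel weights to each combination, and the re-weighted
    features are fused by convolution and addition."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.weight_head = nn.Sequential(nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid())
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def _pooled_weight(self, a):
        # parallel pooling: global average and global max, then a shared 1x1 conv + Sigmoid
        return self.weight_head(F.adaptive_avg_pool2d(a, 1) + F.adaptive_max_pool2d(a, 1))

    def forward(self, Fr_i, Fd_i, Frd_i):
        combos = [Fr_i * Fd_i, Fr_i * Frd_i, Fd_i * Frd_i]      # multiplicative confrontation
        weighted = [a * self._pooled_weight(a) for a in combos]  # adaptively re-weighted
        return self.out_conv(sum(weighted))                      # cross-modal fusion feature cm_i

# usage sketch
mfm_stage2 = AdversarialCombination(256)
cm_i = mfm_stage2(torch.randn(1, 256, 28, 28), torch.randn(1, 256, 28, 28), torch.randn(1, 256, 28, 28))
```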
The process of the multi-stage fusion strategy described above is expressed by a set of formulas (shown only as images in the published document and not reproduced here), where the notation is as follows: Wk1 and Wk2 denote convolution operations; x is the input of SA; δ and σ denote the ReLU activation function and the Sigmoid function, respectively; fri (i = 2, 3, 4, 5) and fdi (i = 2, 3, 4, 5) denote the paired side outputs of the RGB and Depth backbones; ⊗ and ⊕ denote element-wise multiplication and element-wise addition, respectively; a 1x1 convolution uniformly converts the input to 256 channels; Conv, Conv1 and Conv2 denote convolution operations with their respective per-layer convolution parameters; Cat denotes the splicing (concatenation) operation; and AvgPooling and MaxPooling denote global average pooling and global max pooling operations, respectively.
Further, a spatial attention module is used to highlight salient objects in the RGB and depth modal features and suppress cluttered background information; the enhanced RGB modal features are used to guide the poorer-quality depth modal features so that useful features are screened out of them, instead of simply discarding a low-quality depth map or using it directly while ignoring its quality, which locates the salient regions in the depth modal features more accurately; the adversarial combination between cross-modal features by multiplication enhances the commonly salient regions and suppresses cross-modal inconsistency; and the parallel pooling operations that assign different weights to the combination results adaptively highlight important salient feature expressions and suppress irrelevant background features.
Further, not shown in the drawings, a third embodiment of the RGB-D image saliency target detection method according to the present invention is proposed based on the first embodiment shown in fig. 2. In the present embodiment, step S30 includes:
performing high-low level fusion on the cross-modal fusion features of each level in the multi-level cross-modal fusion features by a top-down path and a bottom-up path to obtain the layered significant fusion features;
performing upsampling according to the size of the RGB image to obtain a plurality of target sub-images corresponding to the layered significant fusion features;
and splicing the multiple target sub-images and performing convolution channel number conversion to obtain the salient target image.
In this embodiment, the invention realizes the multi-scale fusion of the multi-level cross-modal fusion features by designing a bidirectional multi-scale fusion decoder (BMD). Different levels of a deep network contain different feature responses to salient objects. In particular, the deep layers of the network provide discriminative and integrated semantic information, while the lower layers contain more local detail information about salient objects. In order to locate and detect the global and local information of salient objects effectively, a multi-scale fusion method based on a bidirectional structure is designed to aggregate the multi-scale, comprehensive feature information of each level along top-down and bottom-up paths. First, the cross-modal fusion features cm_i (i = 2,3,4,5) output by the MFM are fed to the FF modules along the top-down path. Then, for the MF module of each level, the fused feature F_j (j = 2,3,4,5) output by the FF module of the corresponding level and the cross-modal fusion feature cm_i (i = 2,3,4,5) output by the MFM are integrated in a bottom-up manner. In this way, the multi-scale cross-modal fusion features of the high and low levels are fused, and the overall structure and the corresponding detail information of the object become more prominent.
Finally, the fused features S_k (k = 2,3,4,5) from the MF modules are first up-sampled to the original size to obtain P_i (i = 2,3,4,5), which are then spliced together, and the final saliency map S_map is obtained by a 3 × 3 convolution that converts the number of channels. The process is represented as:

P_i = up(S_k);

S_map = Conv(Cat(P_2, P_3, P_4, P_5)).

wherein Conv represents a 3 × 3 convolution, Cat represents the splicing operation, and up represents an up-sampling operation using bilinear interpolation.
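A minimal sketch of this prediction head follows, assuming 64-channel MF outputs from four levels and a single-channel output map; the channel counts and the final Sigmoid are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyHead(nn.Module):
    def __init__(self, in_channels=64, levels=4):
        super().__init__()
        # A 3x3 convolution converts the concatenated maps to a single-channel saliency map.
        self.conv = nn.Conv2d(in_channels * levels, 1, kernel_size=3, padding=1)

    def forward(self, s_list, image_size):
        # Bilinearly up-sample every level's fused feature S_k to the RGB image size (P_i = up(S_k)).
        p_list = [F.interpolate(s, size=image_size, mode="bilinear", align_corners=False)
                  for s in s_list]
        # S_map = Conv(Cat(P_2, P_3, P_4, P_5))
        return torch.sigmoid(self.conv(torch.cat(p_list, dim=1)))

# Usage: s_map = SaliencyHead()([s2, s3, s4, s5], image_size=(352, 352))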
Further, in this embodiment, the step of performing high-low level fusion on the cross-modal fusion features of each level in the multi-level cross-modal fusion features through a top-down and bottom-up bidirectional path to obtain the hierarchical significant fusion feature includes:
in a feature fusion stage, performing first convolution operation on cross-modal fusion features of each level in the multi-level cross-modal fusion features to obtain a multi-level first convolution result, and fusing the first convolution result of each level in the multi-level first convolution result with a first convolution result of a corresponding high level respectively in a top-down path to obtain forward fusion results corresponding to each level and different scales;
in the multi-scale fusion stage, performing second convolution operation on the cross-modal fusion features of each level to obtain a multi-level second convolution result, and fusing the second convolution result of each level in the multi-level second convolution result with a second convolution result corresponding to a low level and a forward fusion result corresponding to the same level by a path from bottom to top to obtain reverse fusion results corresponding to each level and different scales;
and passing the reverse fusion results corresponding to each level and different scales through a preset module connected with a channel attention mechanism and a space attention mechanism in parallel to obtain the layered significant fusion features.
In the present embodiment, the first convolution operation refers to a 3 × 3 convolution operation and the second convolution operation refers to a 1 × 1 convolution operation. The feature fusion stage refers to the FF module, and the multi-scale fusion stage refers to the MF module. The "forward" in the forward fusion results refers to the top-down path direction, and the "reverse" in the reverse fusion results refers to the bottom-up path direction. The preset module refers to a Spatial and Channel Squeeze & Excitation (scSE) block, which connects a channel attention mechanism and a spatial attention mechanism in parallel, as sketched below.
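The scSE block itself is a published attention module (concurrent spatial and channel Squeeze & Excitation); the sketch below shows its usual form, with an arbitrarily chosen reduction ratio, and is not taken from the patent text.

import torch
import torch.nn as nn

class SCSEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel attention branch: squeeze spatially, excite per channel.
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention branch: squeeze channels, excite per position.
        self.sse = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        # The two excitations are applied in parallel and combined by addition.
        return x * self.cse(x) + x * self.sse(x)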
As shown in fig. 6, fig. 6 is a detailed view of the bidirectional multi-scale decoder. In the FF module in the figure, the cross-modal fusion feature cm_i is first converted into 64 channels by a 3×3 convolution, simply fused with the high-level input feature F_{j+1}, and then added to F_{j+1} to generate the current-level FF fused output F_j. Meanwhile, in the MF module, the cross-modal fusion feature cm_i is converted into 64 channels by a 1×1 convolution and additively fused with the bottom-up low-level feature S_{k-1} and the same-level FF output F_j; the details of the salient objects are then further refined by the scSE module to obtain the output of the MF module. The specific process is formulated as follows:
(The equations of the FF and MF fusion process are given as images in the original publication and cannot be reproduced from the text; the symbols they use are defined below.)
In these equations, ⊗ and ⊕ denote element-wise multiplication and element-wise addition, respectively; up refers to 2× up-sampling by bilinear interpolation; the 3×3 convolution with a PReLU activation function and the 1×1 convolution with a ReLU activation function uniformly convert the number of output channels of cm_i (i = 2,3,4,5) from 256 to 64; W_i and W_{j+1} are convolution operations without activation functions; a further convolution term and W_k denote a 3×3 convolution with ReLU activation and a 1×1 convolution, respectively; W_{k-1} is a 3×3 convolution with stride 2 that down-samples the input; and the scSE module used can further capture meaningful detail information.
Further, in the present embodiment, step S10 includes:
the method comprises the steps of obtaining an RGB image to be detected and a depth image in registration with the RGB image, inputting the RGB image and the depth image into a preset double-current deep convolutional network, and extracting layered RGB modal characteristics of the RGB image and layered depth modal characteristics of the depth image in a layered mode.
In this embodiment, the computer acquires an RGB image to be detected and a depth image corresponding to the RGB image, and inputs the RGB image and the depth image into a double-stream deep convolutional network, so as to extract layered RGB modal characteristics of the RGB image and layered depth modal characteristics of the depth image in a layered manner.
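As a concrete illustration of this step, the sketch below builds a dual-stream backbone from two VGG-16 feature extractors and taps four hierarchical side outputs per stream. The choice of VGG-16, the tap positions and the replication of the single-channel depth map to three channels are assumptions; the patent only specifies a preset dual-stream deep convolutional network.

import torch
import torch.nn as nn
from torchvision.models import vgg16

class DualStreamBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Two independent streams: one for the RGB image, one for the registered depth map.
        # Pretrained ImageNet weights could be loaded here in practice.
        self.rgb_stream = vgg16().features
        self.dep_stream = vgg16().features
        self.tap_layers = {8, 15, 22, 29}   # ends of conv2..conv5 blocks -> 4 hierarchy levels

    def extract(self, stream, x):
        feats = []
        for idx, layer in enumerate(stream):
            x = layer(x)
            if idx in self.tap_layers:
                feats.append(x)              # layered (hierarchical) modality features
        return feats

    def forward(self, rgb, depth):
        # Replicate a 1-channel depth map to 3 channels so it fits a standard backbone.
        depth3 = depth.repeat(1, 3, 1, 1) if depth.shape[1] == 1 else depth
        return self.extract(self.rgb_stream, rgb), self.extract(self.dep_stream, depth3)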
Furthermore, by designing a multi-scale fusion mode based on a bidirectional structure, the multi-scale, comprehensive feature information of each level is aggregated along top-down and bottom-up paths, so that the global and local information of salient objects can be located and detected more effectively. By adding the bottom-up path and fusing its features with the corresponding-level top-down features in the multi-scale fusion module, the capability of detecting object details is improved and the problem that high-level semantic information is diluted during top-down propagation is alleviated. By combining the multi-level high-level semantic information with the low-level detail information, the multi-scale detection performance of the algorithm is improved.
The invention also provides an RGB-D image saliency target detection device, which comprises the following modules:
the modal characteristic acquisition module is used for acquiring the layered RGB modal characteristics of the RGB image and the layered depth modal characteristics of the depth image;
a fusion feature obtaining module, configured to perform multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance policy, to obtain multi-stage cross-modal fusion features of the layered RGB modal features and the layered depth modal features;
and the target image acquisition module is used for carrying out two-way fusion on the multi-level cross-modal fusion features based on a two-way fusion structure to obtain different scales of layered significant fusion features, and carrying out multi-scale fusion on the layered significant fusion features to obtain significant target images corresponding to the RGB images and the depth images.
The invention also provides RGB-D image saliency target detection equipment.
The RGB-D image saliency object detection apparatus comprises a processor, a memory and an RGB-D image saliency object detection program stored on the memory and executable on the processor, wherein the RGB-D image saliency object detection program, when executed by the processor, implements the steps of the RGB-D image saliency object detection method as described above.
For the specific implementation when the RGB-D image saliency target detection program is executed, reference may be made to the embodiments of the RGB-D image saliency target detection method of the present invention, which are not described herein again.
The invention also provides a computer readable storage medium.
The computer readable storage medium of the present invention stores thereon an RGB-D image saliency object detection program that, when executed by a processor, implements the steps of the RGB-D image saliency object detection method as described above.
For the specific implementation when the RGB-D image saliency target detection program is executed, reference may be made to the embodiments of the RGB-D image saliency target detection method of the present invention, which are not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A RGB-D image saliency target detection method is characterized in that the RGB-D image comprises an RGB image to be detected and a depth image registered with the RGB image, and the RGB-D image saliency target detection method comprises the following steps:
acquiring layered RGB modal characteristics of the RGB image and layered depth modal characteristics of the depth image;
performing multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance strategy to obtain multi-stage cross-modal fusion features of the layered RGB modal features and the layered depth modal features;
and carrying out two-way fusion on the multi-level cross-modal fusion features based on a two-way fusion structure to obtain layered significant fusion features with different scales, and carrying out multi-scale fusion on the layered significant fusion features to obtain significant target images corresponding to the RGB images and the depth images.
2. The RGB-D image saliency target detection method of claim 1, wherein the step of performing multi-stage cross-modal feature fusion of the layered RGB modal features with the layered depth modal features based on an attention mechanism and cross-modal guidance strategy to obtain a multi-level cross-modal fusion feature of the layered RGB modal features with the layered depth modal features comprises:
in the first stage, respectively using a spatial attention mechanism to perform feature screening on the layered RGB modal features and the layered depth modal features to obtain RGB significant feature responses of the layered RGB modal features and depth significant feature responses of the layered depth modal features;
based on a cross-modal guidance strategy, guiding the layered depth modal characteristics to perform characteristic re-screening by using the RGB significant characteristic response so as to obtain RGB depth significant characteristic response;
in the second stage, performing antagonistic combination on the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response to obtain a plurality of antagonistic features, and fusing the plurality of antagonistic features into the multi-level cross-modal fusion features of the layered RGB modal features and the layered depth modal features.
3. The RGB-D image saliency target detection method of claim 2, wherein the step of using the RGB saliency feature response to guide the feature re-screening of the hierarchical depth modality features to obtain RGB depth saliency feature response based on a cross-modality guidance strategy comprises:
performing parallel asymmetric convolution on the hierarchical depth modal features to roughly locate a preliminary significant region in the hierarchical depth modal features;
performing fusion guidance on the preliminary salient region by using the RGB feature responses based on the cross-modal guidance strategy so as to locate a target salient region in the preliminary salient region;
and acquiring a feature weight corresponding to the layered depth modal feature of the positioned target significant region, and performing weighted fusion output on the layered depth modal feature of the positioned target significant region and the layered depth modal feature according to the feature weight to serve as the RGB depth significant feature response.
4. The RGB-D image saliency target detection method of claim 2, wherein the step of, in the second stage, performing antagonistic combination on the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response to obtain a plurality of antagonistic features, and fusing the plurality of antagonistic features into the multi-level cross-modal fusion features of the layered RGB modal features and the layered depth modal features, comprises:
in the second stage, performing antagonistic combination on the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response by using multiplication to obtain a plurality of antagonistic features;
and respectively assigning different weights to the plurality of antagonistic features by using parallel pooling operations, and performing convolution and addition fusion on the antagonistic features to which the different weights have been assigned, so as to obtain the multi-level cross-modal fusion features.
5. The RGB-D image saliency target detection method according to claim 1, wherein the step of performing bi-directional fusion on the multi-level cross-modal fusion features based on a bi-directional fusion structure to obtain hierarchical saliency fusion features of different scales, and performing multi-scale fusion on the hierarchical saliency fusion features to obtain saliency target images corresponding to the RGB image and the depth image includes:
performing high-low level fusion on the cross-modal fusion features of each level in the multi-level cross-modal fusion features by a top-down path and a bottom-up path to obtain the layered significant fusion features;
performing upsampling according to the size of the RGB image to obtain a plurality of target sub-images corresponding to the layered significant fusion features;
and splicing the multiple target sub-images and performing convolution channel number conversion to obtain the salient target image.
6. The RGB-D image saliency target detection method according to claim 5, wherein the step of performing high-low level fusion of the cross-modal fusion features of each level of the multi-level cross-modal fusion features in a top-down and bottom-up bi-directional path to obtain the hierarchical saliency fusion features comprises:
in a feature fusion stage, performing first convolution operation on cross-modal fusion features of each level in the multi-level cross-modal fusion features to obtain a multi-level first convolution result, and fusing the first convolution result of each level in the multi-level first convolution result with a first convolution result of a corresponding high level respectively in a top-down path to obtain forward fusion results corresponding to each level and different scales;
in the multi-scale fusion stage, performing second convolution operation on the cross-modal fusion features of each level to obtain a multi-level second convolution result, and fusing the second convolution result of each level in the multi-level second convolution result with a second convolution result corresponding to a low level and a forward fusion result corresponding to the same level by a path from bottom to top to obtain reverse fusion results corresponding to each level and different scales;
and passing the reverse fusion results corresponding to each level and different scales through a preset module connected with a channel attention mechanism and a space attention mechanism in parallel to obtain the layered significant fusion features.
7. The RGB-D image saliency target detection method of claim 1, wherein the step of acquiring hierarchical RGB modality features of the RGB image and hierarchical depth modality features of the depth image comprises:
the method comprises the steps of obtaining an RGB image to be detected and a depth image in registration with the RGB image, inputting the RGB image and the depth image into a preset double-current deep convolutional network, and extracting layered RGB modal characteristics of the RGB image and layered depth modal characteristics of the depth image in a layered mode.
8. An RGB-D image saliency target detection apparatus characterized by comprising:
the modal characteristic acquisition module is used for acquiring the layered RGB modal characteristics of the RGB image and the layered depth modal characteristics of the depth image;
a fusion feature obtaining module, configured to perform multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance policy, to obtain multi-stage cross-modal fusion features of the layered RGB modal features and the layered depth modal features;
and the target image acquisition module is used for carrying out two-way fusion on the multi-level cross-modal fusion features based on a two-way fusion structure to obtain different scales of layered significant fusion features, and carrying out multi-scale fusion on the layered significant fusion features to obtain significant target images corresponding to the RGB images and the depth images.
9. An RGB-D image saliency target detection apparatus characterized by comprising: a memory, a processor and an RGB-D image saliency object detection program stored on the memory and executable on the processor, the RGB-D image saliency object detection program when executed by the processor implementing the steps of the RGB-D image saliency object detection method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon an RGB-D image saliency object detection program that, when executed by a processor, implements the steps of the RGB-D image saliency object detection method according to any one of claims 1 to 7.
CN202010637797.1A 2020-07-02 2020-07-02 RGB-D image saliency target detection method, device, equipment and storage medium Pending CN111967477A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010637797.1A CN111967477A (en) 2020-07-02 2020-07-02 RGB-D image saliency target detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111967477A true CN111967477A (en) 2020-11-20

Family

ID=73360962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010637797.1A Pending CN111967477A (en) 2020-07-02 2020-07-02 RGB-D image saliency target detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111967477A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200160559A1 (en) * 2018-11-16 2020-05-21 Uatc, Llc Multi-Task Multi-Sensor Fusion for Three-Dimensional Object Detection
CN110555434A (en) * 2019-09-03 2019-12-10 浙江科技学院 method for detecting visual saliency of three-dimensional image through local contrast and global guidance
CN111242181A (en) * 2020-01-03 2020-06-05 大连民族大学 RGB-D salient object detector based on image semantics and details

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
T. D'Orazio et al.: "Recent trends in gesture recognition: how depth data has improved classical approaches", Image and Vision Computing, vol. 52, 1 August 2016 (2016-08-01) *
Liu Zhengyi; Duan Quntao; Shi Song; Zhao Peng: "RGB-D image saliency detection based on multi-modal feature fusion supervision", Journal of Electronics & Information Technology, no. 04, 15 April 2020 (2020-04-15) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113077491A (en) * 2021-04-02 2021-07-06 安徽大学 RGBT target tracking method based on cross-modal sharing and specific representation form
CN113222003A (en) * 2021-05-08 2021-08-06 北方工业大学 RGB-D-based indoor scene pixel-by-pixel semantic classifier construction method and system
CN113222003B (en) * 2021-05-08 2023-08-01 北方工业大学 Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D
CN113298154A (en) * 2021-05-27 2021-08-24 安徽大学 RGB-D image salient target detection method
CN113298154B (en) * 2021-05-27 2022-11-11 安徽大学 RGB-D image salient object detection method
CN113538347A (en) * 2021-06-29 2021-10-22 中国电子科技集团公司电子科学研究院 Image detection method and system based on efficient bidirectional path aggregation attention network
CN113538347B (en) * 2021-06-29 2023-10-27 中国电子科技集团公司电子科学研究院 Image detection method and system based on efficient bidirectional path aggregation attention network
CN113902783A (en) * 2021-11-19 2022-01-07 东北大学 Three-modal image fused saliency target detection system and method
CN113902783B (en) * 2021-11-19 2024-04-30 东北大学 Three-mode image fused saliency target detection system and method
CN114170174A (en) * 2021-12-02 2022-03-11 沈阳工业大学 CLANet steel rail surface defect detection system and method based on RGB-D image
CN114170174B (en) * 2021-12-02 2024-01-23 沈阳工业大学 CLANet steel rail surface defect detection system and method based on RGB-D image

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination