CN111967477A - RGB-D image saliency target detection method, device, equipment and storage medium

RGB-D image saliency target detection method, device, equipment and storage medium

Info

Publication number
CN111967477A
Authority
CN
China
Prior art keywords
modal
rgb
fusion
features
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010637797.1A
Other languages
Chinese (zh)
Inventor
高伟
廖桂标
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN202010637797.1A
Publication of CN111967477A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/90 Determination of colour characteristics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an RGB-D image saliency target detection method, device and equipment and a computer-readable storage medium. Instead of directly using and fusing the layered RGB modal features and the layered depth modal features, the method applies an attention mechanism, which avoids introducing useless or redundant information from the modal features and improves salient object detection performance. A cross-modal guidance strategy is designed to perform multi-stage cross-modal feature fusion, so that the effectiveness and complementarity among cross-modal features can be fully exploited, reducing the influence of poor-quality depth images while forming a more accurate cross-modal salient feature expression. A bidirectional fusion structure is designed to perform multi-scale fusion of the multi-level cross-modal fusion features, so that high-level and low-level features across levels can be effectively aggregated, further improving salient object detection performance and the robustness of the detection algorithm.

Description

RGB-D image saliency target detection method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of image processing, in particular to a method, a device and equipment for detecting a salient object of an RGB-D image and a computer-readable storage medium.
Background
With the rapid development of computer image processing technology, two main approaches to salient object detection in RGB-D images have emerged. The first is based on hand-crafted feature extraction and prior knowledge. Such methods generally compute priors in the RGB image and the depth image, such as background priors, center priors and depth priors, compare features such as color, brightness, texture and depth across regions, and fuse them by multiplication or addition together with certain post-processing techniques to detect the salient object. The second is based on deep learning. These approaches can generally be divided into early, middle and late fusion. In early fusion, features of the RGB image and the depth image are extracted by a single shallow network, fused, and then fed into a subsequent single-stream network to learn high-level features. Middle fusion usually realizes cross-modal fusion by extracting hierarchical features with a dual-stream network. Late fusion combines the deepest feature information to generate a fused saliency map. With the wide application of depth cameras and portable intelligent devices, saliency detection on RGB-D images is also increasingly used in scenes such as robot navigation. However, because depth map quality is easily affected by noise from sensor temperature, background illumination, and the distance and reflectivity of the observed object, the depth map may contain erroneous or missing regions even when existing RGB-D image salient object detection methods are used. Such unreliable depth information can mislead the salient object detection process for the whole RGB-D image, so the detection capability of existing RGB-D image salient object detection methods is limited.
Disclosure of Invention
The invention mainly aims to provide a method for detecting a salient object of an RGB-D image, aiming at solving the technical problem that the detection capability of the existing method for detecting the salient object of the RGB-D image is limited.
In order to achieve the above object, the present invention provides a method for detecting a RGB-D image saliency target, wherein the RGB-D image includes an RGB image to be detected and a depth image registered with the RGB image, and the method for detecting a RGB-D image saliency target includes:
acquiring layered RGB modal characteristics of the RGB image and layered depth modal characteristics of the depth image;
performing multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance strategy to obtain multi-stage cross-modal fusion features of the layered RGB modal features and the layered depth modal features;
and carrying out two-way fusion on the multi-level cross-modal fusion features based on a two-way fusion structure to obtain layered significant fusion features with different scales, and carrying out multi-scale fusion on the layered significant fusion features to obtain significant target images corresponding to the RGB images and the depth images.
Optionally, the step of performing multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance policy to obtain multi-stage cross-modal fusion features of the layered RGB modal features and the layered depth modal features includes:
in the first stage, respectively using a spatial attention mechanism to perform feature screening on the layered RGB modal features and the layered depth modal features to obtain RGB significant feature responses of the layered RGB modal features and depth significant feature responses of the layered depth modal features;
based on a cross-modal guidance strategy, guiding the layered depth modal characteristics to perform characteristic re-screening by using the RGB significant characteristic response so as to obtain RGB depth significant characteristic response;
in the second stage, the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response are subjected to confrontation combination to obtain a plurality of confrontation features, and the plurality of confrontation features are fused into a multi-level cross-modal fusion feature of the layered RGB modal feature and the layered depth modal feature.
Optionally, the step of guiding the layered depth modal feature to perform feature re-screening by using the RGB significant feature response based on the cross-modal guidance policy to obtain an RGB depth significant feature response includes:
performing parallel asymmetric convolution on the hierarchical depth modal features to roughly locate a preliminary significant region in the hierarchical depth modal features;
performing fusion guidance on the preliminary salient region by using the RGB feature responses based on the cross-modal guidance strategy so as to locate a target salient region in the preliminary salient region;
and acquiring a feature weight corresponding to the layered depth modal feature of the positioned target significant region, and performing weighted fusion output on the layered depth modal feature of the positioned target significant region and the layered depth modal feature according to the feature weight to serve as the RGB depth significant feature response.
Optionally, in the second stage, the confrontation combining the RGB significant feature response, the depth significant feature response, and the RGB depth significant feature response to obtain a plurality of confrontation features, and fusing the plurality of confrontation features into a multi-level cross-modal fusion feature of the layered RGB modal feature and the layered depth modal feature includes:
in the second stage, performing countermeasure combination on the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response by using multiplication to obtain a plurality of countermeasure features;
and respectively distributing different weights to the plurality of antagonistic features by using parallel pooling operation, and performing convolution and addition fusion operation on the plurality of antagonistic features distributed with different weights to obtain the multilevel trans-modal fusion feature.
Optionally, the step of performing bidirectional fusion on the multi-level cross-modal fusion features based on a bidirectional fusion structure to obtain hierarchical significant fusion features of different scales, and performing multi-scale fusion on the hierarchical significant fusion features to obtain significant target images corresponding to the RGB image and the depth image includes:
performing high-low level fusion on the cross-modal fusion features of each level in the multi-level cross-modal fusion features by a top-down path and a bottom-up path to obtain the layered significant fusion features;
performing upsampling according to the size of the RGB image to obtain a plurality of target sub-images corresponding to the layered significant fusion features;
and splicing the multiple target sub-images and performing convolution channel number conversion to obtain the salient target image.
Optionally, the step of performing high-low level fusion on the cross-modal fusion features of each level in the multi-level cross-modal fusion features in a top-down and bottom-up bidirectional path to obtain the hierarchical significant fusion feature includes:
in a feature fusion stage, performing first convolution operation on cross-modal fusion features of each level in the multi-level cross-modal fusion features to obtain a multi-level first convolution result, and fusing the first convolution result of each level in the multi-level first convolution result with a first convolution result of a corresponding high level respectively in a top-down path to obtain forward fusion results corresponding to each level and different scales;
in the multi-scale fusion stage, performing second convolution operation on the cross-modal fusion features of each level to obtain a multi-level second convolution result, and fusing the second convolution result of each level in the multi-level second convolution result with a second convolution result corresponding to a low level and a forward fusion result corresponding to the same level by a path from bottom to top to obtain reverse fusion results corresponding to each level and different scales;
and passing the reverse fusion results corresponding to each level and different scales through a preset module connected with a channel attention mechanism and a space attention mechanism in parallel to obtain the layered significant fusion features.
Optionally, the step of acquiring layered RGB modal characteristics of the RGB image and layered depth modal characteristics of the depth image includes:
the method comprises the steps of obtaining an RGB image to be detected and a depth image in registration with the RGB image, inputting the RGB image and the depth image into a preset double-current deep convolutional network, and extracting layered RGB modal characteristics of the RGB image and layered depth modal characteristics of the depth image in a layered mode.
In order to achieve the above object, the present invention also provides an RGB-D image saliency target detection apparatus comprising:
the modal characteristic acquisition module is used for acquiring the layered RGB modal characteristics of the RGB image and the layered depth modal characteristics of the depth image;
a fusion feature obtaining module, configured to perform multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance policy, to obtain multi-stage cross-modal fusion features of the layered RGB modal features and the layered depth modal features;
and the target image acquisition module is used for carrying out two-way fusion on the multi-level cross-modal fusion features based on a two-way fusion structure to obtain different scales of layered significant fusion features, and carrying out multi-scale fusion on the layered significant fusion features to obtain significant target images corresponding to the RGB images and the depth images.
Optionally, the fused feature obtaining module includes:
a depth response obtaining unit, configured to, in a first stage, perform feature screening on the layered RGB modal features and the layered depth modal features respectively using a spatial attention mechanism, so as to obtain RGB significant feature responses of the layered RGB modal features and depth significant feature responses of the layered depth modal features;
the screening guidance unit is used for guiding the layered depth modal characteristics to perform characteristic re-screening by utilizing the RGB significant characteristic response based on a cross-modal guidance strategy so as to obtain RGB depth significant characteristic response;
and the confrontation combination unit is used for carrying out confrontation combination on the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response at the second stage to obtain a plurality of confrontation features, and fusing the plurality of confrontation features into a multi-level cross-modal fusion feature of the layered RGB modal feature and the layered depth modal feature.
Optionally, the screening guidance unit is further configured to:
performing parallel asymmetric convolution on the hierarchical depth modal features to roughly locate a preliminary significant region in the hierarchical depth modal features;
performing fusion guidance on the preliminary salient region by using the RGB feature responses based on the cross-modal guidance strategy so as to locate a target salient region in the preliminary salient region;
and acquiring a feature weight corresponding to the layered depth modal feature of the positioned target significant region, and performing weighted fusion output on the layered depth modal feature of the positioned target significant region and the layered depth modal feature according to the feature weight to serve as the RGB depth significant feature response.
Optionally, the confrontation combination unit is further configured to:
in the second stage, performing countermeasure combination on the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response by using multiplication to obtain a plurality of countermeasure features;
and respectively distributing different weights to the plurality of antagonistic features by using parallel pooling operation, and performing convolution and addition fusion operation on the plurality of antagonistic features distributed with different weights to obtain the multilevel trans-modal fusion feature.
Optionally, the target image acquiring module further includes:
the bidirectional fusion unit is used for performing high-low level fusion on the cross-modal fusion features of each level in the multi-level cross-modal fusion features through a top-down and bottom-up bidirectional path to obtain the hierarchical significant fusion features;
the target sampling unit is used for performing up-sampling according to the size of the RGB image to obtain a plurality of target sub-images corresponding to the layered significant fusion features;
and the splicing convolution unit is used for splicing the target sub-images and performing convolution channel number conversion to obtain the salient target image.
Optionally, the bidirectional fusion unit is further configured to:
in a feature fusion stage, performing first convolution operation on cross-modal fusion features of each level in the multi-level cross-modal fusion features to obtain a multi-level first convolution result, and fusing the first convolution result of each level in the multi-level first convolution result with a first convolution result of a corresponding high level respectively in a top-down path to obtain forward fusion results corresponding to each level and different scales;
in the multi-scale fusion stage, performing second convolution operation on the cross-modal fusion features of each level to obtain a multi-level second convolution result, and fusing the second convolution result of each level in the multi-level second convolution result with a second convolution result corresponding to a low level and a forward fusion result corresponding to the same level by a path from bottom to top to obtain reverse fusion results corresponding to each level and different scales;
and passing the reverse fusion results corresponding to each level and different scales through a preset module connected with a channel attention mechanism and a space attention mechanism in parallel to obtain the layered significant fusion features.
Optionally, the modal feature acquisition module further includes:
the device comprises a layering extraction unit, a depth image processing unit and a depth image processing unit, wherein the layering extraction unit is used for acquiring an RGB image to be detected and a depth image in which the RGB image is registered, inputting the RGB image and the depth image into a preset double-current deep convolutional network, and extracting layering RGB modal characteristics of the RGB image and layering depth modal characteristics of the depth image in a layering manner.
Further, to achieve the above object, the present invention also provides an RGB-D image saliency target detection apparatus comprising: a memory, a processor and an RGB-D image saliency object detection program stored on said memory and executable on said processor, said RGB-D image saliency object detection program when executed by said processor implementing the steps of the RGB-D image saliency object detection method as described above.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium having stored thereon an RGB-D image saliency object detection program that, when executed by a processor, implements the steps of the RGB-D image saliency object detection method as described above.
The invention provides a method, a device and equipment for detecting a salient object of an RGB-D image and a computer-readable storage medium. The RGB-D image saliency target detection method comprises: acquiring the layered RGB modal characteristics of an RGB image and the layered depth modal characteristics of a depth image; performing multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance strategy to obtain multi-level cross-modal fusion features; and performing bidirectional fusion of the multi-level cross-modal fusion features based on a bidirectional fusion structure to obtain layered salient fusion features of different scales, and performing multi-scale fusion of the layered salient fusion features to obtain the salient target image corresponding to the RGB image and the depth image. In this way, an attention mechanism is adopted instead of directly using and fusing the layered RGB modal features and the layered depth modal features, which avoids introducing useless or redundant information from the modal features and improves salient object detection performance; multi-stage cross-modal feature fusion with a designed cross-modal guidance strategy fully exploits the effectiveness and complementarity among cross-modal features, reducing the influence of poor-quality depth images while forming an accurate cross-modal salient feature expression; and a designed bidirectional fusion structure performs multi-scale fusion of the multi-level cross-modal fusion features, effectively aggregating high-level and low-level features across levels, further improving salient object detection performance and the robustness of the detection algorithm, and thereby solving the technical problem that the detection capability of existing RGB-D image salient object detection methods is limited.
Drawings
FIG. 1 is a schematic diagram of an RGB-D image saliency target detection device of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a RGB-D image saliency target detection method according to a first embodiment of the present invention;
FIG. 3 is a schematic diagram of an overall network structure in a first embodiment of the RGB-D image saliency target detection method of the present invention;
FIG. 4 is a diagram illustrating a multi-stage fusion strategy in a second embodiment of the RGB-D image saliency target detection method of the present invention;
FIG. 5 is a schematic diagram of a cross-modal guidance strategy in a second embodiment of the RGB-D image saliency target detection method of the present invention;
FIG. 6 is a schematic diagram illustrating details of a bi-directional multi-scale decoder according to a third embodiment of the RGB-D image saliency target detection method of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic structural diagram of an RGB-D image saliency target detection apparatus in a hardware operating environment according to an embodiment of the present invention.
The RGB-D image saliency target detection equipment provided by the embodiment of the invention can be a server or a PC (personal computer), or a terminal device such as a smartphone or a tablet computer.
As shown in fig. 1, the RGB-D image saliency target detection apparatus may include: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a memory device separate from the processor 1001 described above.
Those skilled in the art will appreciate that the terminal structure shown in FIG. 1 does not constitute a limitation of the RGB-D image saliency target detection device, which may include more or fewer components than shown, a combination of some components, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and an RGB-D image saliency object detection program.
In the terminal shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to call the RGB-D image saliency object detection program stored in the memory 1005 and perform the following operations:
acquiring layered RGB modal characteristics of the RGB image and layered depth modal characteristics of the depth image;
performing multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance strategy to obtain multi-stage cross-modal fusion features of the layered RGB modal features and the layered depth modal features;
and carrying out two-way fusion on the multi-level cross-modal fusion features based on a two-way fusion structure to obtain layered significant fusion features with different scales, and carrying out multi-scale fusion on the layered significant fusion features to obtain significant target images corresponding to the RGB images and the depth images.
Further, the step of performing multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance policy to obtain multi-stage cross-modal fusion features of the layered RGB modal features and the layered depth modal features includes:
in the first stage, respectively using a spatial attention mechanism to perform feature screening on the layered RGB modal features and the layered depth modal features to obtain RGB significant feature responses of the layered RGB modal features and depth significant feature responses of the layered depth modal features;
based on a cross-modal guidance strategy, guiding the layered depth modal characteristics to perform characteristic re-screening by using the RGB significant characteristic response so as to obtain RGB depth significant characteristic response;
in the second stage, the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response are subjected to confrontation combination to obtain a plurality of confrontation features, and the plurality of confrontation features are fused into a multi-level cross-modal fusion feature of the layered RGB modal feature and the layered depth modal feature.
Further, the step of guiding the layered depth modal feature to perform feature re-screening by using the RGB significant feature response based on the cross-modal guidance policy to obtain an RGB depth significant feature response includes:
performing parallel asymmetric convolution on the hierarchical depth modal features to roughly locate a preliminary significant region in the hierarchical depth modal features;
performing fusion guidance on the preliminary salient region by using the RGB feature responses based on the cross-modal guidance strategy so as to locate a target salient region in the preliminary salient region;
and acquiring a feature weight corresponding to the layered depth modal feature of the positioned target significant region, and performing weighted fusion output on the layered depth modal feature of the positioned target significant region and the layered depth modal feature according to the feature weight to serve as the RGB depth significant feature response.
Further, in the second stage, the confrontation combining the RGB significant feature response, the depth significant feature response, and the RGB depth significant feature response to obtain a plurality of confrontation features, and fusing the plurality of confrontation features into a multi-level cross-modal fusion feature of the layered RGB modal feature and the layered depth modal feature includes:
in the second stage, performing countermeasure combination on the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response by using multiplication to obtain a plurality of countermeasure features;
and respectively distributing different weights to the plurality of antagonistic features by using parallel pooling operation, and performing convolution and addition fusion operation on the plurality of antagonistic features distributed with different weights to obtain the multilevel trans-modal fusion feature.
Further, the step of performing bidirectional fusion on the multi-level cross-modal fusion features based on a bidirectional fusion structure to obtain hierarchical significant fusion features of different scales, and performing multi-scale fusion on the hierarchical significant fusion features to obtain significant target images corresponding to the RGB image and the depth image includes:
performing high-low level fusion on the cross-modal fusion features of each level in the multi-level cross-modal fusion features by a top-down path and a bottom-up path to obtain the layered significant fusion features;
performing upsampling according to the size of the RGB image to obtain a plurality of target sub-images corresponding to the layered significant fusion features;
and splicing the multiple target sub-images and performing convolution channel number conversion to obtain the salient target image.
Further, the step of performing high-low level fusion on the cross-modal fusion features of each level in the multi-level cross-modal fusion features through a top-down and bottom-up bidirectional path to obtain the hierarchical significant fusion feature includes:
in a feature fusion stage, performing first convolution operation on cross-modal fusion features of each level in the multi-level cross-modal fusion features to obtain a multi-level first convolution result, and fusing the first convolution result of each level in the multi-level first convolution result with a first convolution result of a corresponding high level respectively in a top-down path to obtain forward fusion results corresponding to each level and different scales;
in the multi-scale fusion stage, performing second convolution operation on the cross-modal fusion features of each level to obtain a multi-level second convolution result, and fusing the second convolution result of each level in the multi-level second convolution result with a second convolution result corresponding to a low level and a forward fusion result corresponding to the same level by a path from bottom to top to obtain reverse fusion results corresponding to each level and different scales;
and passing the reverse fusion results corresponding to each level and different scales through a preset module connected with a channel attention mechanism and a space attention mechanism in parallel to obtain the layered significant fusion features.
Further, the step of obtaining the layered RGB modal characteristics of the RGB image and the layered depth modal characteristics of the depth image includes:
the method comprises the steps of obtaining an RGB image to be detected and a depth image registered with the RGB image, inputting the RGB image and the depth image into a preset dual-stream deep convolutional network, and hierarchically extracting the layered RGB modal characteristics of the RGB image and the layered depth modal characteristics of the depth image.
Based on the hardware structure, the invention provides various embodiments of the RGB-D image saliency target detection method.
In order to solve the above problems, the invention provides an RGB-D image saliency target detection method that applies an attention mechanism instead of directly using and fusing the layered RGB modal features and the layered depth modal features, avoiding the introduction of useless or redundant information from the modal features and improving salient object detection performance; performs multi-stage cross-modal feature fusion with a designed cross-modal guidance strategy, fully exploiting the effectiveness and complementarity among cross-modal features and reducing the influence of poor-quality depth images while forming an accurate cross-modal salient feature expression; and performs multi-scale fusion of the multi-level cross-modal fusion features with a designed bidirectional fusion structure, effectively aggregating high-level and low-level features across levels and further improving the detection performance for salient objects and the robustness of the detection algorithm, thereby addressing the technical problem that the detection capability of existing RGB-D image salient object detection methods is limited.
Referring to fig. 2, fig. 2 is a schematic flowchart of a first embodiment of a RGB-D image saliency target detection method.
The first embodiment of the present invention provides a method for detecting a significant target of an RGB-D image, where the method for detecting a significant target of an RGB-D image is applied to an encoding end, and the method for detecting a significant target of an RGB-D image includes the following steps:
step S10, acquiring layered RGB modal characteristics of the RGB image and layered depth modal characteristics of the depth image;
In this embodiment, it should be noted that the RGB-D image saliency target detection method of the present invention is applicable to RGB-D images. An RGB-D image actually consists of two images: a normal RGB three-channel color image and a Depth image. The Depth image is similar to a grayscale image, except that each pixel value is the actual distance from the sensor to the object. The RGB image and the Depth image are usually registered, so their pixels correspond one to one. Hereinafter, depth images are uniformly referred to as Depth images.
For an RGB-D image whose salient object currently needs to be detected, the layered modal features of the RGB image and of the depth image are acquired separately. The layered modal features comprise modal features at a plurality of different levels: the layered RGB modal features are the layered modal features of the RGB image, and the layered depth modal features are the layered modal features of the depth image registered with the RGB image. The layered RGB modal features and layered depth modal features are generally extracted with a deep convolutional network. For example, the RGB image and the depth image may each be input into one stream of a dual-stream deep convolutional network such as Res2Net to obtain their respective layered modal features.
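The following PyTorch-style sketch illustrates this step under the assumption of a generic two-stream backbone; ResNet-50 stands in for the Res2Net backbone mentioned above, and the layer names, channel sizes and input resolution are placeholders rather than the patent's actual configuration:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamHierarchicalEncoder(nn.Module):
    """Illustrative two-stream encoder: one backbone per modality.
    ResNet-50 is used here only because it is readily available."""
    def __init__(self):
        super().__init__()
        self.rgb_backbone = models.resnet50()
        self.depth_backbone = models.resnet50()

    def _hierarchy(self, backbone, x):
        x = backbone.conv1(x); x = backbone.bn1(x); x = backbone.relu(x)
        x = backbone.maxpool(x)
        f2 = backbone.layer1(x)   # hierarchical side outputs f_2 .. f_5
        f3 = backbone.layer2(f2)
        f4 = backbone.layer3(f3)
        f5 = backbone.layer4(f4)
        return [f2, f3, f4, f5]

    def forward(self, rgb, depth):
        # the single-channel depth map is simply replicated to 3 channels here (an assumption)
        fr = self._hierarchy(self.rgb_backbone, rgb)                          # layered RGB modal features
        fd = self._hierarchy(self.depth_backbone, depth.repeat(1, 3, 1, 1))   # layered depth modal features
        return fr, fd

# usage sketch
enc = TwoStreamHierarchicalEncoder()
fr, fd = enc(torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224))
```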
Step S20, based on an attention mechanism and a cross-modal guidance strategy, performing multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features to obtain multi-stage cross-modal fusion features of the layered RGB modal features and the layered depth modal features;
In this embodiment, the invention designs a multi-stage fusion strategy, implemented as multi-stage fusion modules (MFM), in which an attention mechanism is combined with a cross-modal guidance strategy. First, an attention mechanism is adopted to highlight salient objects in the layered RGB modal features and the layered depth modal features and to suppress cluttered background information. At the same time, a cross-modal guidance strategy uses the enhanced RGB modal features to guide the poorer-quality depth modal features, so that useful features are screened out of the depth modal features rather than simply discarding a low-quality depth image or using it directly while ignoring its quality. Attention-based screening and cross-modal guided fusion are performed on the RGB modal features and depth modal features of each level to obtain the RGB salient feature responses, the depth salient feature responses, and the RGB-depth salient feature responses obtained after guidance, and these responses are combined adversarially to obtain a plurality of confrontation features. Finally, the plurality of confrontation features are fused to obtain the multi-level cross-modal fusion features.
Step S30, carrying out two-way fusion on the multi-level cross-modal fusion features based on a two-way fusion structure to obtain layered significant fusion features with different scales, and carrying out multi-scale fusion on the layered significant fusion features to obtain significant target images corresponding to the RGB images and the depth images.
In the present embodiment, "bidirectional" in the bidirectional fusion structure means a bidirectional path from top to bottom and from bottom to top. Different levels in the deep network contain different characteristic responses to salient objects. In particular, deep features of the network provide differentiated and integrated semantic information, and lower layers of the network contain more local detail information for salient objects. In order to effectively position and detect global and local information of a salient object, the invention designs a multi-scale fusion mode based on a bidirectional structure, and aggregates multi-scale and comprehensive feature information of each level by a top-down path and a bottom-up path to obtain a layered salient fusion feature. And then sampling, splicing and other operations are carried out on the layered significant fusion characteristics, so that a high-quality significant target image corresponding to the original RGB-D image can be obtained.
As shown in fig. 3, fig. 3 is a schematic diagram of the overall network structure of the present invention. RGB refers to the RGB image to be detected, and Depth refers to the Depth image registered with this RGB image. Conv1 to Conv5 each represent a convolution operation. fr2, fr3, fr4 and fr5 represent the layered RGB image features, and fd2, fd3, fd4 and fd5 represent the layered depth image features. The layered RGB image features fr2, fr3, fr4 and fr5 and the layered depth image features fd2, fd3, fd4 and fd5 are input, level by level, into the corresponding MFM modules to obtain the multi-level cross-modal fusion features. Then the cross-modal fusion features of each level are input into the feature fusion module (FF) and the multi-scale fusion module (MF) of the bi-directional multi-scale decoder (BMD), and high-level and low-level features are fused along the top-down and bottom-up paths to obtain the layered salient fusion features. Finally, the layered salient fusion features are upsampled (UP, bilinear interpolation) to obtain four sub-images P2, P3, P4 and P5, which are spliced and converted in channel number to obtain the final salient target image (saliency map).
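To make the data flow of FIG. 3 easier to follow, the sketch below wires the stages together with trivial stand-in modules; the stand-in MFM and decoder internals, channel counts and level count are illustrative assumptions only and do not reproduce the modules described later:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StandInMFM(nn.Module):
    """Placeholder for a multi-stage fusion module: concatenates the paired
    RGB / depth features of one level and reduces the channel count."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, fr_i, fd_i):
        return self.reduce(torch.cat([fr_i, fd_i], dim=1))   # cm_i

class StandInDecoder(nn.Module):
    """Placeholder for the bi-directional multi-scale decoder: upsamples every
    level, concatenates, and converts channels to a single saliency map."""
    def __init__(self, channels, levels=4):
        super().__init__()
        self.head = nn.Conv2d(levels * channels, 1, kernel_size=3, padding=1)

    def forward(self, cms, out_size):
        P = [F.interpolate(cm, size=out_size, mode='bilinear', align_corners=False) for cm in cms]
        return self.head(torch.cat(P, dim=1))                # saliency map

# wiring sketch: 4 paired feature levels -> 4 MFMs -> decoder -> saliency map
channels = 256
mfms = nn.ModuleList([StandInMFM(channels) for _ in range(4)])
decoder = StandInDecoder(channels)
fr = [torch.randn(1, channels, r, r) for r in (56, 28, 14, 7)]   # fr2..fr5
fd = [torch.randn(1, channels, r, r) for r in (56, 28, 14, 7)]   # fd2..fd5
cms = [mfm(fr_i, fd_i) for mfm, fr_i, fd_i in zip(mfms, fr, fd)]
saliency = decoder(cms, out_size=(224, 224))
```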
In this embodiment, the layered RGB modal features of the RGB image and the layered depth modal features of the depth image are acquired; multi-stage cross-modal feature fusion is performed on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance strategy to obtain the multi-level cross-modal fusion features; and the multi-level cross-modal fusion features are fused bidirectionally based on a bidirectional fusion structure to obtain layered salient fusion features of different scales, which are then fused across scales to obtain the salient target image corresponding to the RGB image and the depth image. In this way, an attention mechanism is adopted instead of directly using and fusing the layered RGB modal features and the layered depth modal features, which avoids introducing useless or redundant information from the modal features and improves salient object detection performance; cross-modal feature fusion with a designed cross-modal guidance strategy fully exploits the effectiveness and complementarity among cross-modal features, reducing the influence of poor-quality depth images while forming a more accurate cross-modal salient feature expression; and a designed bidirectional fusion structure performs multi-scale fusion of the multi-level cross-modal fusion features, effectively aggregating high-level and low-level features across levels, further improving salient object detection performance and the robustness of the detection algorithm, and thereby solving the technical problem that the detection capability of existing RGB-D image salient object detection methods is limited.
Further, not shown in the drawings, a second embodiment of the RGB-D image saliency target detection method according to the present invention is proposed based on the above-mentioned first embodiment shown in fig. 2. In the present embodiment, step S20 includes:
in the first stage, respectively using a spatial attention mechanism to perform feature screening on the layered RGB modal features and the layered depth modal features to obtain RGB significant feature responses of the layered RGB modal features and depth significant feature responses of the layered depth modal features;
based on a cross-modal guidance strategy, guiding the layered depth modal characteristics to perform characteristic re-screening by using the RGB significant characteristic response so as to obtain RGB depth significant characteristic response;
in the second stage, the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response are subjected to confrontation combination to obtain a plurality of confrontation features, and the plurality of confrontation features are fused into a multi-level cross-modal fusion feature of the layered RGB modal feature and the layered depth modal feature.
In this embodiment, the first stage screens the layered RGB modal features and the layered depth modal features using a spatial attention mechanism and a cross-modal guidance strategy, and the second stage is the adversarial combination stage. Unlike methods that directly use and fuse RGB and depth cross-modal feature information, which introduces useless or redundant information, an attention mechanism is used in the first stage to generate useful multi-feature responses; in the second stage, confrontation features are generated by further confronting and combining the generated responses; and finally the confrontation features are combined into accurate multi-level cross-modal fusion features.
As shown in fig. 4, fig. 4 is a schematic diagram of the multi-stage fusion strategy. In the figure, fdi is the layered depth modal feature and fri is the layered RGB modal feature. In Stage 1 (the first stage), SA represents the spatial attention module and CGA represents the cross-modal guided attention module; Fdi is the depth salient feature response, Fri is the RGB salient feature response, and Frdi is the RGB-depth salient feature response. In Stage 2 (the second stage), ADD denotes element-wise addition, and AD'D, ARD' and AL correspond to the plurality of confrontation features; cmi is the multi-level cross-modal fusion feature. In the first stage, fdi and fri are each input into an SA module, which highlights the salient objects in the RGB and depth modal features and suppresses cluttered background information, yielding Fdi and Fri. Then fdi and Fri are taken as inputs of the CGA module, and Fri is used to guide fdi so that effective features are screened out of fdi, yielding Frdi. In the second stage, Fri, Fdi and Frdi are combined adversarially using multiplication to obtain AD'D, ARD' and AL. Finally, parallel pooling is used to fuse AD'D, ARD' and AL into cmi.
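As a minimal sketch of the first-stage screening, the following assumes a standard spatial-attention formulation (channel-wise average/max maps, a convolution, and a Sigmoid); the SA module in the patent may differ in its exact design:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Illustrative spatial attention: highlights salient regions of one
    modality's feature map and suppresses cluttered background."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg_map = torch.mean(x, dim=1, keepdim=True)      # channel-wise average
        max_map, _ = torch.max(x, dim=1, keepdim=True)    # channel-wise max
        weight = self.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * weight                                  # re-weighted feature response

# first-stage screening of both modalities (shapes are placeholders)
sa = SpatialAttention()
fr_i = torch.randn(1, 256, 28, 28)   # layered RGB modal feature at one level
fd_i = torch.randn(1, 256, 28, 28)   # layered depth modal feature at the same level
Fr_i, Fd_i = sa(fr_i), sa(fd_i)      # RGB / depth salient feature responses
```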
Further, in this embodiment, the step of guiding the layered depth modal feature to perform feature re-screening by using the RGB significant feature response based on the cross-modal guidance policy to obtain an RGB depth significant feature response includes:
performing parallel asymmetric convolution on the hierarchical depth modal features to roughly locate a preliminary significant region in the hierarchical depth modal features;
performing fusion guidance on the preliminary salient region by using the RGB feature responses based on the cross-modal guidance strategy so as to locate a target salient region in the preliminary salient region;
and acquiring a feature weight corresponding to the layered depth modal feature of the positioned target significant region, and performing weighted fusion output on the layered depth modal feature of the positioned target significant region and the layered depth modal feature according to the feature weight to serve as the RGB depth significant feature response.
In this embodiment, in the first stage, in order to effectively select important information in the RGB and depth modal features, a spatial attention module is used to highlight salient objects in the RGB and depth modal features and suppress cluttered background information. More importantly, considering the influence of poor depth map quality, the invention designs a cross-modal guidance module whose core idea is that, when an unreliable depth map is encountered, the enhanced RGB modal feature is used to guide the poorer-quality depth modal feature so that useful features are screened out of it, rather than simply discarding the low-quality depth map or using it directly while ignoring its quality. As shown in fig. 5, fig. 5 is a schematic diagram of the cross-modal guidance strategy. For the depth modal feature fdi of each level, the preliminary salient region is first roughly located by parallel 1x3 and 3x1 asymmetric convolutions; fusion guidance by Fri then further locates the more likely salient regions and filters out erroneous salient regions that may appear in the single depth map; finally, the guided depth modal feature map weight is obtained through splicing and a Sigmoid function and is multiplied by the original input fdi to obtain a more accurate fused feature expression Frdi.
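A hedged sketch of a CGA-style block following the description above; the fusion head, channel counts and the way the RGB guidance is injected are assumptions:

```python
import torch
import torch.nn as nn

class CrossModalGuidedAttention(nn.Module):
    """Illustrative CGA block: parallel 1x3 / 3x1 convolutions coarsely locate
    salient regions in the depth feature, the enhanced RGB response guides the
    fusion, and a Sigmoid weight re-screens the original depth feature."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv_1x3 = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.conv_3x1 = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.fuse = nn.Conv2d(channels * 2, 1, kernel_size=1)  # assumed splicing + fusion head
        self.sigmoid = nn.Sigmoid()

    def forward(self, Fr_i, fd_i):
        coarse = self.conv_1x3(fd_i) + self.conv_3x1(fd_i)     # preliminary salient region
        guided = torch.cat([coarse, Fr_i], dim=1)               # guidance by the RGB response
        weight = self.sigmoid(self.fuse(guided))                # guided depth feature weight
        return fd_i * weight                                    # Frd_i: re-screened depth response

# usage sketch
cga = CrossModalGuidedAttention(256)
Frd_i = cga(torch.randn(1, 256, 28, 28), torch.randn(1, 256, 28, 28))
```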
Further, in this embodiment, in the second stage, the step of performing countermeasure combination on the RGB significant feature response, the depth significant feature response, and the RGB depth significant feature response to obtain a plurality of countermeasure features, so as to fuse the plurality of countermeasure features into a multi-level cross-modal fusion feature of the layered RGB modal feature and the layered depth modal feature includes:
in the second stage, performing countermeasure combination on the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response by using multiplication to obtain a plurality of countermeasure features;
and respectively distributing different weights to the plurality of antagonistic features by using parallel pooling operation, and performing convolution and addition fusion operation on the plurality of antagonistic features distributed with different weights to obtain the multilevel trans-modal fusion feature.
In this embodiment, the adversarial combination between cross-modal features is performed using multiplication; this combination enhances the commonly salient regions and suppresses the inconsistency between the cross-modal features. At the same time, parallel pooling operations assign different weights to the combination results, adaptively highlighting important salient feature expressions and suppressing irrelevant background features. The cross-modal fusion features generated in this way are influenced not only by a modality's own features but also by the opposing modality's features.
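The second stage could be sketched as follows; the exact pairing of the three responses in the multiplicative combinations and the weighting head are assumptions inferred from the description:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdversarialCombination(nn.Module):
    """Illustrative second stage: element-wise products act as the adversarial
    combinations of the cross-modal responses, parallel global average / max
    pooling assigns channel weights to each combination, and the re-weighted
    features are fused by convolution and addition."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.weight_head = nn.Sequential(nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid())
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def _pooled_weight(self, a):
        # parallel pooling: global average and global max, then a shared 1x1 conv + Sigmoid
        return self.weight_head(F.adaptive_avg_pool2d(a, 1) + F.adaptive_max_pool2d(a, 1))

    def forward(self, Fr_i, Fd_i, Frd_i):
        combos = [Fr_i * Fd_i, Fr_i * Frd_i, Fd_i * Frd_i]      # multiplicative confrontation
        weighted = [a * self._pooled_weight(a) for a in combos]  # adaptively re-weighted
        return self.out_conv(sum(weighted))                      # cross-modal fusion feature cm_i

# usage sketch
mfm_stage2 = AdversarialCombination(256)
cm_i = mfm_stage2(torch.randn(1, 256, 28, 28), torch.randn(1, 256, 28, 28), torch.randn(1, 256, 28, 28))
```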
The process of the multi-stage fusion strategy described above is expressed by a set of formulas (shown only as images in the published document and not reproduced here), where the notation is as follows: Wk1 and Wk2 denote convolution operations; x is the input of SA; δ and σ denote the ReLU activation function and the Sigmoid function, respectively; fri (i = 2, 3, 4, 5) and fdi (i = 2, 3, 4, 5) denote the paired side outputs of the RGB and Depth backbones; ⊗ and ⊕ denote element-wise multiplication and element-wise addition, respectively; a 1x1 convolution uniformly converts the input to 256 channels; Conv, Conv1 and Conv2 denote convolution operations with their respective per-layer convolution parameters; Cat denotes the splicing (concatenation) operation; and AvgPooling and MaxPooling denote global average pooling and global max pooling operations, respectively.
Further, a spatial attention module is used to highlight salient objects in the RGB and depth modal features and suppress cluttered background information; the enhanced RGB modal features are used to guide the poorer-quality depth modal features so that useful features are screened out of them, instead of simply discarding a low-quality depth map or using it directly while ignoring its quality, which locates the salient regions in the depth modal features more accurately; the adversarial combination between cross-modal features by multiplication enhances the commonly salient regions and suppresses cross-modal inconsistency; and the parallel pooling operations that assign different weights to the combination results adaptively highlight important salient feature expressions and suppress irrelevant background features.
Further, not shown in the drawings, a third embodiment of the RGB-D image saliency target detection method according to the present invention is proposed based on the first embodiment shown in fig. 2. In the present embodiment, step S30 includes:
performing high-low level fusion on the cross-modal fusion features of each level in the multi-level cross-modal fusion features by a top-down path and a bottom-up path to obtain the layered significant fusion features;
performing upsampling according to the size of the RGB image to obtain a plurality of target sub-images corresponding to the layered significant fusion features;
and splicing the multiple target sub-images and performing convolution channel number conversion to obtain the salient target image.
In this embodiment, the invention realizes the multi-scale fusion of the multi-level cross-modal fusion features by designing a bidirectional multi-scale fusion decoder (BMD). Different levels of a deep network contain different feature responses to salient objects. In particular, the deep layers of the network provide discriminative and integrated semantic information, while the lower layers contain more local detail information about salient objects. In order to locate and detect the global and local information of salient objects effectively, a multi-scale fusion method based on a bidirectional structure is designed to aggregate the multi-scale, comprehensive feature information of each level along top-down and bottom-up paths. First, the cross-modal fusion features cm_i (i = 2,3,4,5) output by the MFM are fed to the FF modules along the top-down path. Then, for the MF module of each level, the fused feature F_j (j = 2,3,4,5) output by the FF module of the corresponding level and the cross-modal fusion feature cm_i (i = 2,3,4,5) output by the MFM are integrated in a bottom-up manner. In this way, the multi-scale cross-modal fusion features of the high and low levels are fused, and the overall structure and the corresponding detail information of the object become more prominent.
Finally, the fused features S_k (k = 2,3,4,5) from the MF modules are first up-sampled to the original size to obtain P_i (i = 2,3,4,5), which are then spliced together, and the final saliency map S_map is obtained by a 3 × 3 convolution that converts the number of channels. The process is represented as:

P_i = up(S_k);

S_map = Conv(Cat(P_2, P_3, P_4, P_5)).

wherein Conv represents a 3 × 3 convolution, Cat represents the splicing operation, and up represents an up-sampling operation using bilinear interpolation.
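A minimal sketch of this prediction head follows, assuming 64-channel MF outputs from four levels and a single-channel output map; the channel counts and the final Sigmoid are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyHead(nn.Module):
    def __init__(self, in_channels=64, levels=4):
        super().__init__()
        # A 3x3 convolution converts the concatenated maps to a single-channel saliency map.
        self.conv = nn.Conv2d(in_channels * levels, 1, kernel_size=3, padding=1)

    def forward(self, s_list, image_size):
        # Bilinearly up-sample every level's fused feature S_k to the RGB image size (P_i = up(S_k)).
        p_list = [F.interpolate(s, size=image_size, mode="bilinear", align_corners=False)
                  for s in s_list]
        # S_map = Conv(Cat(P_2, P_3, P_4, P_5))
        return torch.sigmoid(self.conv(torch.cat(p_list, dim=1)))

# Usage: s_map = SaliencyHead()([s2, s3, s4, s5], image_size=(352, 352))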
Further, in this embodiment, the step of performing high-low level fusion on the cross-modal fusion features of each level in the multi-level cross-modal fusion features through a top-down and bottom-up bidirectional path to obtain the hierarchical significant fusion feature includes:
in a feature fusion stage, performing first convolution operation on cross-modal fusion features of each level in the multi-level cross-modal fusion features to obtain a multi-level first convolution result, and fusing the first convolution result of each level in the multi-level first convolution result with a first convolution result of a corresponding high level respectively in a top-down path to obtain forward fusion results corresponding to each level and different scales;
in the multi-scale fusion stage, performing second convolution operation on the cross-modal fusion features of each level to obtain a multi-level second convolution result, and fusing the second convolution result of each level in the multi-level second convolution result with a second convolution result corresponding to a low level and a forward fusion result corresponding to the same level by a path from bottom to top to obtain reverse fusion results corresponding to each level and different scales;
and passing the reverse fusion results corresponding to each level and different scales through a preset module connected with a channel attention mechanism and a space attention mechanism in parallel to obtain the layered significant fusion features.
In the present embodiment, the first convolution operation refers to a 3 × 3 convolution operation and the second convolution operation refers to a 1 × 1 convolution operation. The feature fusion stage refers to the FF module, and the multi-scale fusion stage refers to the MF module. The "forward" in the forward fusion results refers to the top-down path direction, and the "reverse" in the reverse fusion results refers to the bottom-up path direction. The preset module refers to a Spatial and Channel Squeeze & Excitation (scSE) block, which connects a channel attention mechanism and a spatial attention mechanism in parallel, as sketched below.
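The scSE block itself is a published attention module (concurrent spatial and channel Squeeze & Excitation); the sketch below shows its usual form, with an arbitrarily chosen reduction ratio, and is not taken from the patent text.

import torch
import torch.nn as nn

class SCSEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel attention branch: squeeze spatially, excite per channel.
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention branch: squeeze channels, excite per position.
        self.sse = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        # The two excitations are applied in parallel and combined by addition.
        return x * self.cse(x) + x * self.sse(x)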
As shown in fig. 6, fig. 6 is a detailed view of the bidirectional multi-scale decoder. In the FF module in the figure, the cross-modal fusion feature cm_i is first converted into 64 channels by a 3×3 convolution, simply fused with the high-level input feature F_{j+1}, and then added to F_{j+1} to generate the current-level FF fused output F_j. Meanwhile, in the MF module, the cross-modal fusion feature cm_i is converted into 64 channels by a 1×1 convolution and additively fused with the bottom-up low-level feature S_{k-1} and the same-level FF output F_j; the details of the salient objects are then further refined by the scSE module to obtain the output of the MF module. The specific process is formulated as follows:
(The equations of the FF and MF fusion process are given as images in the original publication and cannot be reproduced from the text; the symbols they use are defined below.)
In these equations, ⊗ and ⊕ denote element-wise multiplication and element-wise addition, respectively; up refers to 2× up-sampling by bilinear interpolation; the 3×3 convolution with a PReLU activation function and the 1×1 convolution with a ReLU activation function uniformly convert the number of output channels of cm_i (i = 2,3,4,5) from 256 to 64; W_i and W_{j+1} are convolution operations without activation functions; a further convolution term and W_k denote a 3×3 convolution with ReLU activation and a 1×1 convolution, respectively; W_{k-1} is a 3×3 convolution with stride 2 that down-samples the input; and the scSE module used can further capture meaningful detail information.
Further, in the present embodiment, step S10 includes:
the method comprises the steps of obtaining an RGB image to be detected and a depth image in registration with the RGB image, inputting the RGB image and the depth image into a preset double-current deep convolutional network, and extracting layered RGB modal characteristics of the RGB image and layered depth modal characteristics of the depth image in a layered mode.
In this embodiment, the computer acquires an RGB image to be detected and a depth image corresponding to the RGB image, and inputs the RGB image and the depth image into a double-stream deep convolutional network, so as to extract layered RGB modal characteristics of the RGB image and layered depth modal characteristics of the depth image in a layered manner.
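As a concrete illustration of this step, the sketch below builds a dual-stream backbone from two VGG-16 feature extractors and taps four hierarchical side outputs per stream. The choice of VGG-16, the tap positions and the replication of the single-channel depth map to three channels are assumptions; the patent only specifies a preset dual-stream deep convolutional network.

import torch
import torch.nn as nn
from torchvision.models import vgg16

class DualStreamBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Two independent streams: one for the RGB image, one for the registered depth map.
        # Pretrained ImageNet weights could be loaded here in practice.
        self.rgb_stream = vgg16().features
        self.dep_stream = vgg16().features
        self.tap_layers = {8, 15, 22, 29}   # ends of conv2..conv5 blocks -> 4 hierarchy levels

    def extract(self, stream, x):
        feats = []
        for idx, layer in enumerate(stream):
            x = layer(x)
            if idx in self.tap_layers:
                feats.append(x)              # layered (hierarchical) modality features
        return feats

    def forward(self, rgb, depth):
        # Replicate a 1-channel depth map to 3 channels so it fits a standard backbone.
        depth3 = depth.repeat(1, 3, 1, 1) if depth.shape[1] == 1 else depth
        return self.extract(self.rgb_stream, rgb), self.extract(self.dep_stream, depth3)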
Furthermore, by designing a multi-scale fusion mode based on a bidirectional structure, the multi-scale, comprehensive feature information of each level is aggregated along top-down and bottom-up paths, so that the global and local information of salient objects can be located and detected more effectively. By adding the bottom-up path and fusing its features with the corresponding-level top-down features in the multi-scale fusion module, the capability of detecting object details is improved and the problem that high-level semantic information is diluted during top-down propagation is alleviated. By combining the multi-level high-level semantic information with the low-level detail information, the multi-scale detection performance of the algorithm is improved.
The invention also provides an RGB-D image saliency target detection device, which comprises the following modules:
the modal characteristic acquisition module is used for acquiring the layered RGB modal characteristics of the RGB image and the layered depth modal characteristics of the depth image;
a fusion feature obtaining module, configured to perform multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance policy, to obtain multi-stage cross-modal fusion features of the layered RGB modal features and the layered depth modal features;
and the target image acquisition module is used for carrying out two-way fusion on the multi-level cross-modal fusion features based on a two-way fusion structure to obtain different scales of layered significant fusion features, and carrying out multi-scale fusion on the layered significant fusion features to obtain significant target images corresponding to the RGB images and the depth images.
The invention also provides RGB-D image saliency target detection equipment.
The RGB-D image saliency object detection apparatus comprises a processor, a memory and an RGB-D image saliency object detection program stored on the memory and executable on the processor, wherein the RGB-D image saliency object detection program, when executed by the processor, implements the steps of the RGB-D image saliency object detection method as described above.
For the specific implementation when the RGB-D image saliency target detection program is executed, reference may be made to the embodiments of the RGB-D image saliency target detection method of the present invention, which are not described herein again.
The invention also provides a computer readable storage medium.
The computer readable storage medium of the present invention stores thereon an RGB-D image saliency object detection program that, when executed by a processor, implements the steps of the RGB-D image saliency object detection method as described above.
For the specific implementation when the RGB-D image saliency target detection program is executed, reference may be made to the embodiments of the RGB-D image saliency target detection method of the present invention, which are not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A RGB-D image saliency target detection method is characterized in that the RGB-D image comprises an RGB image to be detected and a depth image registered with the RGB image, and the RGB-D image saliency target detection method comprises the following steps:
acquiring layered RGB modal characteristics of the RGB image and layered depth modal characteristics of the depth image;
performing multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance strategy to obtain multi-stage cross-modal fusion features of the layered RGB modal features and the layered depth modal features;
and carrying out two-way fusion on the multi-level cross-modal fusion features based on a two-way fusion structure to obtain layered significant fusion features with different scales, and carrying out multi-scale fusion on the layered significant fusion features to obtain significant target images corresponding to the RGB images and the depth images.
2. The RGB-D image saliency target detection method of claim 1, wherein the step of performing multi-stage cross-modal feature fusion of the layered RGB modal features with the layered depth modal features based on an attention mechanism and cross-modal guidance strategy to obtain a multi-level cross-modal fusion feature of the layered RGB modal features with the layered depth modal features comprises:
in the first stage, respectively using a spatial attention mechanism to perform feature screening on the layered RGB modal features and the layered depth modal features to obtain RGB significant feature responses of the layered RGB modal features and depth significant feature responses of the layered depth modal features;
based on a cross-modal guidance strategy, guiding the layered depth modal characteristics to perform characteristic re-screening by using the RGB significant characteristic response so as to obtain RGB depth significant characteristic response;
in the second stage, performing antagonistic combination on the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response to obtain a plurality of antagonistic features, and fusing the plurality of antagonistic features into the multi-level cross-modal fusion features of the layered RGB modal features and the layered depth modal features.
3. The RGB-D image saliency target detection method of claim 2, wherein the step of using the RGB saliency feature response to guide the feature re-screening of the hierarchical depth modality features to obtain RGB depth saliency feature response based on a cross-modality guidance strategy comprises:
performing parallel asymmetric convolution on the hierarchical depth modal features to roughly locate a preliminary significant region in the hierarchical depth modal features;
performing fusion guidance on the preliminary salient region by using the RGB feature responses based on the cross-modal guidance strategy so as to locate a target salient region in the preliminary salient region;
and acquiring a feature weight corresponding to the layered depth modal feature of the positioned target significant region, and performing weighted fusion output on the layered depth modal feature of the positioned target significant region and the layered depth modal feature according to the feature weight to serve as the RGB depth significant feature response.
4. The RGB-D image saliency target detection method of claim 2, wherein the step of, in the second stage, performing antagonistic combination on the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response to obtain a plurality of antagonistic features, and fusing the plurality of antagonistic features into the multi-level cross-modal fusion features of the layered RGB modal features and the layered depth modal features, comprises:
in the second stage, performing antagonistic combination on the RGB significant feature response, the depth significant feature response and the RGB depth significant feature response by using multiplication to obtain a plurality of antagonistic features;
and respectively assigning different weights to the plurality of antagonistic features by using parallel pooling operations, and performing convolution and addition fusion on the antagonistic features to which the different weights have been assigned, so as to obtain the multi-level cross-modal fusion features.
5. The RGB-D image saliency target detection method according to claim 1, wherein the step of performing bi-directional fusion on the multi-level cross-modal fusion features based on a bi-directional fusion structure to obtain hierarchical saliency fusion features of different scales, and performing multi-scale fusion on the hierarchical saliency fusion features to obtain saliency target images corresponding to the RGB image and the depth image includes:
performing high-low level fusion on the cross-modal fusion features of each level in the multi-level cross-modal fusion features by a top-down path and a bottom-up path to obtain the layered significant fusion features;
performing upsampling according to the size of the RGB image to obtain a plurality of target sub-images corresponding to the layered significant fusion features;
and splicing the multiple target sub-images and performing convolution channel number conversion to obtain the salient target image.
6. The RGB-D image saliency target detection method according to claim 5, wherein the step of performing high-low level fusion of the cross-modal fusion features of each level of the multi-level cross-modal fusion features in a top-down and bottom-up bi-directional path to obtain the hierarchical saliency fusion features comprises:
in a feature fusion stage, performing first convolution operation on cross-modal fusion features of each level in the multi-level cross-modal fusion features to obtain a multi-level first convolution result, and fusing the first convolution result of each level in the multi-level first convolution result with a first convolution result of a corresponding high level respectively in a top-down path to obtain forward fusion results corresponding to each level and different scales;
in the multi-scale fusion stage, performing second convolution operation on the cross-modal fusion features of each level to obtain a multi-level second convolution result, and fusing the second convolution result of each level in the multi-level second convolution result with a second convolution result corresponding to a low level and a forward fusion result corresponding to the same level by a path from bottom to top to obtain reverse fusion results corresponding to each level and different scales;
and passing the reverse fusion results corresponding to each level and different scales through a preset module connected with a channel attention mechanism and a space attention mechanism in parallel to obtain the layered significant fusion features.
7. The RGB-D image saliency target detection method of claim 1, wherein the step of acquiring hierarchical RGB modality features of the RGB image and hierarchical depth modality features of the depth image comprises:
the method comprises the steps of obtaining an RGB image to be detected and a depth image in registration with the RGB image, inputting the RGB image and the depth image into a preset double-current deep convolutional network, and extracting layered RGB modal characteristics of the RGB image and layered depth modal characteristics of the depth image in a layered mode.
8. An RGB-D image saliency target detection apparatus characterized by comprising:
the modal characteristic acquisition module is used for acquiring the layered RGB modal characteristics of the RGB image and the layered depth modal characteristics of the depth image;
a fusion feature obtaining module, configured to perform multi-stage cross-modal feature fusion on the layered RGB modal features and the layered depth modal features based on an attention mechanism and a cross-modal guidance policy, to obtain multi-stage cross-modal fusion features of the layered RGB modal features and the layered depth modal features;
and the target image acquisition module is used for carrying out two-way fusion on the multi-level cross-modal fusion features based on a two-way fusion structure to obtain different scales of layered significant fusion features, and carrying out multi-scale fusion on the layered significant fusion features to obtain significant target images corresponding to the RGB images and the depth images.
9. An RGB-D image saliency target detection apparatus characterized by comprising: a memory, a processor and an RGB-D image saliency object detection program stored on the memory and executable on the processor, the RGB-D image saliency object detection program when executed by the processor implementing the steps of the RGB-D image saliency object detection method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon an RGB-D image saliency object detection program that, when executed by a processor, implements the steps of the RGB-D image saliency object detection method according to any one of claims 1 to 7.
CN202010637797.1A 2020-07-02 2020-07-02 RGB-D image saliency target detection method, device, equipment and storage medium Pending CN111967477A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010637797.1A CN111967477A (en) 2020-07-02 2020-07-02 RGB-D image saliency target detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111967477A true CN111967477A (en) 2020-11-20

Family

ID=73360962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010637797.1A Pending CN111967477A (en) 2020-07-02 2020-07-02 RGB-D image saliency target detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111967477A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200160559A1 (en) * 2018-11-16 2020-05-21 Uatc, Llc Multi-Task Multi-Sensor Fusion for Three-Dimensional Object Detection
CN110555434A (en) * 2019-09-03 2019-12-10 浙江科技学院 method for detecting visual saliency of three-dimensional image through local contrast and global guidance
CN111242181A (en) * 2020-01-03 2020-06-05 大连民族大学 RGB-D salient object detector based on image semantics and details

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
T. D'Orazio et al.: "Recent trends in gesture recognition: how depth data has improved classical approaches", Image and Vision Computing, vol. 52, 1 August 2016 (2016-08-01) *
Liu Zhengyi; Duan Quntao; Shi Song; Zhao Peng: "RGB-D image saliency detection based on multi-modal feature fusion supervision", Journal of Electronics & Information Technology, no. 04, 15 April 2020 (2020-04-15) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113077491A (en) * 2021-04-02 2021-07-06 安徽大学 RGBT target tracking method based on cross-modal sharing and specific representation form
CN113222003A (en) * 2021-05-08 2021-08-06 北方工业大学 RGB-D-based indoor scene pixel-by-pixel semantic classifier construction method and system
CN113222003B (en) * 2021-05-08 2023-08-01 北方工业大学 Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D
CN113298154A (en) * 2021-05-27 2021-08-24 安徽大学 RGB-D image salient target detection method
CN113298154B (en) * 2021-05-27 2022-11-11 安徽大学 RGB-D image salient object detection method
CN113538347A (en) * 2021-06-29 2021-10-22 中国电子科技集团公司电子科学研究院 Image detection method and system based on efficient bidirectional path aggregation attention network
CN113538347B (en) * 2021-06-29 2023-10-27 中国电子科技集团公司电子科学研究院 Image detection method and system based on efficient bidirectional path aggregation attention network
CN113902783A (en) * 2021-11-19 2022-01-07 东北大学 Three-modal image fused saliency target detection system and method
CN113902783B (en) * 2021-11-19 2024-04-30 东北大学 Three-mode image fused saliency target detection system and method
CN114170174A (en) * 2021-12-02 2022-03-11 沈阳工业大学 CLANet steel rail surface defect detection system and method based on RGB-D image
CN114170174B (en) * 2021-12-02 2024-01-23 沈阳工业大学 CLANet steel rail surface defect detection system and method based on RGB-D image

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination