CN113222003B - Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D - Google Patents
- Publication number
- CN113222003B (application CN202110498856.6A)
- Authority
- CN
- China
- Prior art keywords
- rgb
- pixel
- depth
- module
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
- G06F18/2414—Smoothing the distance, e.g. radial basis function networks [RBFN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to an RGB-D-based construction method and system for an indoor scene pixel-by-pixel semantic classifier. The method comprises the following steps. S1: acquire images of an indoor scene to obtain RGB data and Depth data. S2: define the object categories in the images and label each pixel with its category. S3: input the RGB data and Depth data into respective feature extraction modules, and simultaneously input the RGB data into a depth estimation module that supervises the RGB feature extraction process, obtaining the corresponding features f_rgb and f_depth. S4: input f_rgb and f_depth into a scale perception module, obtaining the scale-aware features f^s_rgb and f^s_depth. S5: input f^s_rgb and f^s_depth respectively into self-attention mechanism modules, obtaining the features f^a_rgb and f^a_depth. S6: input f^a_rgb and f^a_depth into a modality-adaptive module, compute the modality-adaptive weights, and fuse f^a_rgb and f^a_depth with those weights to obtain a pixel-by-pixel semantic classification of the image. The method can be applied to indoor scene understanding; the pixel-by-pixel semantic information of the collected RGB-D images can effectively support indoor automatic navigation and other applications.
Description
Technical Field
The invention relates to the fields of computer vision and machine learning, in particular to an RGB-D-based construction method and system for an indoor scene pixel-by-pixel semantic classifier.
Background
In recent years, research results on indoor scene semantic understanding have been widely applied in different fields, including emergency rehearsal in the security field and the positioning, obstacle avoidance, and target search functions of intelligent robots. The topic has become a hot research problem in fields such as virtual reality and augmented reality, and brings convenience to people's daily life and work. However, indoor scenes pose great challenges for semantic understanding owing to conditions such as dim lighting and mutually occluding objects, and the problem has long been a fundamental and classical one in computer graphics, virtual reality, computer vision, and related fields.
Scene semantic segmentation is a pixel-level image classification task: given an image I of size w x h, the goal of a scene semantic segmentation algorithm is to output a result map of the same size w x h, in which each pixel x_{i,j} corresponds to the pixel at position (i, j) in I and the value of x_{i,j} is the label class of that pixel. Image-based semantic segmentation is a necessary prerequisite for semantic understanding of indoor scenes. Semantic segmentation of images has been studied extensively and intensively, in particular through methods based on hand-crafted features and methods based on deep learning. Hand-crafted-feature methods extract robust artificial features from the input image data; depending on how the features are used, there are generally several types. Threshold-based image segmentation applies thresholding (the mean method, histogram techniques, and the like) to the input image, but handles homogeneous regions in the image poorly. Watershed-based methods can obtain good segmentation results by applying the watershed algorithm to the gradient image, but they require the gradient map to be computed in advance and need relatively complex preprocessing. With the development of deep learning, a large number of algorithms based on deep neural networks have been proposed. For example, the FCN algorithm, by replacing the last layer of the classical AlexNet network with a 1 x 1 convolution, achieves pixel-by-pixel classification of the input image.
The vast majority of subsequent work on semantic segmentation of images builds on FCN. However, these algorithms cannot effectively extract RGB features and Depth features and exploit them to achieve better indoor scene semantic segmentation.
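The FCN idea described above — replacing the final classification layer with a 1 x 1 convolution — amounts to applying one linear classifier independently at every spatial position of the feature map. A minimal NumPy sketch with random weights (all sizes and weights here are hypothetical, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
c, h, w, num_classes = 16, 8, 8, 5              # hypothetical sizes

features = rng.standard_normal((c, h, w))        # stand-in for an encoder output
weights = rng.standard_normal((num_classes, c))  # 1x1 conv kernel = per-pixel linear map
bias = np.zeros(num_classes)

# A 1x1 convolution is a matrix multiply over the channel axis at each pixel.
logits = np.einsum('kc,chw->khw', weights, features) + bias[:, None, None]
labels = logits.argmax(axis=0)  # pixel-by-pixel class map, same h x w as the features

assert logits.shape == (num_classes, h, w)
assert labels.shape == (h, w)
```

Because the 1 x 1 convolution carries no spatial extent, the output retains the spatial resolution of the input features, which is exactly what makes dense pixel-wise prediction possible.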
Disclosure of Invention
In order to solve the technical problems, the invention provides an indoor scene pixel-by-pixel semantic classifier construction method and system based on RGB-D.
The technical scheme of the invention is as follows: an indoor scene pixel-by-pixel semantic classifier construction method based on RGB-D comprises the following steps:
step S1: image acquisition is carried out on an indoor scene, and RGB data and Depth data are obtained;
step S2: defining the object categories in the image, and labeling each pixel with its category;
step S3: inputting the RGB data and Depth data into respective feature extraction modules, and simultaneously inputting the RGB data into a depth estimation module which supervises the RGB feature extraction process, to obtain the corresponding features f_rgb and f_depth;
step S4: inputting f_rgb and f_depth into a scale perception module to select feature information at appropriate scales, obtaining the scale-aware features f^s_rgb and f^s_depth;
step S5: inputting f^s_rgb and f^s_depth respectively into self-attention mechanism modules to expand the receptive field, obtaining the features f^a_rgb and f^a_depth;
step S6: inputting f^a_rgb and f^a_depth into a modality-adaptive module, computing the modality-adaptive weights, and fusing f^a_rgb and f^a_depth with those weights to obtain a pixel-by-pixel semantic classification of the image.
Compared with the prior art, the invention has the following advantages:
1. The invention provides a construction method for an indoor scene pixel-by-pixel classifier based on RGB-D data. Exploiting the interaction between the RGB and Depth data collected in the indoor scene, it uses Depth as a ground-truth depth signal for the RGB data, prompting the RGB branch to extract more effective and targeted features. Several modules are designed to fuse features from the RGB data with features from the Depth data, and finally an organic way of combining the score maps of the different modalities is proposed, providing an effective solution for pixel-by-pixel classification of indoor scenes.
2. The invention provides an end-to-end network architecture in which the network modules used for feature extraction may be generic or designed later, without significant impact on the modules presented here. By integrating the modules and optimizing end to end, the invention can serve applications in indoor scenes; in particular, for tasks that require fine-grained semantic understanding, effective semantic segmentation of the currently captured scene can substantially help indoor robot navigation and other related applications.
Drawings
FIG. 1 is a flow chart of a construction method of an indoor scene pixel-by-pixel semantic classifier based on RGB-D in an embodiment of the invention;
fig. 2 is a flow chart of step S4 of the RGB-D-based indoor scene pixel-by-pixel semantic classifier construction method according to an embodiment of the present invention: inputting f_rgb and f_depth into the scale perception module to select feature information at appropriate scales and obtain the scale-aware features f^s_rgb and f^s_depth;
fig. 3 is a flow chart of step S6 of the RGB-D-based indoor scene pixel-by-pixel semantic classifier construction method according to an embodiment of the present invention: inputting the features f^a_rgb and f^a_depth into the modality-adaptive module, computing the modality-adaptive weights, and fusing f^a_rgb and f^a_depth with those weights to obtain a pixel-by-pixel semantic classification of the image;
fig. 4 is a structural block diagram of an indoor scene pixel-by-pixel semantic classifier construction system based on RGB-D according to an embodiment of the present invention.
Detailed Description
The invention provides an RGB-D-based construction method for an indoor scene pixel-by-pixel semantic classifier. It can serve applications in indoor scenes, especially tasks requiring fine-grained semantic understanding: by effectively performing semantic segmentation on the currently captured scene, it can substantially help indoor robot navigation and other related applications.
The present invention will be further described in detail below with reference to the accompanying drawings by way of specific embodiments in order to make the objects, technical solutions and advantages of the present invention more apparent.
The execution environment of the embodiment of the invention is a computer with a 4-core 4.0 GHz central processing unit and 128 GB of memory. To accelerate the training and inference of the semantic segmentation network model, eight NVIDIA GeForce GTX 1080 Ti GPU graphics cards are used for accelerated computation. The construction program of the indoor scene pixel-by-pixel classifier, based on a convolutional neural network operating on RGB-D input data, is written in languages such as Python and C++. Provided the computer's memory and video memory allow, the invention can also run in other execution environments, which are not described here.
Example 1
As shown in fig. 1, the construction method of the indoor scene pixel-by-pixel semantic classifier based on RGB-D provided by the embodiment of the invention comprises the following steps:
step S1: image acquisition is carried out on an indoor scene, and RGB data and Depth data are obtained;
step S2: defining the object categories in the image, and labeling each pixel with its category;
step S3: inputting the RGB data and Depth data into respective feature extraction modules, and simultaneously inputting the RGB data into a depth estimation module which supervises the RGB feature extraction process, to obtain the corresponding features f_rgb and f_depth;
step S4: inputting f_rgb and f_depth into the scale perception module to select feature information at appropriate scales, obtaining the scale-aware features f^s_rgb and f^s_depth;
step S5: inputting f^s_rgb and f^s_depth respectively into self-attention mechanism modules to expand the receptive field, obtaining the features f^a_rgb and f^a_depth;
step S6: inputting f^a_rgb and f^a_depth into the modality-adaptive module, computing the modality-adaptive weights, and fusing f^a_rgb and f^a_depth with those weights to obtain a pixel-by-pixel semantic classification of the image.
In one embodiment, step S1 described above: image acquisition is carried out on an indoor scene to obtain RGB data and Depth data, and the method specifically comprises the following steps:
and acquiring data of an indoor scene by using a consumer-level Depth camera such as Microsoft Kinect and the like to acquire RGB data and Depth data.
And converting the point coordinates of the RGB and Depth images into a camera coordinate system, converting by utilizing the infrared and the optical heart external parameters of the RGB camera, and finding the relation between the points on the RGB image and the Depth image by utilizing the method so as to finish data alignment.
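The alignment above can be sketched as a standard pinhole reprojection: back-project a depth pixel into 3D with the depth camera intrinsics, transform it into the RGB camera frame with the extrinsics, and project it with the RGB intrinsics. The intrinsic matrices and the 2.5 cm baseline below are hypothetical placeholder values, not calibration data from the patent:

```python
import numpy as np

# Hypothetical pinhole intrinsics for the depth (infrared) and RGB cameras,
# and extrinsics (R, t) taking points from the depth frame to the RGB frame.
K_d = np.array([[580.0, 0, 320.0], [0, 580.0, 240.0], [0, 0, 1.0]])
K_rgb = np.array([[525.0, 0, 319.5], [0, 525.0, 239.5], [0, 0, 1.0]])
R, t = np.eye(3), np.array([0.025, 0.0, 0.0])  # e.g. a 2.5 cm horizontal baseline

def depth_pixel_to_rgb(u, v, d):
    """Map a depth-image pixel (u, v) with metric depth d to RGB-image coordinates."""
    p_d = d * np.linalg.inv(K_d) @ np.array([u, v, 1.0])  # back-project into the depth frame
    p_rgb = R @ p_d + t                                   # transform into the RGB camera frame
    uvw = K_rgb @ p_rgb                                   # project with the RGB intrinsics
    return uvw[:2] / uvw[2]

u_rgb, v_rgb = depth_pixel_to_rgb(320.0, 240.0, 2.0)
```

Applying this mapping to every valid depth pixel yields, for each RGB pixel, a corresponding depth value, which is what the labeling and training steps below assume.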
In one embodiment, step S2 above: defining the object categories in the image and labeling each pixel with its category, specifically comprises the following steps:
The total set of object categories in the current indoor scene is counted; the number of these categories is the number of possible classes for each pixel in the indoor scene. The pixels of the RGB and Depth images captured in the indoor scene are labeled manually. Since RGB and Depth describe the same scene, the label values of RGB and Depth are identical.
In one embodiment, the step S3: inputting the RGB data and Depth data into respective feature extraction modules while inputting the RGB data into a depth estimation module that supervises the RGB feature extraction process, to obtain the corresponding features f_rgb and f_depth, specifically comprises:
The RGB data and Depth data are input into an RGB feature extraction module and a Depth feature extraction module respectively for feature extraction. While the RGB features are extracted, the RGB data are simultaneously input into a depth estimation module, which supervises the RGB feature extraction process; the corresponding features f_rgb and f_depth are obtained.
In this step, a classical convolutional neural network structure such as ResNet extracts features from the input RGB and Depth data. In the embodiment of the invention, ResNet-18 is used as the encoder, with a B1-B2-B3-B4 stage structure. Since RGB and Depth are image data of two different modalities, two separate feature extraction modules are needed. In the RGB feature extraction branch, depth estimation serves as an additional supervision signal; the depth estimation module of the embodiment adopts a B1-B2-B3-B4-B4-B3-B2-B1 structure. The downsampling layers are replaced with atrous (dilated) convolutions to prevent the resolution of the extracted features from becoming too small; in general, the resolution after feature extraction is required to be no less than 1/8 of the original input image. Finally, this step extracts the features f_rgb and f_depth corresponding to the input RGB data and Depth data.
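As a quick sanity check on the 1/8-resolution requirement above: a standard ResNet-18 encoder downsamples by 32x overall, while keeping the stride-1 dilated variant in the last two stages holds the output stride at 8. The stage strides below follow the standard ResNet-18 layout (stem conv, max-pool, then stages B1-B4); which stages are dilated is an assumption based on the description:

```python
import math

input_size = 480                       # hypothetical input resolution
standard_strides = [2, 2, 1, 2, 2, 2]  # stem conv, maxpool, B1, B2, B3, B4
dilated_strides  = [2, 2, 1, 2, 1, 1]  # B3/B4 stride-2 replaced by dilated convs

def output_size(size, strides):
    """Spatial size after applying each stage's stride in sequence."""
    for s in strides:
        size = math.ceil(size / s)
    return size

std_out = output_size(input_size, standard_strides)  # 1/32 of the input
dil_out = output_size(input_size, dilated_strides)   # 1/8 of the input
```

Dilation enlarges the receptive field of each convolution without further downsampling, which is why it can stand in for stride-2 layers while satisfying the resolution constraint.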
As shown in FIG. 2, in one embodiment, the above step S4: inputting f_rgb and f_depth into the scale perception module to select feature information at appropriate scales and obtain the scale-aware features f^s_rgb and f^s_depth, specifically comprises:
step S41: extracting multi-scale features from f_rgb and f_depth with a feature pyramid, obtaining {f^k_rgb} and {f^k_depth}, k = 1, ..., 4;
In the embodiment of the invention, an atrous spatial pyramid pooling (ASPP) module extracts the multi-scale features: atrous convolutions with dilation rates {6, 12, 18, 24} extract feature information at 4 scales for RGB and for Depth respectively, yielding {f^k_rgb} and {f^k_depth}.
step S42: fusing {f^k_rgb} and {f^k_depth} into a fusion feature f_cat, and obtaining a multichannel scale weight map through a convolution network;
In this step, a 3-layer network with the structure cat-conv-sm (concatenation layer - convolution layer - softmax layer) is constructed. The cat layer concatenates the multi-scale features {f^k_rgb} of the RGB modality and {f^k_depth} of the Depth modality into the fusion feature f_cat. Because the feature pyramid used here has 4 scales — 4 scales for the RGB features and 4 for the Depth features — the 1 x 1 convolution of the conv layer yields an 8-channel scale weight map.
step S43: performing feature selection with the multichannel scale weight map, obtaining the scale-aware features f^s_rgb and f^s_depth respectively;
In this step, the 8-channel weight map is fed into the sm layer for weight normalization. The first 4 channels are used to select the RGB scale features and the last 4 channels to select the Depth scale features; after this 3-layer network, the scale-aware features f^s_rgb and f^s_depth are obtained as shown in formulas (1) and (2):
f^s_rgb = sum_{k=1}^{4} w^k_rgb ⊙ f^k_rgb (1)
f^s_depth = sum_{k=1}^{4} w^k_depth ⊙ f^k_depth (2)
where w^k_rgb and w^k_depth are the 4-channel weight maps that select the RGB scale features and the Depth scale features respectively, and ⊙ denotes the Hadamard product.
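The cat-conv-sm selection described for step S4 can be sketched in NumPy. The 1 x 1 convolution weights here are random stand-ins for the trained layer, and all tensor sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
c, h, w, n_scales = 8, 6, 6, 4  # hypothetical sizes; 4 ASPP scales per modality

# Multi-scale features for each modality (stand-ins for the ASPP outputs).
f_rgb = rng.standard_normal((n_scales, c, h, w))
f_depth = rng.standard_normal((n_scales, c, h, w))

# cat -> 1x1 conv -> softmax: an 8-channel scale weight map, normalized per pixel.
fused = np.concatenate([f_rgb, f_depth]).reshape(2 * n_scales * c, h, w)
conv_w = rng.standard_normal((2 * n_scales, 2 * n_scales * c)) * 0.1  # random 1x1 conv
logits = np.einsum('oc,chw->ohw', conv_w, fused)
weights = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)  # softmax over 8 channels

# First 4 channels select RGB scales, last 4 select Depth scales;
# the Hadamard-weighted sums have the form of formulas (1)-(2).
w_rgb, w_depth = weights[:n_scales], weights[n_scales:]
fs_rgb = (w_rgb[:, None] * f_rgb).sum(axis=0)
fs_depth = (w_depth[:, None] * f_depth).sum(axis=0)
```

Each pixel thus ends up with its own soft choice among the 8 modality-scale combinations, rather than a single fixed scale for the whole image.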
In one embodiment, the step S5 is as follows: inputting f^s_rgb and f^s_depth respectively into self-attention mechanism modules to expand the receptive field and obtain the features f^a_rgb and f^a_depth, which specifically comprises:
In this step, a self-attention mechanism module is constructed following formulas (3) to (5), and f^s_rgb and f^s_depth are input into it to obtain the features f^a_rgb and f^a_depth. For an input feature x of size c x h x w, reshaped to c x N with N = h x w, three copies Q, K, V of the feature are produced, and:
f = softmax(K^T Q) (3)
y = V f (4)
f^a = λ y + β x (5)
where c denotes the number of feature channels and λ and β are weight parameters.
Each element of f in formula (3) characterizes the self-correlation of the feature: by computing the similarity between each element of the feature map and all the remaining elements, the relationship between the current element and all the others is determined. In this way, the problem of the small receptive field of the extracted features is alleviated. So that the network does not rely entirely on the attention-mechanism features during training, the weight parameters λ and β are introduced to adjust the degree to which the attention features participate at different stages of training the RGB and Depth networks. In this way, f^a_rgb and f^a_depth are obtained.
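The self-attention computation of this step can be sketched directly in NumPy. The three projections producing Q, K, V are random matrices here standing in for the trained 1 x 1 convolutions, and the λ, β values are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
c, h, w = 8, 5, 5
N = h * w
x = rng.standard_normal((c, N))  # flattened feature map, one column per position

# Three "copies" of the input feature (random projections, purely illustrative).
Q = rng.standard_normal((c, c)) @ x
K = rng.standard_normal((c, c)) @ x
V = rng.standard_normal((c, c)) @ x

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

f = softmax(K.T @ Q, axis=0)  # N x N similarity between every pair of positions
y = V @ f                     # each position aggregates information from all others
lam, beta = 0.5, 1.0          # weight parameters lambda and beta
fa = lam * y + beta * x       # attention feature, same shape as the input
```

Because f relates every position to every other position, the effective receptive field of the output covers the whole feature map in a single step, which is the point of this module.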
As shown in fig. 3, in one embodiment, step S6 described above: inputting the features f^a_rgb and f^a_depth into the modality-adaptive module, computing the modality-adaptive weights, and fusing f^a_rgb and f^a_depth with those weights to obtain a pixel-by-pixel semantic classification of the image, specifically comprises the following steps:
step S61: constructing the modality-adaptive module as a 4-layer network with the structure cat'-conv'-sm'-mul' (concatenation layer - convolution layer - softmax layer - matrix multiplication layer), where the cat' layer fuses the two features f^a_rgb and f^a_depth into the fusion feature f'_cat;
step S62: inputting the fusion feature f'_cat into the conv' layer to obtain a weight mask map of size 2 x h x w;
step S63: regularizing the weight mask map with the sm' layer and separating it by channel, obtaining w_rgb and w_depth;
step S64: using the mul' layer, applying the weight mask of one channel to f^a_rgb to obtain P_rgb according to formula (6), and applying the weight mask of the other channel to f^a_depth to obtain P_depth according to formula (7); adding the two according to formula (8) yields the pixel-by-pixel classification result P of the image:
P_rgb = w_rgb ⊙ f^a_rgb (6)
P_depth = w_depth ⊙ f^a_depth (7)
P = P_rgb + P_depth (8)
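Steps S61-S64 can be sketched as follows. The conv' weights are random stand-ins for the trained layer, and the feature sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
c, h, w = 8, 6, 6
fa_rgb = rng.standard_normal((c, h, w))   # attention features of the two modalities
fa_depth = rng.standard_normal((c, h, w))

# cat' -> conv': a 2-channel weight mask from the fused features.
fused = np.concatenate([fa_rgb, fa_depth])      # 2c x h x w
conv_w = rng.standard_normal((2, 2 * c)) * 0.1  # random stand-in for the conv' layer
mask = np.einsum('oc,chw->ohw', conv_w, fused)  # 2 x h x w weight mask map

# sm': normalize across the two channels so the modality weights sum to 1 per pixel.
mask = np.exp(mask) / np.exp(mask).sum(axis=0, keepdims=True)
w_rgb, w_depth = mask[0], mask[1]

# mul' and the final sum: modality-adaptive fusion of the two branches.
P_rgb = w_rgb * fa_rgb
P_depth = w_depth * fa_depth
P = P_rgb + P_depth
```

Because the softmax ties the two channels together, each pixel trades off the RGB branch against the Depth branch adaptively, rather than using a fixed global mixing ratio.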
Example two
As shown in fig. 4, an embodiment of the present invention provides an indoor scene pixel-by-pixel semantic classifier construction system based on RGB-D, which includes the following modules:
the image acquisition module 71 is used for acquiring images of indoor scenes and acquiring RGB data and Depth data;
a category labeling module 72, configured to define a category of an object in the image and label each pixel thereof with a category;
the feature extraction module 73, used for inputting the RGB data and Depth data into respective feature extraction modules, inputting the RGB data into a depth estimation module at the same time, and supervising the RGB feature extraction process with that module to obtain the corresponding features f_rgb and f_depth;
a scale-aware feature extraction module 74, for inputting f_rgb and f_depth into the scale perception module to select feature information at appropriate scales and obtain the scale-aware features f^s_rgb and f^s_depth;
an attention feature extraction module 75, for inputting f^s_rgb and f^s_depth respectively into self-attention mechanism modules to expand the receptive field and obtain the features f^a_rgb and f^a_depth;
an image classification module 76, for inputting f^a_rgb and f^a_depth into the modality-adaptive module, computing the modality-adaptive weights, and fusing f^a_rgb and f^a_depth with those weights to obtain a pixel-by-pixel semantic classification of the image.
The above examples are provided for the purpose of describing the present invention only and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalents and modifications that do not depart from the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (4)
1. An indoor scene pixel-by-pixel semantic classifier construction method based on RGB-D is characterized by comprising the following steps:
step S1: image acquisition is carried out on an indoor scene, and RGB data and Depth data are obtained;
step S2: defining the object categories in the image, and labeling each pixel with its category;
step S3: inputting the RGB data and Depth data into respective feature extraction modules, and simultaneously inputting the RGB data into a depth estimation module which supervises the RGB feature extraction process, to obtain the corresponding features f_rgb and f_depth;
step S4: inputting f_rgb and f_depth into a scale perception module to select feature information at appropriate scales and obtain the scale-aware features f^s_rgb and f^s_depth, specifically comprising:
step S41: extracting multi-scale features from f_rgb and f_depth with a feature pyramid, obtaining {f^k_rgb} and {f^k_depth};
step S42: fusing {f^k_rgb} and {f^k_depth} into a fusion feature f_cat, and obtaining a multichannel scale weight map through a convolution network;
step S43: performing feature selection with the multichannel scale weight map, obtaining the scale-aware features f^s_rgb and f^s_depth respectively;
step S5: inputting f^s_rgb and f^s_depth respectively into self-attention mechanism modules to expand the receptive field, obtaining the features f^a_rgb and f^a_depth;
step S6: inputting f^a_rgb and f^a_depth into a modality-adaptive module, computing the modality-adaptive weights, and fusing f^a_rgb and f^a_depth with those weights to obtain a pixel-by-pixel semantic classification of the image, specifically comprising:
step S61: constructing the modality-adaptive module as a 4-layer network with the structure cat'-conv'-sm'-mul', where the cat' layer fuses the two features f^a_rgb and f^a_depth into a fusion feature f'_cat;
step S62: inputting the fusion feature f'_cat into the conv' layer to obtain a weight mask map of size 2 x h x w;
step S63: regularizing the weight mask map with the sm' layer and separating it by channel, obtaining w_rgb and w_depth;
step S64: using the mul' layer, applying the weight mask of one channel to f^a_rgb to obtain P_rgb, and applying the weight mask of the other channel to f^a_depth to obtain P_depth; adding the two yields the pixel-by-pixel classification result of the image.
2. The method for constructing an RGB-D based indoor scene pixel-by-pixel semantic classifier according to claim 1, wherein the step S3: inputting the RGB data and Depth data into respective feature extraction modules while inputting the RGB data into a depth estimation module that supervises the RGB feature extraction process, to obtain the corresponding features f_rgb and f_depth, specifically comprises:
inputting the RGB data and Depth data into an RGB feature extraction module and a Depth feature extraction module respectively for feature extraction; while the RGB features are extracted, simultaneously inputting the RGB data into a depth estimation module, which supervises the RGB feature extraction process, to obtain the corresponding features f_rgb and f_depth respectively.
3. The RGB-D based indoor scene pixel-by-pixel semantic classifier construction method of claim 1, characterized in that said step S5: inputting the scale-aware features f'_rgb and f'_depth into a self-attention mechanism module to expand the receptive field and obtain the features f''_rgb and f''_depth respectively, specifically comprises:
inputting f'_rgb and f'_depth into the self-attention mechanism module respectively to obtain the features f''_rgb and f''_depth, the self-attention mechanism module being given by formulas (3) to (5);
wherein C denotes the number of feature channels, n = h*w is the number of spatial positions, and λ and β are weight parameters.
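Formulas (3) to (5) are not reproduced in this text, so the sketch below assumes the standard non-local self-attention formulation over a (C, h, w) feature map, with `lambda_` and `beta` standing in for the weight parameters λ and β named in the claim; it is an illustration of the general mechanism, not the patent's exact equations.

```python
# Generic non-local self-attention sketch over a (C, h, w) feature map.
# The exact formulas (3)-(5) of the patent are not available here; this
# assumes the common pairwise-similarity + softmax formulation, with
# lambda_ and beta as residual weight parameters.
import numpy as np

def self_attention(f, lambda_=1.0, beta=1.0):
    c, h, w = f.shape
    n = h * w                                # n = h*w, as in the claim
    x = f.reshape(c, n)                      # flatten spatial positions
    attn = x.T @ x                           # (n, n) pairwise similarity
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)  # softmax over positions
    out = x @ attn.T                         # aggregate context, (c, n)
    return (lambda_ * out + beta * x).reshape(c, h, w)
```

Each output position aggregates features from every other position, which is how the module expands the receptive field beyond the convolutional kernel size.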
4. An indoor scene pixel-by-pixel semantic classifier construction system based on RGB-D is characterized by comprising the following modules:
the image acquisition module is used for acquiring images of indoor scenes and acquiring RGB data and Depth data;
the class labeling module is used for defining the object classes in the image and labeling each pixel with its class;
the feature extraction module is used for inputting the RGB data and the Depth data into the respective feature extraction branches while inputting the RGB data into the depth estimation module, and supervising the RGB feature extraction process with that module to obtain the corresponding features f_rgb and f_depth;
the scale-aware feature extraction module is used for inputting f_rgb and f_depth into the scale-aware module, selecting feature information of the appropriate scales, and obtaining the scale-aware features f'_rgb and f'_depth, specifically comprising:
Step S41: extracting multi-scale features from f_rgb and f_depth with a feature pyramid to obtain m_rgb and m_depth;
Step S42: fusing m_rgb and m_depth to obtain a fused feature f_ms, and passing it through a convolutional network to obtain a multi-channel scale weight map;
Step S43: performing feature selection with the multi-channel scale weight map to obtain the scale-aware features f'_rgb and f'_depth respectively;
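Steps S41-S43 can be sketched for one modality as follows; this is a hedged NumPy illustration in which the number of scales, the resampling back to a common resolution, and the 1x1 convolution weights are all stand-in assumptions, not the patented design.

```python
# Hypothetical NumPy sketch of steps S41-S43 for one modality: pyramid
# features at several scales are fused, a per-scale weight map is computed,
# and a per-pixel weighted sum selects the appropriate scale.
import numpy as np

rng = np.random.default_rng(1)
c, h, w, s = 4, 8, 8, 3                 # channels, size, number of scales

# S41: multi-scale pyramid features (assumed already resampled to h x w)
pyramid = rng.standard_normal((s, c, h, w))

# S42: fuse the scales and derive an s-channel scale weight map (1x1 conv)
fused = pyramid.reshape(s * c, h, w)
w_conv = rng.standard_normal((s, s * c))
weights = np.einsum('ok,khw->ohw', w_conv, fused)        # (s, h, w)
weights = np.exp(weights - weights.max(axis=0, keepdims=True))
weights /= weights.sum(axis=0, keepdims=True)            # softmax over scales

# S43: per-pixel weighted combination of the scale features
f_scale_aware = (weights[:, None] * pyramid).sum(axis=0)  # (c, h, w)
```

The softmax over the scale axis means each pixel softly selects which pyramid level contributes most, which is one plausible reading of "selecting proper scale feature information".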
the attention feature extraction module is used for inputting f'_rgb and f'_depth into the self-attention mechanism modules respectively to expand the receptive field and obtain the features f''_rgb and f''_depth;
the image classification module is used for inputting the features f''_rgb and f''_depth into the modality-adaptive module, calculating the modality-adaptive weights, and fusing f''_rgb and f''_depth with those weights to obtain the pixel-by-pixel semantic classification of the image, specifically comprising:
Step S61: constructing the modality-adaptive module as a 4-layer network structure cat'-conv'-sm'-mul', wherein the cat' layer fuses the two features f''_rgb and f''_depth together into a fused feature f_cat;
Step S62: inputting the fused feature f_cat into the conv' layer to obtain a 2*h*w weight mask map;
Step S63: normalizing the weight mask map with the sm' layer and splitting it by channel to obtain the masks M_rgb and M_depth respectively;
Step S64: using the mul' layer, applying the weight mask M_rgb of one channel to f''_rgb to obtain P_rgb, and applying the weight mask M_depth of the other channel to f''_depth to obtain P_depth; adding the two gives the pixel-by-pixel classification result of the image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110498856.6A CN113222003B (en) | 2021-05-08 | 2021-05-08 | Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113222003A CN113222003A (en) | 2021-08-06 |
CN113222003B (en) | 2023-08-01 |
Family
ID=77091864
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110498856.6A Active CN113222003B (en) | 2021-05-08 | 2021-05-08 | Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113222003B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108596102A (en) * | 2018-04-26 | 2018-09-28 | 北京航空航天大学青岛研究院 | Indoor scene object segmentation grader building method based on RGB-D |
CN108985247A (en) * | 2018-07-26 | 2018-12-11 | 北方工业大学 | Multispectral image urban road identification method |
CN110298361A (en) * | 2019-05-22 | 2019-10-01 | 浙江省北大信息技术高等研究院 | A kind of semantic segmentation method and system of RGB-D image |
WO2019232836A1 (en) * | 2018-06-04 | 2019-12-12 | 江南大学 | Multi-scale sensing pedestrian detection method based on improved full convolutional network |
CN111191650A (en) * | 2019-12-30 | 2020-05-22 | 北京市新技术应用研究所 | Object positioning method and system based on RGB-D image visual saliency |
CN111563418A (en) * | 2020-04-14 | 2020-08-21 | 浙江科技学院 | Asymmetric multi-mode fusion significance detection method based on attention mechanism |
CN111582316A (en) * | 2020-04-10 | 2020-08-25 | 天津大学 | RGB-D significance target detection method |
CN111967477A (en) * | 2020-07-02 | 2020-11-20 | 北京大学深圳研究生院 | RGB-D image saliency target detection method, device, equipment and storage medium |
Non-Patent Citations (6)
Title |
---|
Depth-Aware CNN for RGB-D Segmentation; Weiyue Wang et al.; Computer Vision - ECCV 2018; 144-161 *
Scale-aware network with modality-awareness for RGB-D indoor semantic segmentation; Feng Zhou et al.; Neurocomputing, vol. 492; 464-473 *
TSNet: Three-Stream Self-Attention Network for RGB-D Indoor Semantic Segmentation; Wujie Zhou et al.; IEEE Intelligent Systems, vol. 36, no. 4; 73-78 *
RGB-D based reverse-fusion instance segmentation algorithm; Wang Dandan et al.; Journal of Graphics, vol. 42, no. 5; 767-774 *
Human action recognition model based on an improved deep neural network; He Bingqian et al.; Application Research of Computers, vol. 36, no. 10; 3107-3111 *
RGB-D image semantic segmentation network based on a channel attention mechanism; Wu Zihan et al.; Electronic Design Engineering, vol. 28, no. 13; 147-153, 159 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110298262B (en) | Object identification method and device | |
CN111052126B (en) | Pedestrian attribute identification and positioning method and convolutional neural network system | |
CN109559320B (en) | Method and system for realizing visual SLAM semantic mapping function based on hole convolution deep neural network | |
CN111814661B (en) | Human body behavior recognition method based on residual error-circulating neural network | |
CN109948475B (en) | Human body action recognition method based on skeleton features and deep learning | |
CN112990211B (en) | Training method, image processing method and device for neural network | |
CN109902548B (en) | Object attribute identification method and device, computing equipment and system | |
CN112800903B (en) | Dynamic expression recognition method and system based on space-time diagram convolutional neural network | |
Cadena et al. | Pedestrian graph: Pedestrian crossing prediction based on 2d pose estimation and graph convolutional networks | |
CN111696110B (en) | Scene segmentation method and system | |
CN110222718B (en) | Image processing method and device | |
CN113344932B (en) | Semi-supervised single-target video segmentation method | |
CN112329525A (en) | Gesture recognition method and device based on space-time diagram convolutional neural network | |
CN111428664B (en) | Computer vision real-time multi-person gesture estimation method based on deep learning technology | |
CN112801015A (en) | Multi-mode face recognition method based on attention mechanism | |
Khan et al. | A deep survey on supervised learning based human detection and activity classification methods | |
CN111832592A (en) | RGBD significance detection method and related device | |
Grigorev et al. | Depth estimation from single monocular images using deep hybrid network | |
CN111209873A (en) | High-precision face key point positioning method and system based on deep learning | |
Manssor et al. | Real-time human detection in thermal infrared imaging at night using enhanced Tiny-yolov3 network | |
Jiang et al. | Unsupervised monocular depth perception: Focusing on moving objects | |
Yang et al. | [Retracted] A Method of Image Semantic Segmentation Based on PSPNet | |
CN116740419A (en) | Target detection method based on graph regulation network | |
CN114067273A (en) | Night airport terminal thermal imaging remarkable human body segmentation detection method | |
Jiang et al. | An end-to-end human segmentation by region proposed fully convolutional network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||