CN113222003B - Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D - Google Patents

Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D Download PDF

Info

Publication number
CN113222003B
CN113222003B (application CN202110498856.6A; also published as CN113222003A)
Authority
CN
China
Prior art keywords
rgb
pixel
depth
module
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110498856.6A
Other languages
Chinese (zh)
Other versions
CN113222003A (en)
Inventor
周锋
张凤全
蔡兴泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China University of Technology
Original Assignee
North China University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China University of Technology filed Critical North China University of Technology
Priority to CN202110498856.6A priority Critical patent/CN113222003B/en
Publication of CN113222003A publication Critical patent/CN113222003A/en
Application granted granted Critical
Publication of CN113222003B publication Critical patent/CN113222003B/en

Classifications

    • G06F 18/2414: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches; based on distances to training or reference patterns (distances to prototypes, distances to cluster centroids); smoothing the distance, e.g. radial basis function networks [RBFN]
    • G06F 18/253: Pattern recognition; fusion techniques of extracted features
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/267: Image or video recognition or understanding; image preprocessing; segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V 10/40: Image or video recognition or understanding; extraction of image or video features
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management (climate change mitigation technologies in information and communication technologies)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an RGB-D based method and system for constructing a pixel-by-pixel semantic classifier for indoor scenes. The method comprises the following steps. S1: capturing images of an indoor scene to obtain RGB data and Depth data; S2: defining the object categories in the images and labeling each pixel with its category; S3: inputting the RGB data and the Depth data into their respective feature extraction modules while also inputting the RGB data into a depth estimation module that supervises the RGB feature extraction process, obtaining the corresponding features f_rgb and f_depth; S4: inputting f_rgb and f_depth into a scale-aware module to obtain the scale-aware features f_rgb^sa and f_depth^sa; S5: inputting f_rgb^sa and f_depth^sa into self-attention modules to obtain the features f_rgb^att and f_depth^att; S6: inputting f_rgb^att and f_depth^att into a modality-adaptive module that computes modality-adaptive weights and uses them to fuse f_rgb^att and f_depth^att, yielding the pixel-by-pixel semantic classification of the image. The method can be applied to indoor scene understanding; the pixel-by-pixel semantic information of the captured RGB-D images can effectively support indoor automatic navigation and other applications.

Description

Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D
Technical Field
The invention relates to the fields of computer vision and machine learning, in particular to an RGB-D based method and system for constructing a pixel-by-pixel semantic classifier for indoor scenes.
Background
In recent years, research results on indoor scene semantic understanding have been widely applied in different fields, including emergency drills in the security field and positioning, obstacle avoidance and target search in intelligent robotics; the topic has become a research hotspot in virtual reality, augmented reality and related fields and brings convenience to people's daily life and work. However, semantic understanding of indoor scenes remains highly challenging because of conditions such as poor lighting and mutually occluding objects, and it has long been a fundamental and classical problem in computer graphics, virtual reality, computer vision and related fields.
Scene semantic segmentation is a pixel-level image classification task: given an image I of size w'×h', the goal of a semantic segmentation algorithm is to output a result map of the same size w'×h', in which each pixel x_{i,j} corresponds to the pixel at position (i,j) of I and its value is the label class of that pixel. Image-based semantic segmentation is a necessary prerequisite for semantic understanding of indoor scenes. Semantic segmentation of images has been studied extensively and deeply, and the methods fall into two groups: those based on hand-crafted features and those based on deep learning. Hand-crafted-feature methods mainly extract robust artificial features from the input image data; depending on how the features are used, there are generally the following types. Image segmentation based on manually set thresholds: a threshold (mean method, histogram-based techniques, visual techniques, etc.) is applied to the input image to solve the segmentation problem, but such methods handle homogeneous regions in the image poorly. Segmentation based on the watershed algorithm: applying a watershed algorithm to the gradient image gives better segmentation results, but the algorithm requires the gradient map to be computed in advance and needs relatively complex pre-processing. With the development of deep learning, a large number of algorithms based on deep neural networks have been proposed. For example, the FCN algorithm replaces the last layer of the classical AlexNet network with a 1×1 convolution and thereby achieves pixel-by-pixel classification of the input image; the vast majority of subsequent image semantic segmentation algorithms are improvements on FCN. However, these algorithms cannot effectively extract RGB features and Depth features and exploit them to achieve a better indoor scene semantic segmentation result.
Disclosure of Invention
In order to solve the technical problems, the invention provides an indoor scene pixel-by-pixel semantic classifier construction method and system based on RGB-D.
The technical scheme of the invention is as follows: an indoor scene pixel-by-pixel semantic classifier construction method based on RGB-D comprises the following steps:
step S1: image acquisition is carried out on an indoor scene, and RGB data and Depth data are obtained;
step S2: defining the object categories in the image, and labeling each pixel with its category;
step S3: the RGB data and the Depth data are input into their respective feature extraction modules; at the same time, the RGB data is input into a depth estimation module, which is used to supervise the RGB feature extraction process, obtaining the corresponding features f_rgb and f_depth;
step S4: f_rgb and f_depth are input into a scale-aware module, which selects feature information of appropriate scales and obtains the scale-aware features f_rgb^sa and f_depth^sa;
step S5: f_rgb^sa and f_depth^sa are each input into a self-attention module, which enlarges the receptive field and yields the features f_rgb^att and f_depth^att;
step S6: f_rgb^att and f_depth^att are input into a modality-adaptive module, which computes the modality-adaptive weights and uses them to fuse f_rgb^att and f_depth^att, obtaining the pixel-by-pixel semantic classification of the image.
Compared with the prior art, the invention has the following advantages:
1. The invention provides a method for constructing a pixel-by-pixel classifier for indoor scenes based on RGB-D data. Exploiting the interaction between the RGB and Depth data collected in an indoor scene, the Depth data is used as a ground-truth depth signal for the RGB branch, prompting the RGB branch to extract more effective and targeted features; several modules are designed to fuse the features from the RGB data with those from the Depth data; and finally an organic way of combining the score maps of the different modalities is provided, giving an effective solution for pixel-by-pixel classification of indoor scenes.
2. The invention provides an end-to-end network architecture in which the network modules used for feature extraction may be generic or designed later without significantly affecting the modules presented herein. By integrating the modules and optimizing them end to end, the invention can be applied to indoor-scene applications; in particular, for tasks requiring fine-grained semantic understanding, effective semantic segmentation of the currently captured scene can effectively support indoor robot navigation and other related applications.
Drawings
FIG. 1 is a flow chart of a construction method of an indoor scene pixel-by-pixel semantic classifier based on RGB-D in an embodiment of the invention;
FIG. 2 is a flow chart of step S4 of the RGB-D based indoor scene pixel-by-pixel semantic classifier construction method according to an embodiment of the invention: inputting f_rgb and f_depth into the scale-aware module, selecting feature information of appropriate scales, and obtaining the scale-aware features f_rgb^sa and f_depth^sa;
FIG. 3 is a flow chart of step S6 of the RGB-D based indoor scene pixel-by-pixel semantic classifier construction method according to an embodiment of the invention: inputting the features f_rgb^att and f_depth^att into the modality-adaptive module, computing the modality-adaptive weights, and using them to fuse f_rgb^att and f_depth^att, obtaining the pixel-by-pixel semantic classification of the image;
fig. 4 is a structural block diagram of an indoor scene pixel-by-pixel semantic classifier construction system based on RGB-D according to an embodiment of the present invention.
Detailed Description
The invention provides an RGB-D based method for constructing a pixel-by-pixel semantic classifier for indoor scenes. It can be applied to indoor-scene applications, in particular to tasks requiring fine-grained semantic understanding: by effectively performing semantic segmentation of the currently captured scene, it can effectively support indoor robot navigation and other related applications.
The present invention will be further described in detail below with reference to the accompanying drawings by way of specific embodiments in order to make the objects, technical solutions and advantages of the present invention more apparent.
The embodiments of the invention were executed on a computer with a 4.0 GHz 4-core central processing unit and 128 GB of memory; to accelerate the training and inference of the semantic segmentation network model, eight NVIDIA GeForce GTX 1080 Ti GPUs were used for accelerated computation. The construction program of the indoor-scene pixel-by-pixel classifier, which serves RGB-D input data and is based on a convolutional neural network, was written in languages such as Python and C++. The invention can also be run in other execution environments, provided the computer's memory and GPU memory allow it, and this is not described further here.
Example 1
As shown in fig. 1, the construction method of the indoor scene pixel-by-pixel semantic classifier based on RGB-D provided by the embodiment of the invention comprises the following steps:
step S1: image acquisition is carried out on an indoor scene, and RGB data and Depth data are obtained;
step S2: defining the object categories in the image, and labeling each pixel with its category;
step S3: the RGB data and the Depth data are input into their respective feature extraction modules; at the same time, the RGB data is input into a depth estimation module, which is used to supervise the RGB feature extraction process, obtaining the corresponding features f_rgb and f_depth;
step S4: f_rgb and f_depth are input into the scale-aware module, which selects feature information of appropriate scales and obtains the scale-aware features f_rgb^sa and f_depth^sa;
step S5: f_rgb^sa and f_depth^sa are each input into a self-attention module to enlarge the receptive field, obtaining the features f_rgb^att and f_depth^att;
step S6: f_rgb^att and f_depth^att are input into the modality-adaptive module, which computes the modality-adaptive weights and uses them to fuse f_rgb^att and f_depth^att, obtaining the pixel-by-pixel semantic classification of the image.
In one embodiment, step S1 described above: image acquisition is carried out on an indoor scene to obtain RGB data and Depth data, and the method specifically comprises the following steps:
and acquiring data of an indoor scene by using a consumer-level Depth camera such as Microsoft Kinect and the like to acquire RGB data and Depth data.
The point coordinates of the RGB and Depth images are converted into the camera coordinate system; using the extrinsic parameters between the infrared camera and the optical center of the RGB camera, the correspondence between points of the RGB image and points of the Depth image is established, completing the data alignment.
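For illustration only, a minimal NumPy sketch of such an alignment step is given below; the intrinsic matrices K_d and K_rgb and the extrinsics R, t are assumed to be known calibration parameters, and the function and variable names are illustrative rather than part of the disclosed implementation (no z-buffering of colliding points is performed in this sketch).

```python
import numpy as np

def align_depth_to_rgb(depth, K_d, K_rgb, R, t):
    """Warp a depth map from the depth (infrared) camera into the RGB camera frame.

    depth : (H, W) array of depth values in meters (0 = invalid).
    K_d, K_rgb : 3x3 intrinsic matrices of the depth and RGB cameras (assumed known).
    R, t : rotation (3x3) and translation (3,) from the depth camera to the RGB camera.
    Returns an (H, W) depth map registered to the RGB image grid.
    """
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0

    # Back-project every depth pixel to a 3D point in the depth-camera coordinate system.
    pix = np.stack([us.reshape(-1), vs.reshape(-1), np.ones(h * w)], axis=0)
    pts_d = np.linalg.inv(K_d) @ pix * z                 # 3 x (H*W)

    # Transform the points into the RGB-camera coordinate system with the extrinsics.
    pts_rgb = R @ pts_d + t.reshape(3, 1)

    # Project into the RGB image plane.
    proj = K_rgb @ pts_rgb
    z_cam = np.where(np.abs(proj[2]) < 1e-9, 1e-9, proj[2])  # guard against division by zero
    u_rgb = np.round(proj[0] / z_cam).astype(int)
    v_rgb = np.round(proj[1] / z_cam).astype(int)

    aligned = np.zeros_like(depth)
    ok = valid & (u_rgb >= 0) & (u_rgb < w) & (v_rgb >= 0) & (v_rgb < h)
    aligned[v_rgb[ok], u_rgb[ok]] = pts_rgb[2, ok]       # keep the transformed depth value
    return aligned
```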
In one embodiment, the above step S2: defining the object categories in the image and labeling each pixel with its category, specifically comprises:
The object categories present in the current indoor scene are counted; the total number of categories is the number of classes that each pixel in the indoor scene can take. The pixels of the RGB and Depth images captured in the indoor scene are labeled manually. Since RGB and Depth describe the same scene, the label values of the RGB and Depth images are identical.
In one embodiment, the above step S3: inputting the RGB data and the Depth data into their respective feature extraction modules while also inputting the RGB data into a depth estimation module that supervises the RGB feature extraction process, obtaining the corresponding features f_rgb and f_depth, specifically comprises:
The RGB data and the Depth data are input into an RGB feature extraction module and a Depth feature extraction module respectively for feature extraction; while the RGB features are being extracted, the RGB data is simultaneously input into a depth estimation module, which is used to supervise the RGB feature extraction process, obtaining the corresponding features f_rgb and f_depth.
In this step, features are extracted from the input RGB and Depth data with a classical convolutional neural network structure such as ResNet. In the embodiment of the invention, ResNet-18 is used as the encoder, organized as a B1-B2-B3-B4 structure. Since RGB and Depth are image data of two different modalities, two separate feature extraction modules are used, one for each modality. In the RGB feature extraction branch, depth estimation serves as an additional supervision signal; the depth estimation module of the embodiment adopts a B1-B2-B3-B4-B4-B3-B2-B1 structure. To prevent the resolution of the extracted features from becoming too small, the downsampling layers are replaced with dilated (atrous) convolutions, and the resolution after feature extraction is generally required to be no less than 1/8 of the original input image. Finally, this step extracts the features f_rgb and f_depth corresponding to the input RGB data and Depth data.
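For illustration only, a simplified sketch of this step is given below, assuming the PyTorch framework and the torchvision ResNet-18 as the B1-B4 encoder blocks; the auxiliary depth-estimation decoder, the L1 loss, the channel counts and the layer names are assumptions based on this description, and the dilated-convolution modification that keeps the feature resolution at 1/8 of the input is omitted for brevity.

```python
import torch
import torch.nn as nn
import torchvision

class Encoder(nn.Module):
    """ResNet-18 encoder producing the feature f for one modality (B1-B2-B3-B4)."""
    def __init__(self, in_channels=3):
        super().__init__()
        net = torchvision.models.resnet18(weights=None)
        if in_channels != 3:  # the Depth branch takes a 1-channel input
            net.conv1 = nn.Conv2d(in_channels, 64, 7, stride=2, padding=3, bias=False)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.blocks = nn.Sequential(net.layer1, net.layer2, net.layer3, net.layer4)

    def forward(self, x):
        return self.blocks(self.stem(x))

class DepthEstimationHead(nn.Module):
    """Auxiliary decoder that regresses depth from the RGB feature (supervision signal)."""
    def __init__(self, in_channels=512):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, 1))

    def forward(self, f_rgb, out_size):
        return nn.functional.interpolate(self.decode(f_rgb), size=out_size,
                                         mode='bilinear', align_corners=False)

# Feature extraction with depth supervision on the RGB branch (sketch with dummy data).
rgb_encoder, depth_encoder = Encoder(3), Encoder(1)
depth_head = DepthEstimationHead()
rgb = torch.randn(2, 3, 240, 320)            # dummy RGB batch
depth = torch.rand(2, 1, 240, 320)           # dummy aligned Depth batch
f_rgb, f_depth = rgb_encoder(rgb), depth_encoder(depth)
pred_depth = depth_head(f_rgb, out_size=depth.shape[-2:])
aux_loss = nn.functional.l1_loss(pred_depth, depth)  # Depth acts as ground truth for RGB
```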
As shown in FIG. 2, in one embodiment, the above step S4: inputting f_rgb and f_depth into the scale-aware module, selecting feature information of appropriate scales, and obtaining the scale-aware features f_rgb^sa and f_depth^sa, specifically comprises:
Step S41: extracting multi-scale features from f_rgb and f_depth with a feature pyramid, obtaining {f_rgb^i} and {f_depth^i}, i = 1, ..., 4;
In the embodiment of the invention, an atrous spatial pyramid pooling (ASPP) module is used to extract the multi-scale features: dilated convolutions with dilation rates {6, 12, 18, 24} extract feature information at 4 scales from the RGB and Depth features respectively, yielding {f_rgb^i} and {f_depth^i}.
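For illustration only, a minimal PyTorch-style sketch of the ASPP multi-scale extraction described above follows; the channel sizes are assumptions, while the dilation rates {6, 12, 18, 24} come from this description.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: 4 parallel dilated convolutions, one per scale."""
    def __init__(self, in_channels=512, out_channels=128, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True))
            for r in rates])

    def forward(self, f):
        # Returns a list of 4 same-resolution feature maps, one per dilation rate.
        return [branch(f) for branch in self.branches]

aspp_rgb, aspp_depth = ASPP(), ASPP()
f_rgb = torch.randn(2, 512, 30, 40)     # stand-in for the encoder feature f_rgb
f_depth = torch.randn(2, 512, 30, 40)   # stand-in for the encoder feature f_depth
ms_rgb = aspp_rgb(f_rgb)                # {f_rgb^i},   i = 1..4
ms_depth = aspp_depth(f_depth)          # {f_depth^i}, i = 1..4
```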
Step S42: will beAnd->Fusion is carried out to obtain fusion characteristics->Obtaining a multichannel scale weight graph through a convolution network;
in this step, a 3-layer network structure is constructed, which has the structure of cat-conv-sm (link layer-convolution layer-softmax layer),wherein the cat layer is used to extract multi-scale features of the RGB modalityAnd multiscale features of Depth modalitiesLigating together to construct a fusion feature->Because the feature pyramid used for extracting the features in the embodiment of the invention has 4 scales, namely the RGB features have 4 scales, and the Depth features have 4 scales, the method is realized by c A 1 x 1 convolutional network of onv layers can obtain an 8-channel scale weight map.
Step S43: feature selection is carried out on the multichannel scale weight graph to respectively obtain scale perception featuresAnd->
In this step, the 8-channel weight map is input sm The weight normalization is carried out in the layers, and finally the first 4 channels are used for selecting RGB scale characteristics, the last 4 channels are used for selecting Depth scale characteristics, and the scale perception characteristics can be obtained after the 3-layer network structure is adoptedAnd->The following formulas (1) to (2) show:
wherein,,respectively, a 4-channel weight map for selecting RGB scale features and a 4-channel weight map for selecting Depth scale features, denoted hadamard product.
In one embodiment, the above step S5: inputting f_rgb^sa and f_depth^sa into the self-attention modules to enlarge the receptive field and obtain the features f_rgb^att and f_depth^att, specifically comprises:
In this step, the self-attention module is constructed according to the following formulas (3) to (5); f_rgb^sa and f_depth^sa are input into the self-attention module to obtain the features f_rgb^att and f_depth^att:
A_{j,i} = exp(q_i · k_j) / Σ_{i=1}^{N} exp(q_i · k_j)    (3)
f_att,j = Σ_{i=1}^{N} A_{j,i} v_i    (4)
f^att = λ f_att + β f    (5)
where c denotes the number of feature channels; the input feature f is reshaped to c×N and copied three times to form q, k and v; N = h×w; and λ and β are weight parameters.
Each element of the attention map in formula (3) characterizes the correlation of the current feature element with the others: the similarity between each element and all the remaining elements of the feature map is computed to determine the relationship between the current element and the rest. In this way, the problem of the extracted features having too small a receptive field is alleviated. To keep the network from relying entirely on the attention features during training, the weight parameters λ and β are introduced to adjust the degree to which the attention features participate at different stages of the RGB and Depth network training. In this way, f_rgb^att and f_depth^att are obtained.
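For illustration only, a PyTorch-style sketch of a self-attention module consistent with formulas (3) to (5) as reconstructed above is given; the use of plain reshaped copies for the query, key and value (rather than learned projections) and the initialization of λ and β are assumptions.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Position self-attention: each of the N = h*w positions attends to all others."""
    def __init__(self):
        super().__init__()
        # Trainable weights controlling how much the attention feature participates.
        self.lam = nn.Parameter(torch.zeros(1))   # λ in formula (5)
        self.beta = nn.Parameter(torch.ones(1))   # β in formula (5)

    def forward(self, f):
        b, c, h, w = f.shape
        n = h * w
        q = f.view(b, c, n)                       # three copies of the input feature
        k = f.view(b, c, n)
        v = f.view(b, c, n)
        # Formula (3): similarity of every position with every other, softmax-normalized.
        attn = torch.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)   # b x n x n
        # Formula (4): aggregate the value features with the attention map.
        f_att = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        # Formula (5): weighted combination of the attention feature and the input feature.
        return self.lam * f_att + self.beta * f

att = SelfAttention()
f_rgb_att = att(torch.randn(2, 128, 30, 40))      # stand-in for f_rgb^sa
f_depth_att = att(torch.randn(2, 128, 30, 40))    # stand-in for f_depth^sa
```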
As shown in FIG. 3, in one embodiment, the above step S6: inputting the features f_rgb^att and f_depth^att into the modality-adaptive module, computing the modality-adaptive weights, and using them to fuse f_rgb^att and f_depth^att to obtain the pixel-by-pixel semantic classification of the image, specifically comprises:
Step S61: constructing the modality-adaptive module with a 4-layer network structure cat'-conv'-sm'-mul' (concatenation layer, convolution layer, softmax layer, matrix multiplication layer), where the cat' layer fuses the two features f_rgb^att and f_depth^att into the fused feature f_fuse;
Step S62: inputting the fused feature f_fuse into the conv' layer to obtain a 2×h×w weight mask map;
Step S63: regularizing the weight mask map with the sm' layer and splitting it by channel to obtain w_rgb and w_depth respectively;
Step S64: applying a weight mask pattern to a channel using mul' layerP is obtained according to the formula (6) rgb The method comprises the steps of carrying out a first treatment on the surface of the The weight mask map of the other channel acts on +.>P is obtained according to the formula (7) depth The method comprises the steps of carrying out a first treatment on the surface of the And (3) adding the two according to a formula (8) to obtain a pixel-by-pixel classification result P of the image.
P=P rgb +P depth (8)
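A possible PyTorch-style sketch of the cat'-conv'-sm'-mul' modality-adaptive fusion and of formulas (6) to (8) follows; projecting the attention features to per-class score maps before fusion, and the number of classes, are assumptions added so that the output is a classification result.

```python
import torch
import torch.nn as nn

class ModalityAdaptiveFusion(nn.Module):
    """cat'-conv'-sm'-mul': predict a 2-channel (RGB/Depth) weight mask and use it to
    fuse the per-class score maps of the two modalities (formulas (6)-(8))."""
    def __init__(self, channels=128, num_classes=13):
        super().__init__()
        self.score_rgb = nn.Conv2d(channels, num_classes, 1)    # assumed classifiers
        self.score_depth = nn.Conv2d(channels, num_classes, 1)
        self.conv = nn.Conv2d(2 * channels, 2, 1)               # conv' layer: 2*h*w mask
        self.softmax = nn.Softmax(dim=1)                        # sm' layer

    def forward(self, f_rgb_att, f_depth_att):
        f_fuse = torch.cat([f_rgb_att, f_depth_att], dim=1)     # cat' layer
        w = self.softmax(self.conv(f_fuse))                     # 2-channel weight mask
        w_rgb, w_depth = w[:, 0:1], w[:, 1:2]
        p_rgb = w_rgb * self.score_rgb(f_rgb_att)               # formula (6), mul' layer
        p_depth = w_depth * self.score_depth(f_depth_att)       # formula (7)
        return p_rgb + p_depth                                  # formula (8)

fusion = ModalityAdaptiveFusion()
f_rgb_att = torch.randn(2, 128, 30, 40)      # stand-in for f_rgb^att
f_depth_att = torch.randn(2, 128, 30, 40)    # stand-in for f_depth^att
scores = fusion(f_rgb_att, f_depth_att)      # per-pixel class scores P
labels = scores.argmax(dim=1)                # pixel-by-pixel semantic classification
```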
The invention thus provides a method for constructing a pixel-by-pixel classifier for indoor scenes based on RGB-D data: exploiting the interaction between the collected RGB and Depth data, the Depth data is used as a ground-truth depth signal for the RGB branch, prompting the RGB branch to extract more effective and targeted features; several modules are designed to fuse the features from the RGB data with those from the Depth data; and an organic way of combining the score maps of the different modalities is provided, giving an effective solution for pixel-by-pixel classification of indoor scenes.
The invention also provides an end-to-end network architecture in which the network modules used for feature extraction may be generic or designed later without significantly affecting the modules presented herein. By integrating the modules and optimizing them end to end, the invention can be applied to indoor-scene applications; in particular, for tasks requiring fine-grained semantic understanding, effective semantic segmentation of the currently captured scene can effectively support indoor robot navigation and other related applications.
Example two
As shown in fig. 4, an embodiment of the present invention provides an indoor scene pixel-by-pixel semantic classifier construction system based on RGB-D, which includes the following modules:
the image acquisition module 71 is used for acquiring images of indoor scenes and acquiring RGB data and Depth data;
a category labeling module 72, configured to define a category of an object in the image and label each pixel thereof with a category;
a feature extraction module 73, used to input the RGB data and the Depth data into their respective feature extraction sub-modules, to input the RGB data into the depth estimation module at the same time, and to supervise the RGB feature extraction process with the depth estimation module, obtaining the corresponding features f_rgb and f_depth;
a scale-aware feature extraction module 74, used to input f_rgb and f_depth into the scale-aware module, select feature information of appropriate scales, and obtain the scale-aware features f_rgb^sa and f_depth^sa;
an attention feature extraction module 75, used to input f_rgb^sa and f_depth^sa into the self-attention modules respectively to enlarge the receptive field, obtaining the features f_rgb^att and f_depth^att;
an image classification module 76, used to input f_rgb^att and f_depth^att into the modality-adaptive module, compute the modality-adaptive weights, and fuse f_rgb^att and f_depth^att with these weights to obtain the pixel-by-pixel semantic classification of the image.
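For illustration, a hedged sketch of how the modules 71-76 might be chained end to end is given below; it reuses the class definitions from the preceding sketches (Encoder, DepthEstimationHead, ASPP, ScaleAwareModule, SelfAttention, ModalityAdaptiveFusion), and the combination of the segmentation loss with the auxiliary depth-estimation loss is an assumption rather than the disclosed training procedure.

```python
import torch
import torch.nn as nn

class RGBDSegmentationNet(nn.Module):
    """End-to-end assembly of the sketched modules: dual encoders with depth supervision,
    ASPP, scale-aware selection, self-attention and modality-adaptive fusion."""
    def __init__(self, num_classes=13):
        super().__init__()
        self.rgb_enc, self.depth_enc = Encoder(3), Encoder(1)
        self.depth_head = DepthEstimationHead()
        self.aspp_rgb, self.aspp_depth = ASPP(), ASPP()
        self.scale_aware = ScaleAwareModule()
        self.att_rgb, self.att_depth = SelfAttention(), SelfAttention()
        self.fusion = ModalityAdaptiveFusion(num_classes=num_classes)

    def forward(self, rgb, depth):
        f_rgb, f_depth = self.rgb_enc(rgb), self.depth_enc(depth)
        pred_depth = self.depth_head(f_rgb, out_size=depth.shape[-2:])
        f_rgb_sa, f_depth_sa = self.scale_aware(self.aspp_rgb(f_rgb),
                                                self.aspp_depth(f_depth))
        f_rgb_att, f_depth_att = self.att_rgb(f_rgb_sa), self.att_depth(f_depth_sa)
        return self.fusion(f_rgb_att, f_depth_att), pred_depth

# One assumed training step: segmentation loss plus auxiliary depth-estimation loss.
net = RGBDSegmentationNet()
rgb, depth = torch.randn(2, 3, 240, 320), torch.rand(2, 1, 240, 320)
labels = torch.randint(0, 13, (2, 240, 320))              # dummy per-pixel labels
scores, pred_depth = net(rgb, depth)
scores = nn.functional.interpolate(scores, size=rgb.shape[-2:],
                                   mode='bilinear', align_corners=False)
loss = (nn.functional.cross_entropy(scores, labels)
        + nn.functional.l1_loss(pred_depth, depth))
```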
The above examples are provided for the purpose of describing the present invention only and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalents and modifications that do not depart from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (4)

1. An indoor scene pixel-by-pixel semantic classifier construction method based on RGB-D is characterized by comprising the following steps:
step S1: image acquisition is carried out on an indoor scene, and RGB data and Depth data are obtained;
step S2: defining the object categories in the image, and labeling each pixel with its category;
step S3: the RGB data and the Depth data are input into their respective feature extraction modules; at the same time, the RGB data is input into a depth estimation module, which is used to supervise the RGB feature extraction process, obtaining the corresponding features f_rgb and f_depth;
step S4: f_rgb and f_depth are input into a scale-aware module, which selects feature information of appropriate scales and obtains the scale-aware features f_rgb^sa and f_depth^sa, specifically comprising:
step S41: extracting multi-scale features from f_rgb and f_depth with a feature pyramid, obtaining {f_rgb^i} and {f_depth^i};
step S42: fusing {f_rgb^i} and {f_depth^i} to obtain the fused feature f_ms, and obtaining a multi-channel scale weight map through a convolutional network;
step S43: performing feature selection with the multi-channel scale weight map to obtain the scale-aware features f_rgb^sa and f_depth^sa respectively;
step S5: f_rgb^sa and f_depth^sa are each input into a self-attention module to enlarge the receptive field, obtaining the features f_rgb^att and f_depth^att;
step S6: f_rgb^att and f_depth^att are input into a modality-adaptive module, which computes the modality-adaptive weights and uses them to fuse f_rgb^att and f_depth^att, obtaining the pixel-by-pixel semantic classification of the image, specifically comprising:
step S61: constructing the modality-adaptive module with a 4-layer network structure cat'-conv'-sm'-mul', where the cat' layer fuses the two features f_rgb^att and f_depth^att into the fused feature f_fuse;
step S62: inputting the fused feature f_fuse into the conv' layer to obtain a 2×h×w weight mask map;
step S63: regularizing the weight mask map with the sm' layer and splitting it by channel to obtain w_rgb and w_depth respectively;
step S64: using the mul' layer, the weight mask map of one channel is applied to f_rgb^att to obtain P_rgb, the weight mask map of the other channel is applied to f_depth^att to obtain P_depth, and the two are added to obtain the pixel-by-pixel classification result of the image.
2. The method for constructing an RGB-D based indoor scene pixel-by-pixel semantic classifier according to claim 1, wherein the step S3: inputting the RGB data and the Depth data into their respective feature extraction modules while also inputting the RGB data into a depth estimation module that supervises the RGB feature extraction process, obtaining the corresponding features f_rgb and f_depth, specifically comprises:
inputting the RGB data and the Depth data into an RGB feature extraction module and a Depth feature extraction module respectively for feature extraction; while the RGB features are being extracted, inputting the RGB data simultaneously into a depth estimation module, which is used to supervise the RGB feature extraction process, obtaining the corresponding features f_rgb and f_depth.
3. The method for constructing an RGB-D based indoor scene pixel-by-pixel semantic classifier according to claim 1, wherein the step S5: inputting f_rgb^sa and f_depth^sa into the self-attention modules to enlarge the receptive field and obtain the features f_rgb^att and f_depth^att, specifically comprises:
inputting f_rgb^sa and f_depth^sa into the self-attention modules respectively to obtain the features f_rgb^att and f_depth^att, wherein the self-attention module is given by the following formulas (3) to (5):
A_{j,i} = exp(q_i · k_j) / Σ_{i=1}^{N} exp(q_i · k_j)    (3)
f_att,j = Σ_{i=1}^{N} A_{j,i} v_i    (4)
f^att = λ f_att + β f    (5)
where c represents the number of feature channels; the input feature f is reshaped to c×N and copied three times to form q, k and v; N = h×w; and λ and β are weight parameters.
4. An indoor scene pixel-by-pixel semantic classifier construction system based on RGB-D is characterized by comprising the following modules:
the image acquisition module is used for acquiring images of indoor scenes and acquiring RGB data and Depth data;
a class labeling module, used to define the object categories in the image and label each pixel with its category;
a feature extraction module, used to input the RGB data and the Depth data into their respective feature extraction sub-modules, to input the RGB data into the depth estimation module at the same time, and to supervise the RGB feature extraction process with the depth estimation module, obtaining the corresponding features f_rgb and f_depth;
a scale-aware feature extraction module, used to input f_rgb and f_depth into the scale-aware module, select feature information of appropriate scales, and obtain the scale-aware features f_rgb^sa and f_depth^sa, specifically comprising:
step S41: extracting multi-scale features from f_rgb and f_depth with a feature pyramid, obtaining {f_rgb^i} and {f_depth^i};
step S42: fusing {f_rgb^i} and {f_depth^i} to obtain the fused feature f_ms, and obtaining a multi-channel scale weight map through a convolutional network;
step S43: performing feature selection with the multi-channel scale weight map to obtain the scale-aware features f_rgb^sa and f_depth^sa respectively;
an attention feature extraction module, used to input f_rgb^sa and f_depth^sa into the self-attention modules respectively to enlarge the receptive field, obtaining the features f_rgb^att and f_depth^att;
an image classification module, used to input f_rgb^att and f_depth^att into the modality-adaptive module, compute the modality-adaptive weights, and fuse f_rgb^att and f_depth^att with these weights to obtain the pixel-by-pixel semantic classification of the image, specifically comprising:
step S61: constructing the modality-adaptive module with a 4-layer network structure cat'-conv'-sm'-mul', where the cat' layer fuses the two features f_rgb^att and f_depth^att into the fused feature f_fuse;
step S62: inputting the fused feature f_fuse into the conv' layer to obtain a 2×h×w weight mask map;
step S63: regularizing the weight mask map with the sm' layer and splitting it by channel to obtain w_rgb and w_depth respectively;
step S64: using the mul' layer, the weight mask map of one channel is applied to f_rgb^att to obtain P_rgb, the weight mask map of the other channel is applied to f_depth^att to obtain P_depth, and the two are added to obtain the pixel-by-pixel classification result of the image.
CN202110498856.6A 2021-05-08 2021-05-08 Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D Active CN113222003B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110498856.6A CN113222003B (en) 2021-05-08 2021-05-08 Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110498856.6A CN113222003B (en) 2021-05-08 2021-05-08 Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D

Publications (2)

Publication Number Publication Date
CN113222003A (en) 2021-08-06
CN113222003B (en) 2023-08-01

Family

ID=77091864

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110498856.6A Active CN113222003B (en) 2021-05-08 2021-05-08 Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D

Country Status (1)

Country Link
CN (1) CN113222003B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596102A (en) * 2018-04-26 2018-09-28 北京航空航天大学青岛研究院 Indoor scene object segmentation grader building method based on RGB-D
WO2019232836A1 (en) * 2018-06-04 2019-12-12 江南大学 Multi-scale sensing pedestrian detection method based on improved full convolutional network
CN108985247A (en) * 2018-07-26 2018-12-11 北方工业大学 Multispectral image urban road identification method
CN110298361A (en) * 2019-05-22 2019-10-01 浙江省北大信息技术高等研究院 A kind of semantic segmentation method and system of RGB-D image
CN111191650A (en) * 2019-12-30 2020-05-22 北京市新技术应用研究所 Object positioning method and system based on RGB-D image visual saliency
CN111582316A (en) * 2020-04-10 2020-08-25 天津大学 RGB-D significance target detection method
CN111563418A (en) * 2020-04-14 2020-08-21 浙江科技学院 Asymmetric multi-mode fusion significance detection method based on attention mechanism
CN111967477A (en) * 2020-07-02 2020-11-20 北京大学深圳研究生院 RGB-D image saliency target detection method, device, equipment and storage medium

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Depth-Aware CNN for RGB-D Segmentation; Weiyue Wang et al.; Computer Vision – ECCV 2018; 144-161 *
Scale-aware network with modality-awareness for RGB-D indoor semantic segmentation; Feng Zhou et al.; Neurocomputing; vol. 492; 464-473 *
TSNet: Three-Stream Self-Attention Network for RGB-D Indoor Semantic Segmentation; Wujie Zhou et al.; IEEE Intelligent Systems; vol. 36, no. 4; 73-78 *
Reverse fusion instance segmentation algorithm based on RGB-D; 汪丹丹 et al.; Journal of Graphics; vol. 42, no. 5; 767-774 *
Human action recognition model based on an improved deep neural network; 何冰倩 et al.; Application Research of Computers; vol. 36, no. 10; 3107-3111 *
RGB-D image semantic segmentation network based on channel attention mechanism; 吴子涵 et al.; Electronic Design Engineering; vol. 28, no. 13; 147-153, 159 *

Also Published As

Publication number Publication date
CN113222003A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
CN110298262B (en) Object identification method and device
CN111052126B (en) Pedestrian attribute identification and positioning method and convolutional neural network system
CN109559320B (en) Method and system for realizing visual SLAM semantic mapping function based on hole convolution deep neural network
CN111814661B (en) Human body behavior recognition method based on residual error-circulating neural network
CN109948475B (en) Human body action recognition method based on skeleton features and deep learning
CN112990211B (en) Training method, image processing method and device for neural network
CN109902548B (en) Object attribute identification method and device, computing equipment and system
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
Cadena et al. Pedestrian graph: Pedestrian crossing prediction based on 2d pose estimation and graph convolutional networks
CN111696110B (en) Scene segmentation method and system
CN110222718B (en) Image processing method and device
CN113344932B (en) Semi-supervised single-target video segmentation method
CN112329525A (en) Gesture recognition method and device based on space-time diagram convolutional neural network
CN111428664B (en) Computer vision real-time multi-person gesture estimation method based on deep learning technology
CN112801015A (en) Multi-mode face recognition method based on attention mechanism
Khan et al. A deep survey on supervised learning based human detection and activity classification methods
CN111832592A (en) RGBD significance detection method and related device
Grigorev et al. Depth estimation from single monocular images using deep hybrid network
CN111209873A (en) High-precision face key point positioning method and system based on deep learning
Manssor et al. Real-time human detection in thermal infrared imaging at night using enhanced Tiny-yolov3 network
Jiang et al. Unsupervised monocular depth perception: Focusing on moving objects
Yang et al. [Retracted] A Method of Image Semantic Segmentation Based on PSPNet
CN116740419A (en) Target detection method based on graph regulation network
CN114067273A (en) Night airport terminal thermal imaging remarkable human body segmentation detection method
Jiang et al. An end-to-end human segmentation by region proposed fully convolutional network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant