CN111583322A - Deep learning-based 2D image scene depth prediction and semantic segmentation method and system - Google Patents

Deep learning-based 2D image scene depth prediction and semantic segmentation method and system

Info

Publication number
CN111583322A
CN111583322A (application CN202010380353.4A)
Authority
CN
China
Prior art keywords
image
rgb
training
model
semantic segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010380353.4A
Other languages
Chinese (zh)
Inventor
Inventor not announced (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huayan Mutual Entertainment Technology Co ltd
Original Assignee
Beijing Huayan Mutual Entertainment Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Huayan Mutual Entertainment Technology Co ltd filed Critical Beijing Huayan Mutual Entertainment Technology Co ltd
Priority to CN202010380353.4A priority Critical patent/CN111583322A/en
Publication of CN111583322A publication Critical patent/CN111583322A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep learning-based depth prediction and semantic segmentation method and system for 2D image scenes, wherein the scene depth prediction method comprises the following steps: acquiring a plurality of RGB-D images; taking each RGB-D image as a training sample and training a scene depth prediction initial model based on a convolutional neural network; performing scene depth prediction on the RGB-D images through the scene depth prediction initial model to obtain a verification result; adjusting model training parameters according to the verification result, performing update training on the depth prediction initial model with the RGB-D images as training samples, and finally training to form a scene depth prediction model; and performing scene depth prediction on an input RGB 2D image through the scene depth prediction model to obtain a scene depth prediction result. The method realizes depth prediction of 2D image scenes and semantic segmentation of scene images, with high prediction speed, high prediction precision, and high semantic segmentation accuracy.

Description

Deep learning-based 2D image scene depth prediction and semantic segmentation method and system
Technical Field
The invention relates to the technical field of deep learning and image analysis, in particular to a depth prediction and semantic segmentation method and system for a 2D image scene based on deep learning.
Background
Depth estimation methods estimate the depth of each pixel in an image to be processed and produce a global depth map of that image, which plays an important role in computer vision and computer graphics applications. However, existing depth estimation methods usually determine depth only from the position of pixels in the image: following a bottom-up principle, objects at the bottom of the image are treated as near views and objects at the top as far views, and the depth of the image is determined accordingly. Depth values estimated in this way are usually inaccurate, the resulting depth maps have a weak sense of depth hierarchy, and, most importantly, such methods cannot predict a depth map from an input color image alone.
Semantic segmentation is classification at the pixel level: pixels belonging to the same class are grouped into one class, so that different kinds of objects in an image are segmented. Segmentation maps obtained by traditional semantic segmentation methods, such as those based on random forest classifiers, are not highly accurate. Although some deep learning-based semantic segmentation methods exist, they cannot estimate accurate depth information from a color image, so the problem of low segmentation accuracy remains unsolved.
Disclosure of Invention
The invention aims to provide a deep learning-based depth prediction and semantic segmentation method and system for 2D image scenes, so as to solve the above technical problems.
In order to achieve the purpose, the invention adopts the following technical scheme:
provided is a depth prediction method of a 2D image scene based on deep learning, comprising the following steps:
acquiring a plurality of RGB-D images to form an image data set;
dividing the image data set into a training set and a verification set according to a preset division ratio;
taking each RGB-D image in a training set as a training sample, and training on the basis of a convolutional neural network to form a scene depth prediction initial model;
scene depth prediction is carried out on the RGB-D images in the verification set through the scene depth prediction initial model so as to verify the model performance of the scene depth prediction initial model and obtain a verification result;
adjusting model training parameters according to the verification result, then performing update training on the depth prediction initial model by taking the RGB-D images in the training set as training samples, and finally training to form a scene depth prediction model;
and performing scene depth prediction on the input RGB image through the scene depth prediction model, and outputting a scene depth prediction result.
As a preferred embodiment of the present invention, the training samples for training the scene depth prediction model are augmented by performing any one or more of random flipping, random cropping, scaling, and random rotation on the RGB-D image.
As a preferred scheme of the invention, the scene depth prediction model is trained based on a ResNet convolutional neural network architecture.
As a preferred scheme of the invention, a Huber regression loss function is used to verify the model performance of the scene depth prediction initial model.
The invention also provides a deep learning-based depth prediction system for 2D image scenes, which can implement the above image scene depth prediction method, the system comprising:
the image acquisition module is used for acquiring the RGB-D image from an external image database;
the image storage module is connected with the image acquisition module and used for storing the acquired RGB-D image;
the image dividing module is connected with the image storage module and is used for dividing the stored RGB-D images into a training set and a verification set according to a preset dividing proportion;
the initial model training module is connected with the image storage module and used for taking each RGB-D image in the training set as a training sample and training to form the scene depth prediction initial model;
the model performance prediction module is respectively connected with the initial model training module and the image storage module and is used for verifying the prediction performance of the scene depth prediction initial model by taking the RGB-D images in the verification set as verification samples to obtain a verification result;
the verification result display module is connected with the model performance prediction module and used for displaying the verification result to a user;
the model parameter adjusting module is connected with the verification result display module and used for providing the user with model training parameters adjusted according to the verification result;
the model updating training module is respectively connected with the image storage module, the initial model training module and the model parameter adjusting module and is used for updating and training the depth prediction initial model by taking each RGB-D image in a training set as a training sample according to the adjusted model parameter to finally train and form the scene depth prediction model;
and the scene depth prediction module is connected with the model updating training module and used for performing image scene depth prediction on the input RGB image through the scene depth prediction model.
The invention also provides a 2D image scene semantic segmentation method based on deep learning, which comprises the following steps:
inputting the RGB-D image into a feature extractor to extract an RGB feature map corresponding to the RGB image in the RGB-D image;
inputting the ground truth-value depth map corresponding to the RGB-D image into the feature extractor to extract a truth-value depth feature map corresponding to the ground truth-value depth map;
carrying out image fusion on the RGB feature map and the true value depth feature map to obtain a feature fusion map;
and performing semantic segmentation on the feature fusion graph through a pre-trained semantic segmentation model, and outputting a semantic segmentation result.
As a preferred aspect of the present invention, the method for training the semantic segmentation model includes:
acquiring the RGB-D image and the ground truth value depth map corresponding to the RGB-D image to form an image data set;
dividing the image data set into a training set and a verification set according to a preset division ratio;
taking each RGB-D image in the training set and the ground truth-value depth map corresponding to the RGB-D image as training samples, and training on the basis of a convolutional neural network to form a semantic segmentation initial model;
performing semantic segmentation on the RGB images in the verification set through the semantic segmentation initial model to verify the model performance of the semantic segmentation initial model to obtain a verification result;
and adjusting model training parameters according to the verification result, then updating and training the semantic segmentation initial model by taking the image data in the training set as a training sample, and finally training to form the semantic segmentation model.
In a preferred embodiment of the present invention, the image size of the RGB feature map or the true-value depth feature map output by the feature extractor is 160 × 128 × 64.
As a preferred scheme of the present invention, the network structure of the convolutional neural network for training the semantic segmentation model at least comprises an up-convolution layer, a first convolution layer, a second convolution layer, and an upsampling layer, wherein the first convolution layer is connected to the up-convolution layer, the second convolution layer is connected to the first convolution layer, and the upsampling layer is connected to the second convolution layer; the up-convolution layer performs an up-convolution operation on the feature fusion map, and the upsampling layer upsamples the feature map output by the second convolution layer and outputs a semantic segmentation result.
The invention also provides a deep learning-based 2D image scene semantic segmentation system, which can implement the above 2D image scene semantic segmentation method, the system comprising:
the first image acquisition module is used for acquiring and storing the RGB-D image;
the second image acquisition module is used for acquiring the ground truth value depth map corresponding to the RGB-D image;
the first image feature extraction module is connected with the first image acquisition module and is used for extracting the RGB feature map corresponding to the RGB image in the RGB-D image;
the second image feature extraction module is connected with the second image acquisition module and is used for extracting the true value depth feature map corresponding to the ground true value depth map;
the feature fusion module is respectively connected with the first image feature extraction module and the second image feature extraction module and is used for carrying out image fusion on the RGB feature map and the truth-value depth feature map to obtain a feature fusion map;
and the semantic segmentation module is connected with the feature fusion module and used for performing semantic segmentation on the feature fusion graph through a pre-trained semantic segmentation model and outputting a semantic segmentation result.
The depth prediction method and system of the invention realize depth prediction of 2D image scenes with high prediction speed and prediction precision, and can accurately obtain the depth information of an input color image. In addition, semantic segmentation of the color image based on the accurately predicted depth information greatly improves the segmentation accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the embodiments are briefly described below. It is obvious that the drawings described below show only some embodiments of the invention, and that a person skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a diagram illustrating steps of a deep learning-based 2D image scene depth prediction method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a depth prediction system for a deep learning based 2D image scene according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating steps of a deep learning-based 2D image scene semantic segmentation method according to an embodiment of the present invention;
FIG. 4 is a diagram of method steps for training the semantic segmentation model;
FIG. 5 is a block diagram of a deep learning based 2D image scene semantic segmentation system according to an embodiment of the present invention;
FIG. 6 is a network architecture diagram of the feature extractor extracting the RGB feature map or the true depth feature map;
fig. 7 is a schematic diagram of predicting a depth of a 2D image scene and performing semantic segmentation of the image scene.
Detailed Description
The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings.
The drawings are for illustration only, are shown in schematic rather than actual form, and are not to be construed as limiting this patent; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged, or reduced and do not represent the size of an actual product; and it will be understood by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components. In the description of the present invention, terms such as "upper", "lower", "left", "right", "inner", and "outer", if used to indicate an orientation or positional relationship, are based on the orientations or positional relationships shown in the drawings, are used only for convenience and simplicity of description, and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation; such terms are therefore illustrative only and are not to be construed as limiting this patent. The specific meanings of these terms can be understood by those skilled in the art according to the specific situation.
In the description of the present invention, unless otherwise explicitly specified and limited, the term "connected" and the like, where it indicates a connection relationship between components, is to be understood broadly: for example, as a fixed, detachable, or integral connection; as a mechanical or electrical connection; as a direct connection or an indirect connection through an intervening medium; or as a connection through one or more other components or an interaction relationship between components. The specific meanings of the above terms in the present invention can be understood by those skilled in the art in specific cases.
Fig. 1 illustrates a depth prediction method for a 2D image scene based on deep learning according to an embodiment of the present invention. Referring to fig. 1, the depth prediction method for an image scene based on deep learning according to the present embodiment includes the following steps:
step S1, acquiring a plurality of RGB-D images to form an image data set; an RGB-D image is a color-depth image that actually comprises two images: an RGB color image and the depth (D) image corresponding to it;
step S2, dividing the image data set into a training set and a verification set according to a preset division ratio;
step S3, each RGB-D image in the training set is used as a training sample, and a scene depth prediction initial model is formed based on convolutional neural network training;
step S4, performing scene depth prediction on the RGB-D images in the verification set through the scene depth prediction initial model, specifically, performing scene depth prediction on the RGB images in the RGB-D images through the scene depth prediction initial model to verify the model performance of the scene depth prediction initial model and obtain a verification result;
step S5, adjusting model training parameters according to the verification result, then performing update training on the depth prediction initial model by taking the RGB-D images in the training set as training samples, and finally training to form a scene depth prediction model;
in step S6, the normal scene depth prediction model performs scene depth prediction on the input RGB image, and outputs a scene depth prediction result.
In order to ensure the diversity of the training samples, preferably, in the embodiment of the present invention, the training samples for training the scene depth prediction model are augmented by image preprocessing such as random flipping, random cropping, scaling, or random rotation of the RGB-D images, so that the trained scene depth prediction model achieves higher prediction accuracy.
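As a minimal sketch of such an augmentation step (an assumption about one possible realization, not the patented implementation), the same geometric transform must be applied to the RGB image and its depth map so that the pair stays aligned:

```python
import random
import torchvision.transforms.functional as TF

def augment_pair(rgb, depth):
    """Apply identical geometric augmentations to an RGB image and its depth
    map. Both are assumed to be PIL Images of at least 304 x 228 pixels; the
    parameter ranges below are illustrative assumptions."""
    # Random horizontal flip, applied to both images together.
    if random.random() < 0.5:
        rgb, depth = TF.hflip(rgb), TF.hflip(depth)

    # Random rotation by a small angle.
    angle = random.uniform(-5.0, 5.0)
    rgb, depth = TF.rotate(rgb, angle), TF.rotate(depth, angle)

    # Random scaling, then a random crop back to the 304 x 228 network input
    # size given in the description. (When an image is enlarged by a factor s,
    # depth values are typically divided by s as well; omitted here.)
    s = random.uniform(1.0, 1.5)
    w, h = rgb.size
    new_h, new_w = int(h * s), int(w * s)
    rgb, depth = TF.resize(rgb, [new_h, new_w]), TF.resize(depth, [new_h, new_w])
    top = random.randint(0, new_h - 228)
    left = random.randint(0, new_w - 304)
    rgb = TF.crop(rgb, top, left, 228, 304)
    depth = TF.crop(depth, top, left, 228, 304)
    return rgb, depth
```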
In step S3, in the embodiment of the present invention, the scene depth prediction model is preferably trained based on the ResNet convolutional neural network architecture. Please refer to fig. 7a for the detailed network structure of the ResNet residual network; fig. 6 shows the internal network structure of the feature extractor in fig. 7a. As can be seen from fig. 7a, the RGB image input to the convolutional neural network has an image size of 304 × 228 × 3, and the feature map output by the feature extractor has an image size of 160 × 128 × 64; this feature map is passed through a 3 × 3 convolution to output a feature map of size 160 × 128 × 1, which is then upsampled to finally output a predicted depth map of size 640 × 480.
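Restated as code, the decoding path described above (160 × 128 × 64 feature map, a 3 × 3 convolution down to one channel, then upsampling to 640 × 480) might look like the following sketch; the module name DepthHead and the bilinear upsampling mode are assumptions, and the ResNet feature extractor is taken as given:

```python
import torch.nn as nn

class DepthHead(nn.Module):
    """Illustrative decoding head matching the sizes in the description:
    (N, 64, 128, 160) feature map -> 3 x 3 conv -> (N, 1, 128, 160) ->
    upsampling -> (N, 1, 480, 640) predicted depth map."""

    def __init__(self):
        super().__init__()
        # 3 x 3 convolution reducing the 64-channel feature map to 1 channel.
        self.conv = nn.Conv2d(64, 1, kernel_size=3, padding=1)
        # Upsample the 160 x 128 map to the 640 x 480 output resolution
        # (bilinear interpolation is an assumed choice).
        self.upsample = nn.Upsample(size=(480, 640), mode="bilinear",
                                    align_corners=False)

    def forward(self, features):
        # features: (N, 64, 128, 160); PyTorch uses N x C x H x W ordering,
        # so the 160 x 128 x 64 map in the text appears here as 64 x 128 x 160.
        return self.upsample(self.conv(features))
```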
In the embodiment of the invention, a Huber regression loss function is preferably adopted to verify the model performance of the scene depth prediction initial model. Specifically, the loss between the predicted depth map and the ground truth depth map corresponding to the input RGB image is calculated through the Huber loss function, thereby verifying the model performance of the scene depth prediction initial model. Since the calculation of the Huber loss between the predicted depth map and the ground truth depth map is not itself within the scope of the claimed invention, the loss calculation process is not set forth in detail herein.
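The Huber loss itself is standard; as a reference sketch of what the verification step computes (the threshold delta = 1.0 is an assumed default, matching torch.nn.HuberLoss):

```python
import torch

def huber_loss(pred, target, delta=1.0):
    """Huber regression loss between the predicted depth map and the ground
    truth depth map: quadratic for small residuals, linear for large ones."""
    residual = torch.abs(pred - target)
    quadratic = 0.5 * residual ** 2
    linear = delta * (residual - 0.5 * delta)
    return torch.where(residual <= delta, quadratic, linear).mean()

# PyTorch 1.9+ provides the same computation as torch.nn.HuberLoss(delta=1.0).
```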
Referring to fig. 2, the present invention further provides a depth prediction system for a 2D image scene based on deep learning, which can implement the depth prediction method for an image scene, and the system includes:
the image acquisition module 1 is used for acquiring an RGB-D image from an external image database;
the image storage module 2 is connected with the image acquisition module 1 and is used for storing the acquired RGB-D image;
the image dividing module 3 is connected with the image storage module 2 and is used for dividing the stored RGB-D images into a training set and a verification set according to a preset dividing proportion;
the initial model training module 4 is connected with the image storage module 2 and used for training each RGB-D image in the training set as a training sample to form a scene depth prediction initial model;
the model performance prediction module 5 is connected with the initial model training module 4 and used for verifying the prediction performance of the scene depth prediction initial model by taking the RGB-D images in the verification set as verification samples to obtain a verification result;
the verification result display module 6 is connected with the model performance prediction module 5 and used for displaying the verification result to the user;
the model parameter adjusting module 7 is connected with the verification result displaying module 6 and used for providing the user with the model training parameters adjusted according to the verification result;
the model updating training module 8 is respectively connected with the image storage module 2, the initial model training module 4 and the model parameter adjusting module 7 and is used for updating and training the depth prediction initial model by taking each RGB-D image in the training set as a training sample according to the adjusted model parameters and finally training to form a scene depth prediction model;
and the scene depth prediction module 9 is connected with the model updating training module 8 and used for performing image scene depth prediction on the input RGB image through the scene depth prediction model.
The invention also provides a deep learning-based 2D image scene semantic segmentation method, please refer to fig. 3 and 7b, which specifically comprises the following steps:
step L1, inputting the RGB-D image into a feature extractor to extract the RGB feature map corresponding to the RGB image in the RGB-D image;
step L2, inputting the ground truth-value depth map corresponding to the RGB-D image into a feature extractor to extract a truth-value depth feature map corresponding to the ground truth-value depth map;
l3, carrying out image fusion on the RGB feature map and the true-value depth feature map to obtain a feature fusion map;
and L4, performing semantic segmentation on the feature fusion graph through a pre-trained semantic segmentation model, and outputting a semantic segmentation result.
Please refer to fig. 6 for the internal network structure of the feature extractor described in steps L1 and L2. The RGB image input to the feature extractor, or the ground truth depth map corresponding to it, has an image size of 304 × 228 × 3, and the RGB feature map, or true-value depth feature map, output by the feature extractor has an image size of 160 × 128 × 64.
In step L3, the image size of the feature fusion map formed by image fusion of the RGB feature map and the true-value depth feature map is 160 × 128 × 64.
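The description does not state which fusion operator is used; since the fused map keeps the 160 × 128 × 64 size of each input feature map, element-wise addition is one size-consistent choice (channel-wise concatenation would instead yield 128 channels). A minimal sketch under that assumption:

```python
import torch

def fuse_features(rgb_feat, depth_feat):
    """Element-wise fusion of the RGB feature map and the true-value depth
    feature map. Both are (N, 64, 128, 160) tensors, and so is the result,
    matching the 160 x 128 x 64 fusion map size in the description. Addition
    rather than concatenation is an assumption."""
    assert rgb_feat.shape == depth_feat.shape
    return rgb_feat + depth_feat
```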
In step L4, the network structure of the convolutional neural network for training the semantic segmentation model at least includes an up-convolution layer, a first convolution layer, a second convolution layer, and an upsampling layer, wherein the first convolution layer is connected to the up-convolution layer, the second convolution layer is connected to the first convolution layer, and the upsampling layer is connected to the second convolution layer; the up-convolution layer performs an up-convolution operation on the feature fusion map, and the upsampling layer upsamples the feature map output by the second convolution layer and outputs the semantic segmentation result.
In this embodiment, the convolution kernel size of the first convolution layer and the second convolution layer is preferably 3 × 3. The segmentation map output by the semantic segmentation model has a size of 640 × 480 × 38, where 38 is the number of semantic labels in the segmentation.
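Putting the named layers together (up-convolution, two 3 × 3 convolutions, upsampling to 640 × 480 × 38), a hypothetical rendering of the segmentation decoder follows; the intermediate channel width, the stride of the up-convolution, and the ReLU activations are assumptions not fixed by the description:

```python
import torch
import torch.nn as nn

class SegDecoder(nn.Module):
    """Illustrative segmentation decoder: an up-convolution on the
    160 x 128 x 64 feature fusion map, two 3 x 3 convolutions, and
    upsampling to a 640 x 480 x 38 segmentation map (38 semantic labels)."""

    def __init__(self, in_ch=64, mid_ch=64, n_classes=38):
        super().__init__()
        # Up-convolution (transposed convolution) doubling the spatial size.
        self.upconv = nn.ConvTranspose2d(in_ch, mid_ch, kernel_size=2, stride=2)
        self.conv1 = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(mid_ch, n_classes, kernel_size=3, padding=1)
        self.upsample = nn.Upsample(size=(480, 640), mode="bilinear",
                                    align_corners=False)

    def forward(self, fused):
        x = torch.relu(self.upconv(fused))   # (N, 64, 128, 160) -> (N, 64, 256, 320)
        x = torch.relu(self.conv1(x))
        x = self.conv2(x)                    # -> (N, 38, 256, 320)
        return self.upsample(x)              # -> (N, 38, 480, 640)
```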
Referring to fig. 4, the method for training the semantic segmentation model according to the embodiment of the present invention includes the following steps:
step M1, acquiring an RGB-D image and a ground truth value depth map corresponding to the RGB-D image to form an image data set;
step M2, dividing the image data set into a training set and a verification set according to a preset division ratio;
step M3, taking each RGB-D image in the training set and the ground truth depth map corresponding to the RGB-D image as training samples, and training based on a convolutional neural network to form a semantic segmentation initial model; preferably, the semantic segmentation initial model is trained by adopting a ResNet convolutional neural network architecture;
step M4, performing semantic segmentation on the RGB images in the verification set through the semantic segmentation initial model to verify the model performance of the semantic segmentation initial model and obtain a verification result; preferably, the model performance of the semantic segmentation initial model is verified through an L2 loss function; the specific verification process is not described herein;
and step M5, adjusting model training parameters according to the verification result, then updating and training the semantic segmentation initial model by taking the image data in the training set as a training sample, and finally training to form a semantic segmentation model.
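For step M4, a sketch of an L2 verification loss over the segmentation output follows; comparing logits against one-hot labels is an assumed reading, since the description does not fix the exact formulation:

```python
import torch
import torch.nn.functional as F

def l2_validation_loss(logits, labels, n_classes=38):
    """Mean squared (L2) error between the predicted segmentation map and the
    one-hot encoded ground truth labels, as one possible reading of the L2
    loss in step M4. logits: (N, 38, H, W); labels: (N, H, W) integer map."""
    one_hot = F.one_hot(labels, n_classes).permute(0, 3, 1, 2).float()
    return torch.mean((logits - one_hot) ** 2)
```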
Referring to fig. 5, the present invention further provides a deep learning-based 2D image scene semantic segmentation system, which can implement the image scene semantic segmentation method described above, and the system includes:
the first image acquisition module 11 is used for acquiring and storing an RGB-D image;
the second image acquisition module 12 is configured to acquire a ground truth depth map corresponding to the RGB-D image;
the first image feature extraction module 13 is connected to the first image acquisition module 11, and is configured to extract an RGB feature map corresponding to an RGB image in the RGB-D image; referring to fig. 7a, the image size of the RGB feature map is 160 × 128 × 64;
the second image feature extraction module 14 is connected to the second image acquisition module 12, and is configured to extract a true-value depth feature map corresponding to the ground truth depth map; referring to fig. 7b, the image size of the true-value depth feature map is also 160 × 128 × 64;
the feature fusion module 15 is respectively connected to the first image feature extraction module 13 and the second image feature extraction module 14, and is configured to perform image fusion on the RGB feature map and the true-value depth feature map to obtain a feature fusion map; the image size of the feature fusion map is 160 × 128 × 64;
and the semantic segmentation module 16 is connected with the feature fusion module 15 and used for performing semantic segmentation on the feature fusion map through a pre-trained semantic segmentation model and outputting a semantic segmentation result.
It should be understood that the above-described embodiments are merely preferred embodiments of the invention and the technical principles applied thereto. It will be understood by those skilled in the art that various modifications, equivalents, changes, and the like can be made to the present invention. However, such variations are within the scope of the invention as long as they do not depart from the spirit of the invention. In addition, certain terms used in the specification and claims of the present application are not limiting, but are used merely for convenience of description.

Claims (10)

1. A depth prediction method for a 2D image scene based on deep learning is characterized by comprising the following steps:
acquiring a plurality of RGB-D images to form an image data set;
dividing the image data set into a training set and a verification set according to a preset division ratio;
taking each RGB-D image in a training set as a training sample, and training on the basis of a convolutional neural network to form a scene depth prediction initial model;
scene depth prediction is carried out on the RGB-D images in the verification set through the scene depth prediction initial model so as to verify the model performance of the scene depth prediction initial model and obtain a verification result;
adjusting model training parameters according to the verification result, then performing update training on the depth prediction initial model by taking the RGB-D images in the training set as training samples, and finally training to form a scene depth prediction model;
and performing scene depth prediction on the input RGB image through the scene depth prediction model, and outputting a scene depth prediction result.
2. The method of claim 1, wherein the training samples for training the scene depth prediction model are augmented by any one or more of random flipping, random cropping, scaling, or random rotation of the RGB-D image.
3. The 2D image scene depth prediction method of claim 1, wherein the scene depth prediction model is trained based on a ResNet convolutional neural network architecture.
4. The method for 2D image scene depth prediction according to claim 1, wherein a Huber regression loss function is used to verify the model performance of the scene depth prediction initial model.
5. A deep learning-based 2D image scene depth prediction system capable of implementing the image scene depth prediction method according to any one of claims 1 to 4, characterized by comprising:
the image acquisition module is used for acquiring the RGB-D image from an external image database;
the image storage module is connected with the image acquisition module and used for storing the acquired RGB-D image;
the image dividing module is connected with the image storage module and is used for dividing the stored RGB-D images into a training set and a verification set according to a preset dividing proportion;
the initial model training module is connected with the image storage module and used for taking each RGB-D image in the training set as a training sample and training to form the scene depth prediction initial model;
the model performance prediction module is respectively connected with the initial model training module and the image storage module and is used for verifying the prediction performance of the scene depth prediction initial model by taking the RGB-D images in the verification set as verification samples to obtain a verification result;
the verification result display module is connected with the model performance prediction module and used for displaying the verification result to a user;
the model parameter adjusting module is connected with the verification result display module and used for providing the user with model training parameters adjusted according to the verification result;
the model updating training module is respectively connected with the image storage module, the initial model training module and the model parameter adjusting module and is used for updating and training the depth prediction initial model by taking each RGB-D image in a training set as a training sample according to the adjusted model parameter to finally train and form the scene depth prediction model;
and the scene depth prediction module is connected with the model updating training module and used for performing image scene depth prediction on the input RGB image through the scene depth prediction model.
6. A2D image scene semantic segmentation method based on deep learning is characterized by comprising the following steps:
inputting the RGB-D image into a feature extractor to extract an RGB feature map corresponding to the RGB image in the RGB-D image;
inputting the ground truth-value depth map corresponding to the RGB-D image into the feature extractor to extract a truth-value depth feature map corresponding to the ground truth-value depth map;
carrying out image fusion on the RGB feature map and the true value depth feature map to obtain a feature fusion map;
and performing semantic segmentation on the feature fusion graph through a pre-trained semantic segmentation model, and outputting a semantic segmentation result.
7. The method for deep learning based 2D image scene semantic segmentation as claimed in claim 6, wherein the method for training the semantic segmentation model comprises:
acquiring the RGB-D image and the ground truth value depth map corresponding to the RGB-D image to form an image data set;
dividing the image data set into a training set and a verification set according to a preset division ratio;
taking each RGB-D image in the training set and the ground truth-value depth map corresponding to the RGB-D image as training samples, and training on the basis of a convolutional neural network to form a semantic segmentation initial model;
performing semantic segmentation on the RGB images in the verification set through the semantic segmentation initial model to verify the model performance of the semantic segmentation initial model to obtain a verification result;
and adjusting model training parameters according to the verification result, then updating and training the semantic segmentation initial model by taking the image data in the training set as a training sample, and finally training to form the semantic segmentation model.
8. The method according to claim 7, wherein the image size of the RGB feature map or the true-value depth feature map output by the feature extractor is 160 × 128 × 64.
9. The deep learning-based 2D image scene semantic segmentation method according to claim 6, wherein a network structure of the convolutional neural network for training the semantic segmentation model at least includes an up-convolution layer, a first convolution layer, a second convolution layer, and an upsampling layer, the first convolution layer is connected to the up-convolution layer, the second convolution layer is connected to the first convolution layer, and the upsampling layer is connected to the second convolution layer; the up-convolution layer performs an up-convolution operation on the feature fusion map, and the upsampling layer upsamples the feature map output by the second convolution layer and outputs a semantic segmentation result.
10. A deep learning-based 2D image scene semantic segmentation system capable of implementing the image scene semantic segmentation method according to any one of claims 6 to 9, characterized by comprising:
the first image acquisition module is used for acquiring and storing the RGB-D image;
the second image acquisition module is used for acquiring the ground truth value depth map corresponding to the RGB-D image;
the first image feature extraction module is connected with the first image acquisition module and is used for extracting the RGB feature map corresponding to the RGB image in the RGB-D image;
the second image feature extraction module is connected with the second image acquisition module and is used for extracting the true value depth feature map corresponding to the ground true value depth map;
the feature fusion module is respectively connected with the first image feature extraction module and the second image feature extraction module and is used for carrying out image fusion on the RGB feature map and the truth-value depth feature map to obtain a feature fusion map;
and the semantic segmentation module is connected with the feature fusion module and used for performing semantic segmentation on the feature fusion graph through a pre-trained semantic segmentation model and outputting a semantic segmentation result.
CN202010380353.4A 2020-05-09 2020-05-09 Deep learning-based 2D image scene depth prediction and semantic segmentation method and system Pending CN111583322A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010380353.4A CN111583322A (en) 2020-05-09 2020-05-09 Deep learning-based 2D image scene depth prediction and semantic segmentation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010380353.4A CN111583322A (en) 2020-05-09 2020-05-09 Deep learning-based 2D image scene depth prediction and semantic segmentation method and system

Publications (1)

Publication Number Publication Date
CN111583322A true CN111583322A (en) 2020-08-25

Family

ID=72112565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010380353.4A Pending CN111583322A (en) 2020-05-09 2020-05-09 Depth learning-based 2D image scene depth prediction and semantic segmentation method and system

Country Status (1)

Country Link
CN (1) CN111583322A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180150727A1 (en) * 2016-11-29 2018-05-31 Sap Se Object Detection in Image Data Using Depth Segmentation
CN107403430A (en) * 2017-06-15 2017-11-28 中山大学 A kind of RGBD image, semantics dividing method
CN110599533A (en) * 2019-09-20 2019-12-20 湖南大学 Rapid monocular depth estimation method suitable for embedded platform
CN111080659A (en) * 2019-12-19 2020-04-28 哈尔滨工业大学 Environmental semantic perception method based on visual information

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
代具亭 et al.: "Scene semantic segmentation network based on color-depth images and deep learning", Science Technology and Engineering *
王子羽 et al.: "Optimization of indoor scene semantic segmentation networks based on RGB-D images", Automation & Information Engineering *
袁建中 et al.: "Depth estimation of road scenes based on deep convolutional neural networks", Laser & Optoelectronics Progress *
韦鹏程 et al.: Integration and Development of Big Data Analytics and Machine Learning, 31 May 2017, University of Electronic Science and Technology Press *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990304A (en) * 2021-03-12 2021-06-18 国网智能科技股份有限公司 Semantic analysis method and system suitable for power scene
CN112990304B (en) * 2021-03-12 2024-03-12 国网智能科技股份有限公司 Semantic analysis method and system suitable for power scene
CN114022871A (en) * 2021-11-10 2022-02-08 中国民用航空飞行学院 Unmanned aerial vehicle driver fatigue detection method and system based on depth perception technology
WO2023138062A1 (en) * 2022-01-19 2023-07-27 美的集团(上海)有限公司 Image processing method and apparatus

Similar Documents

Publication Publication Date Title
US10789504B2 (en) Method and device for extracting information in histogram
CN109508681B (en) Method and device for generating human body key point detection model
CN112132156B (en) Image saliency target detection method and system based on multi-depth feature fusion
EP3916627A1 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN108830285B (en) Target detection method for reinforcement learning based on fast-RCNN
CN109960742B (en) Local information searching method and device
US9042648B2 (en) Salient object segmentation
CN111583322A (en) Depth learning-based 2D image scene depth prediction and semantic segmentation method and system
CN111652869B (en) Slab void identification method, system, medium and terminal based on deep learning
WO2020101777A1 (en) Segmenting objects by refining shape priors
CN111369581A (en) Image processing method, device, equipment and storage medium
CN113128271A (en) Counterfeit detection of face images
CN110781980B (en) Training method of target detection model, target detection method and device
CN116051953A (en) Small target detection method based on selectable convolution kernel network and weighted bidirectional feature pyramid
CN113255557B (en) Deep learning-based video crowd emotion analysis method and system
CN109977834B (en) Method and device for segmenting human hand and interactive object from depth image
CN112651333B (en) Silence living body detection method, silence living body detection device, terminal equipment and storage medium
CN112101344B (en) Video text tracking method and device
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN111210417B (en) Cloth defect detection method based on convolutional neural network
CN112990213B (en) Digital multimeter character recognition system and method based on deep learning
CN116977683A (en) Object recognition method, apparatus, computer device, storage medium, and program product
CN112651351B (en) Data processing method and device
CN111325194B (en) Character recognition method, device and equipment and storage medium
CN113706636A (en) Method and device for identifying tampered image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20200825