CN110781717A - Cab scene semantic and visual depth combined analysis method - Google Patents

Cab scene semantic and visual depth combined analysis method

Info

Publication number
CN110781717A
Authority
CN
China
Prior art keywords
scene
semantic
output
cab
depth
Prior art date
Legal status
Pending
Application number
CN201910734881.2A
Other languages
Chinese (zh)
Inventor
缪其恒
苏志杰
王江明
许炜
Current Assignee
Zhejiang Leapmotor Technology Co Ltd
Zhejiang Zero Run Technology Co Ltd
Original Assignee
Zhejiang Zero Run Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Zero Run Technology Co Ltd filed Critical Zhejiang Zero Run Technology Co Ltd
Priority to CN201910734881.2A
Publication of CN110781717A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/50: Context or environment of the image
    • G06V 20/59: Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Abstract

The invention relates to a cab scene semantic and visual depth combined analysis method, which comprises the following steps: establishing a neural network model comprising a scene semantic output branch and a scene visual depth output branch; training the scene semantic output branch; performing joint training on the scene semantic output branch and the scene visual depth output branch; collecting and preprocessing images; post-processing the model outputs; and integrating the outputs of the neural network model into scene structured data. The invention has the advantage that, unlike existing vision systems that can only perceive driver behavior, it extracts cab scene structured data through joint analysis of cab semantic and depth information and can therefore perceive the whole cab, including the driver seat and the passenger seats, providing the prior information needed for analyzing the various behaviors and occupant attributes in the cab scene area.

Description

Cab scene semantic and visual depth combined analysis method
Technical Field
The invention relates to the field of visual perception, in particular to a cab scene semantic and visual depth combined analysis method.
Background
Intelligentization is one of the important trends in the development of today's automobile industry. Vision systems play an important role in existing automatic driving/driver assistance systems and are mainly used to perceive the driving scene data required by related applications. Existing mass-production vision systems mainly sense driving scene data around the vehicle (drivable area, traffic participants, traffic sign recognition and the like) and the driving state of the driver inside the vehicle (fatigue, inattentive driving behavior and the like). Compared with perception of the scene around the vehicle, perception of the scene inside the cab, and the applications built on it, remain limited.
Existing in-cab vision applications use only a region of interest around the driver's face (or upper body), which cannot support subsequent cab analysis applications such as driving state monitoring and cab state analysis. To enable vision applications with a higher level of intelligence, the cab analysis system needs the capability to automatically measure more cab prior information, including driver seat area information, passenger seat area information, steering wheel area information and prior information for other customized functions.
Disclosure of Invention
The invention mainly solves the problem that existing vision systems cannot obtain driver seat area information, passenger seat area information, steering wheel area information and prior information for other customized functions, and provides a cab scene semantic and visual depth combined analysis method which uses an infrared system to collect images inside the cab and analyzes them with a deep convolutional neural network to finally obtain the cab prior information.
The invention solves the technical problem by adopting the technical scheme that a cab scene semantic and visual depth combined analysis method comprises the following steps:
s1: establishing a neural network model comprising a scene semantic output branch and a scene visual depth output branch;
s2: training the scene semantic output branch;
s3: performing joint training on the scene semantic output branch and the scene visual depth output branch;
s4: collecting and preprocessing an image;
s5: model output post-processing;
s6: the outputs of the neural network model are integrated into scene structured data.
The scene semantic output branch takes the cab infrared image as input and uses a deep convolutional neural network, trained on samples with manually labeled scene semantic information, to perform cab scene semantic analysis and output a cab scene semantic layer containing information such as the seats and the steering wheel. The scene visual depth output branch also takes the cab infrared image as input and shares the shallow feature part with the scene semantic analysis network; it uses a deep convolutional neural network, trained on labels generated with a lidar (laser radar), to perform cab scene visual depth analysis and outputs a normalized scene visual depth layer with the same resolution as the input picture. From the cab scene semantic layer and the normalized scene visual depth layer, scene structured data including the seats and the steering wheel can be obtained.
As a preferable scheme of the above scheme, the neural network model includes an input layer, a shared feature coding layer, and two branch output layers, and the branch output layers include a semantic output layer and a depth-of-view output layer.
As a preferred scheme of the above scheme, the input of the neural network model is a single-channel infrared image, the output of the semantic output layer is a scene semantic layer containing foreground semantic outputs representing the people and objects in the cab, and the output of the visual depth output layer is a normalized scene visual depth layer in which the intensity of each pixel represents the actual distance between the camera and the object or person corresponding to that pixel.
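By way of illustration only, the two-branch topology described above can be sketched as follows (PyTorch is assumed; the encoder depth, channel counts and upsampling factors are illustrative choices, not taken from the patent):

```python
import torch
import torch.nn as nn

class CabSceneNet(nn.Module):
    """Shared feature encoder with a semantic branch and a visual-depth branch."""
    def __init__(self, num_classes=7):
        super().__init__()
        # Shared feature coding layer: a small convolutional encoder (illustrative).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Semantic output layer: per-pixel class scores, upsampled to input resolution.
        self.semantic_head = nn.Sequential(
            nn.Conv2d(64, num_classes, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )
        # Visual-depth output layer: one normalized depth value per pixel in [0, 1].
        self.depth_head = nn.Sequential(
            nn.Conv2d(64, 1, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (N, 1, H, W) single-channel infrared image
        features = self.encoder(x)
        return self.semantic_head(features), self.depth_head(features)
```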
As a preferable scheme of the above scheme, the scene semantic output branch training in step S2 includes the following steps:
s21: collecting cab scene data of different visual angles, vehicle types and illumination conditions;
s22: labeling various foreground semantic outputs of the semantic layer to generate a training data set;
s23: expanding the training data set;
s24: randomly initializing the scene semantic output branch parameters, optimizing the pixel-level semantic loss function L_Sem by a batch stochastic gradient descent method, and updating the scene semantic output branch parameters

L_{Sem} = -\frac{1}{W \cdot H}\sum_{u=1}^{W}\sum_{v=1}^{H} s_{u,v}\,\log(p_{u,v})

wherein [u, v] are the pixel coordinates of each point in the image coordinate system, [W, H] are the width and height of the input image, s_{u,v} is the semantic label at the corresponding coordinates, and p_{u,v} is the predicted value of the scene semantic output branch at the corresponding coordinates.
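The exact form of the pixel-level loss is reconstructed above from its symbol definitions; the sketch below therefore assumes a per-pixel cross-entropy and uses illustrative optimizer settings for the batch stochastic gradient descent named in S24:

```python
import torch
import torch.nn as nn

model = CabSceneNet(num_classes=7)           # two-branch model from the earlier sketch
criterion = nn.CrossEntropyLoss()            # assumed form of the pixel-level semantic loss L_Sem
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

def semantic_training_step(images, labels):
    """One batch SGD step for the scene semantic output branch.
    images: (N, 1, H, W) infrared batch; labels: (N, H, W) class indices."""
    optimizer.zero_grad()
    sem_logits, _ = model(images)            # the depth branch output is ignored at this stage
    loss = criterion(sem_logits, labels)     # averaged over all W x H pixels
    loss.backward()
    optimizer.step()
    return loss.item()
```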
As a preferable scheme of the above scheme, the joint training of the scene semantic output branch and the scene depth output branch in step S3 includes the following steps:
s31: acquiring a joint training set;
s32: freezing the shared feature layer parameters obtained from the scene semantic output branch training;
s33: expanding the joint training set;
s34: randomly initializing the neural network model parameters, optimizing the visual depth loss function L_Dep at a high learning rate by a batch stochastic gradient descent method, and updating the scene visual depth output branch parameters

L_{Dep} = \frac{1}{W \cdot H}\sum_{x=1}^{W}\sum_{y=1}^{H}\left(d(x,y) - d_{T}(x,y)\right)^{2}

wherein d(x, y) is the visual depth at each pixel (x, y) of the output depth map, d_T(x, y) is the visual depth label of the training sample, and W and H are the image width and height;
s35: reducing the training learning rate, unfreezing the shared feature layer parameters, and updating the parameters of each branch network according to the following combined semantic and visual depth loss function L

L = k_1 \cdot L_{Sem} + k_2 \cdot L_{Dep}

wherein k_1 and k_2 are configurable weight coefficients of the semantic and visual depth loss functions in the combined loss function.
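A minimal sketch of the two-phase procedure of S32 to S35, under the same PyTorch assumption; the depth loss is assumed to be a mean squared error, and the learning rates and the weights k1, k2 are placeholders:

```python
import torch

def joint_training(model, loader, k1=0.5, k2=0.5):
    """model: CabSceneNet instance from the earlier sketch; loader yields
    (images, labels, depth_gt) with depth_gt as the normalized depth label."""
    ce = torch.nn.CrossEntropyLoss()          # semantic loss L_Sem (assumed form)
    mse = torch.nn.MSELoss()                  # visual-depth loss L_Dep (assumed form)

    # Phase 1 (S32/S34): freeze the shared feature layers, train the depth head at a high learning rate.
    for p in model.encoder.parameters():
        p.requires_grad = False
    opt = torch.optim.SGD(model.depth_head.parameters(), lr=1e-2, momentum=0.9)
    for images, labels, depth_gt in loader:
        opt.zero_grad()
        _, depth = model(images)
        mse(depth.squeeze(1), depth_gt).backward()
        opt.step()

    # Phase 2 (S35): unfreeze everything and optimize L = k1*L_Sem + k2*L_Dep at a lower rate.
    for p in model.parameters():
        p.requires_grad = True
    opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    for images, labels, depth_gt in loader:
        opt.zero_grad()
        sem_logits, depth = model(images)
        loss = k1 * ce(sem_logits, labels) + k2 * mse(depth.squeeze(1), depth_gt)
        loss.backward()
        opt.step()
```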
As a preferable scheme of the above scheme, the joint training set in S31 is obtained as follows: the scene semantic output branch training data set in step S2 is annotated with lidar point cloud output data; the camera coordinate system and the point cloud coordinates are aligned according to the camera calibration parameters and the lidar system calibration parameters; the point cloud coordinates are then transformed into the image coordinate system using the pinhole imaging principle; and the parts of the visual depth map without valid data are completed by bilinear interpolation.
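The lidar-to-image depth label generation described here can be sketched as follows (numpy/scipy assumed; K and T_cam_lidar stand for the camera intrinsics and the lidar-to-camera extrinsic calibration and are hypothetical inputs):

```python
import numpy as np
from scipy.interpolate import griddata

def lidar_to_depth_label(points_lidar, K, T_cam_lidar, width=640, height=320):
    """Project lidar points into the image and densify them into a per-pixel depth label.
    points_lidar: (N, 3) xyz in the lidar frame; K: 3x3 camera intrinsics;
    T_cam_lidar: 4x4 lidar-to-camera extrinsic transform (from system calibration)."""
    # Align the point cloud with the camera coordinate system.
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 0]                   # keep points in front of the camera

    # Pinhole projection into pixel coordinates.
    uv = (K @ pts_cam.T).T
    u, v = uv[:, 0] / uv[:, 2], uv[:, 1] / uv[:, 2]
    mask = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[mask], v[mask], pts_cam[mask, 2]

    # Densify by linear interpolation over the sparse samples
    # (a stand-in for the bilinear completion step described in the text).
    grid_u, grid_v = np.meshgrid(np.arange(width), np.arange(height))
    depth = griddata((u, v), z, (grid_u, grid_v), method="linear")
    return depth                                           # (height, width), NaN where not interpolatable
```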
As a preferable scheme of the above scheme, the image preprocessing in step S4 includes adaptive adjustment of the camera shutter, aperture and gain parameters, and extraction and scaling of the image Y channel.
As a preferred scheme of the above scheme, in the model output post-processing of step S5, the multi-channel confidences of each semantic class output by the scene semantic output branch are processed according to the following formula to obtain the cab scene semantic layer

Sem(u,v) = \begin{cases} \arg\max_{i} ch_{i}(u,v), & \text{if } \max_{i} ch_{i}(u,v) \geq T_{min} \\ 0, & \text{otherwise} \end{cases}

wherein ch_i(u, v) is the confidence output of channel i of the scene semantic network, and T_min is the minimum credibility threshold, a configurable parameter.
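A numpy sketch of this thresholded per-pixel class selection (the channel layout and the default threshold follow the embodiment described below):

```python
import numpy as np

def semantic_layer_from_confidences(conf, t_min=0.5):
    """conf: (H, W, C) per-class confidences from the scene semantic branch.
    Returns an (H, W) semantic layer; pixels whose best confidence falls below
    the minimum-credibility threshold t_min are assigned class 0 ("other")."""
    best_class = np.argmax(conf, axis=-1)
    best_conf = np.max(conf, axis=-1)
    return np.where(best_conf >= t_min, best_class, 0)
```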
As a preferable scheme of the above scheme, the output integration of the neural network model in step S6 includes the following steps:
s61: establishing a coordinate system of a cockpit;
s62: based on the neural network output result, clustering the same semantic individuals by using scene depth information;
s63: and constructing cab scene structured data for describing the analysis result of the cab scene network.
As a preferable scheme of the above scheme, the cab scene structured data includes the cab class, the number of seats and the attributes of each seat. The cab class is input through a configuration parameter interface.
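One possible container for the structured output described here is sketched below; the field names are illustrative and are not defined by the patent:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SeatInfo:
    seat_id: int                           # index within the cab coordinate system
    occupied: bool = False                 # whether a person was clustered onto this seat
    occupant_class: Optional[int] = None   # e.g. 2 = person, 5 = child seat (per the semantic definition)
    seatbelt_on: Optional[bool] = None

@dataclass
class CabSceneStructuredData:
    cab_class: str                         # passed in via the configuration parameter interface
    seat_count: int
    seats: List[SeatInfo] = field(default_factory=list)
```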
The invention has the advantage that, unlike existing vision systems that can only perceive driver behavior, it extracts cab scene structured data through joint analysis of cab semantic and depth information and can therefore perceive the whole cab, including the driver seat and the passenger seats, providing the prior information needed for analyzing the various behaviors and occupant attributes in the cab scene area.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
FIG. 2 is a schematic diagram of a topology of a neural network model according to the present invention.
FIG. 3 is a schematic flow chart of scene semantic output branch training according to the present invention.
FIG. 4 is a schematic flow chart of the joint training of the present invention.
FIG. 5 is a flow chart illustrating the integration of the output of the neural network model into scene structured data according to the present invention.
Fig. 6 is a schematic structural diagram of scene structured data according to the present invention.
Detailed Description
The technical solution of the present invention is further described below by way of examples with reference to the accompanying drawings.
Example 1:
In this embodiment, a cab scene semantic and visual depth combined analysis method, as shown in Fig. 1, includes two stages, namely an offline training stage and an online application stage. The offline training stage includes:
s1: establishing a neural network model comprising a scene semantic output branch and a scene visual depth output branch. As shown in Fig. 2, the neural network model comprises an input layer, a shared feature coding layer and two branch output layers, and the branch output layers comprise a semantic output layer and a visual depth output layer. In this embodiment the model input is a 640 x 320 single-channel infrared image. The output of the semantic output layer is processed into a 640 x 320 scene semantic layer containing foreground semantic outputs representing the people and objects in the cab (definition: 0 - other, 1 - seat, 2 - person, 3 - seat belt, 4 - steering wheel, 5 - child seat, 6 - driver seat). The output of the visual depth output layer is a normalized scene visual depth layer (actual measurement range 0-5 m, normalized output range 0-1, floating point), in which the intensity of each pixel represents the actual distance between the camera and the object or person corresponding to that pixel (a small sketch of the class table and depth normalization follows);
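A sketch of the class definition table and the linear depth normalization implied by the 0-5 m measurement range of this embodiment:

```python
# Semantic class definition of this embodiment.
SEMANTIC_CLASSES = {0: "other", 1: "seat", 2: "person", 3: "seat belt",
                    4: "steering wheel", 5: "child seat", 6: "driver seat"}

DEPTH_MAX_METERS = 5.0   # measurement range of this embodiment: 0-5 m

def normalize_depth(depth_m: float) -> float:
    """Map a measured depth in metres to the normalized [0, 1] visual depth layer value."""
    return min(max(depth_m / DEPTH_MAX_METERS, 0.0), 1.0)

def denormalize_depth(depth_norm: float) -> float:
    """Recover an approximate metric depth from a normalized layer value."""
    return depth_norm * DEPTH_MAX_METERS
```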
s2: training the scene semantic output branch, as shown in fig. 3, includes the following steps:
s21: cab scene data (approximately 100,000 images) with different viewing angles, vehicle types and illumination conditions are collected using a driver behavior analysis camera (horizontal field of view about 50 degrees) and a cab panoramic camera (horizontal field of view about 180 degrees);
s22: the foreground semantic outputs of the semantic layer (i.e. the semantic outputs of the non-0 categories) are labeled with polylines (multi-segment lines) to generate a training data set;
s23: the training data set is expanded on-line through random geometric and color transformations (a sketch of such augmentation follows this step list);
s24: randomly initializing the neural network model parameters, optimizing the pixel-level semantic loss function L_Sem by a batch stochastic gradient descent method, and updating the scene semantic output branch parameters with the semantic loss function L_Sem

L_{Sem} = -\frac{1}{W \cdot H}\sum_{u=1}^{W}\sum_{v=1}^{H} s_{u,v}\,\log(p_{u,v})

wherein [u, v] are the pixel coordinates of each point in the image coordinate system, [W, H] are the width and height of the input image, s_{u,v} is the semantic label at the corresponding coordinates, and p_{u,v} is the predicted value of the scene semantic output branch at the corresponding coordinates;
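A sketch of the on-line geometric and colour augmentation referred to in S23 (the transform set and ranges are illustrative; the semantic label map must receive the same geometric transform as the image):

```python
import numpy as np

def augment(image, label):
    """Random geometric and colour transforms applied on-line to an infrared
    training sample; image and label are (H, W) arrays."""
    # Random horizontal flip (geometric), applied to image and label alike.
    if np.random.rand() < 0.5:
        image, label = image[:, ::-1], label[:, ::-1]
    # Random brightness/contrast jitter (colour), applied to the image only.
    gain = np.random.uniform(0.8, 1.2)
    bias = np.random.uniform(-10, 10)
    image = np.clip(image.astype(np.float32) * gain + bias, 0, 255).astype(np.uint8)
    return image, label
```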
s3: the joint training is performed on the scene semantic output branch and the scene depth output branch, as shown in fig. 4, and includes the following steps:
s31: acquiring a joint training set. Specifically, the scene semantic output branch training data set in step S2 is annotated with lidar point cloud output data; the camera coordinate system and the point cloud coordinates are aligned according to the camera calibration parameters and the lidar system calibration parameters; the point cloud coordinates are then transformed into the image coordinate system using the pinhole imaging principle; and, because the lidar outputs a discrete point cloud, the parts of the visual depth map without valid data are completed by bilinear interpolation;
s32: freezing the shared feature layer parameters obtained from the scene semantic output branch training;
s33: the joint training set is expanded on-line through random geometric and color transformations;
s34: randomly initializing the neural network model parameters, optimizing the visual depth loss function L_Dep at a high learning rate by a batch stochastic gradient descent method, and updating the scene visual depth output branch parameters with the visual depth loss function L_Dep

L_{Dep} = \frac{1}{W \cdot H}\sum_{x=1}^{W}\sum_{y=1}^{H}\left(d(x,y) - d_{T}(x,y)\right)^{2}

wherein d(x, y) is the visual depth at each pixel (x, y) of the output depth map, d_T(x, y) is the visual depth label of the training sample, and W and H are the image width and height;
s35: reducing the training learning rate, unfreezing the shared feature layer parameters, and updating the parameters of each branch network, namely the scene semantic output branch parameters and the scene visual depth output branch parameters, according to the following combined semantic and visual depth loss function L

L = k_1 \cdot L_{Sem} + k_2 \cdot L_{Dep}

wherein k_1 and k_2 are configurable weight coefficients of the semantic and visual depth loss functions in the combined loss function, both 0.5 by default.
After offline training of the semantic output branch and the scene visual depth output branch of the neural network model is completed, the model is deployed: the trained model parameters are compressed by pruning (channel cutting and sparsification) and quantization (8-bit or 16-bit floating-point or fixed-point data types) and then deployed on a front-end embedded platform (as a data file and a configuration file). During forward operation, the model file is initialized through a predefined function interface.
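The quantization step can be illustrated schematically with a symmetric 8-bit weight quantization in numpy; the actual compression toolchain is not specified by the patent, so this is only a sketch of the idea:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor 8-bit quantization of a weight array.
    Returns the int8 tensor plus the scale needed to dequantize at inference time."""
    scale = max(np.abs(weights).max(), 1e-8) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```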
Entering an online application stage, wherein the online application comprises the following steps:
s4: image acquisition and preprocessing (a preprocessing sketch follows this step). The cab scene vision system operates under infrared fill light or natural illumination and mainly consists of a driver behavior analysis camera and a cab panoramic camera. Image preprocessing mainly includes adaptive adjustment of parameters such as the camera shutter, aperture and gain, and extraction and scaling of the image Y channel. The adaptive adjustment of the shutter, aperture and gain parameters can be completed by off-line image quality tuning; image Y-channel extraction (necessary preprocessing for infrared fill-light images, optional for input from a non-infrared camera) and scaling can be realized by writing the corresponding algorithm configuration parameters into an initialization function and reading them through the corresponding function interface. Based on this preprocessing, an image ROI (region of interest) is obtained and then input into the trained neural network module;
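A sketch of the Y-channel extraction and scaling described above (OpenCV assumed; the 640 x 320 input size follows this embodiment, while the ROI argument is a hypothetical parameter):

```python
import cv2
import numpy as np

def preprocess(frame_bgr, roi=None, size=(640, 320)):
    """Extract the Y (luma) channel, optionally crop a region of interest,
    and scale it to the network input resolution."""
    y = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YUV)[:, :, 0]    # Y channel only
    if roi is not None:                                         # roi = (x, y, w, h), hypothetical
        x, y0, w, h = roi
        y = y[y0:y0 + h, x:x + w]
    y = cv2.resize(y, size, interpolation=cv2.INTER_LINEAR)
    return y[np.newaxis, np.newaxis, :, :].astype(np.float32)   # (1, 1, 320, 640) network input
```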
s5: model output post-processing. The scene visual depth output branch directly outputs a 640 x 320 normalized scene depth layer. The raw output of the scene semantic output branch is a 640 x 320 x 7 set of per-class confidences; the multi-channel confidences of each semantic class are processed according to the following formula to obtain the cab scene semantic layer

Sem(u,v) = \begin{cases} \arg\max_{i} ch_{i}(u,v), & \text{if } \max_{i} ch_{i}(u,v) \geq T_{min} \\ 0, & \text{otherwise} \end{cases}

wherein ch_i(u, v) is the confidence output of channel i of the scene semantic network, i = 0, 1, 2, ..., 6, and T_min is the minimum credibility threshold, a configurable parameter with a default of 0.5.
S6: integrating the output of the neural network model into scene structured data, as shown in fig. 5, comprises the following steps:
s61: establishing a cab coordinate system, the origin of which may be placed at the driver seat;
s62: based on the neural network output result, clustering the same semantic individuals by using scene depth information;
s63: constructing the cab scene structured data describing the analysis result of the cab scene network, as shown in Fig. 6 (a clustering sketch follows this step). The cab category can be passed in through a configuration parameter interface and includes small passenger vehicles, heavy commercial trucks, heavy commercial buses and the like; for heavy commercial buses the structured data does not contain passenger seat information. The obtained cab scene structured data is the prior information necessary for analyzing various behaviors and occupant attributes in the cab scene area.
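A sketch of steps S62 and S63: pixels of the same semantic class are grouped into individuals by connected-component labelling, and the mean normalized depth of each component is recorded so that individuals of the same class can later be distinguished by distance when filling the structured data (scipy assumed; field names are illustrative):

```python
import numpy as np
from scipy import ndimage

def cluster_semantic_individuals(sem_layer, depth_layer, class_id):
    """Group pixels of one semantic class into individual instances via
    connected-component labelling, attaching a mean normalized depth to each."""
    components, n = ndimage.label(sem_layer == class_id)
    instances = []
    for i in range(1, n + 1):
        mask = components == i
        instances.append({
            "class_id": class_id,
            "pixel_count": int(mask.sum()),
            "mean_depth": float(depth_layer[mask].mean()),   # normalized depth in [0, 1]
        })
    return instances

# Example: collect person instances (class 2) and seat instances (class 1)
# before filling the cab scene structured data for each seat.
```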
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (10)

1. A cab scene semantic and visual depth combined analysis method, characterized by comprising the following steps:
s1: establishing a neural network model comprising a scene semantic output branch and a scene visual depth output branch;
s2: training the scene semantic output branch;
s3: performing joint training on the scene semantic output branch and the scene visual depth output branch;
s4: collecting and preprocessing an image;
s5: model output post-processing;
s6: the outputs of the neural network model are integrated into scene structured data.
2. The method for jointly analyzing the semantic meaning and the visual depth of the cab scene as claimed in claim 1, wherein the method comprises the following steps: the neural network model comprises an input layer, a shared feature coding layer and two branch output layers, wherein the branch output layers comprise a semantic output layer and an apparent depth output layer.
3. The method for jointly analyzing the semantic meaning and the visual depth of the cab scene as claimed in claim 2, wherein the method comprises the following steps: the input of the neural network model is a single-channel infrared image, the output of the semantic output layer is a scene semantic layer containing foreground semantic output representing people and objects in a cab, the output of the visual depth output layer is a normalized scene visual depth layer, and the brightness of each pixel point in the normalized scene visual depth layer represents the distance between the object corresponding to the pixel point and the people and the camera in practice.
4. The method for jointly analyzing the semantic meaning and the visual depth of the cab scene as claimed in claim 1, wherein the method comprises the following steps: the scene semantic output branch training in the step S2 includes the following steps:
s21: collecting cab scene data of different visual angles, vehicle types and illumination conditions;
s22: labeling various foreground semantic outputs of the semantic layer to generate a training data set;
s23: expanding the training data set;
s24: randomly initializing the neural network model parameters, optimizing the pixel-level semantic loss function L_Sem by a batch stochastic gradient descent method, and updating the scene semantic output branch parameters

L_{Sem} = -\frac{1}{W \cdot H}\sum_{u=1}^{W}\sum_{v=1}^{H} s_{u,v}\,\log(p_{u,v})

wherein [u, v] are the pixel coordinates of each point in the image coordinate system, [W, H] are the width and height of the input image, s_{u,v} is the semantic label at the corresponding coordinates, and p_{u,v} is the predicted value of the scene semantic output branch at the corresponding coordinates.
5. The method for jointly analyzing the semantic meaning and the visual depth of the cab scene as claimed in claim 1, wherein the method comprises the following steps: in step S3, the joint training of the scene semantic output branch and the scene depth output branch includes the following steps:
s31: acquiring a joint training set;
s32: freezing the shared feature layer parameters obtained from the scene semantic output branch training;
s33: expanding the joint training set;
s34: randomly initializing the neural network model parameters, optimizing the visual depth loss function L_Dep at a high learning rate by a batch stochastic gradient descent method, and updating the scene visual depth output branch parameters

L_{Dep} = \frac{1}{W \cdot H}\sum_{x=1}^{W}\sum_{y=1}^{H}\left(d(x,y) - d_{T}(x,y)\right)^{2}

wherein d(x, y) is the visual depth at each pixel (x, y) of the output depth map, d_T(x, y) is the visual depth label of the training sample, and W and H are the image width and height;
s35: reducing the training learning rate, unfreezing the shared feature layer parameters, and updating the parameters of each branch network according to the following combined semantic and visual depth loss function L

L = k_1 \cdot L_{Sem} + k_2 \cdot L_{Dep}

wherein k_1 and k_2 are configurable weight coefficients of the semantic and visual depth loss functions in the combined loss function.
6. The method for jointly analyzing the semantic meaning and the visual depth of the cab scene as claimed in claim 5, wherein the method comprises the following steps: the joint training set in the S31 is obtained by the following steps: calibrating the scene semantic output branch training data set in the step S2 by using the laser radar point cloud output data, aligning a camera coordinate system and point cloud data coordinates according to camera calibration parameters and laser radar system calibration parameters, transforming the point cloud data coordinates into an image coordinate system by using a pinhole imaging principle, and completing the part of the depth-of-view map without effective data by using a bilinear interpolation method.
7. The method for jointly analyzing the semantic meaning and the visual depth of the cab scene as claimed in claim 1, wherein the method comprises the following steps: the image preprocessing in step S4 includes adaptive adjustment of camera shutter, aperture and gain parameters, and image Y-channel clipping and scaling.
8. The method for jointly analyzing the semantic meaning and the visual depth of the cab scene as claimed in claim 1, wherein: in the model output post-processing of step S5, the multi-channel confidences of each semantic class output by the scene semantic output branch are processed according to the following formula to obtain the cab scene semantic layer

Sem(u,v) = \begin{cases} \arg\max_{i} ch_{i}(u,v), & \text{if } \max_{i} ch_{i}(u,v) \geq T_{min} \\ 0, & \text{otherwise} \end{cases}

wherein ch_i(u, v) is the confidence output of channel i of the scene semantic network, and T_min is the minimum credibility threshold, a configurable parameter.
9. The method for jointly analyzing the semantic meaning and the visual depth of the cab scene as claimed in claim 1, wherein the method comprises the following steps: the output integration of the neural network model in step S6 includes the following steps:
s61: establishing a coordinate system of a cockpit;
s62: based on the neural network output result, clustering the same semantic individuals by using scene depth information;
s63: and constructing cab scene structured data for describing the analysis result of the cab scene network.
10. The method for jointly analyzing the semantic meaning and the visual depth of the cab scene as claimed in claim 1 or 9, wherein: the cab scene structured data comprises the cab category, the number of seats and the attributes of each seat.
Application CN201910734881.2A, "Cab scene semantic and visual depth combined analysis method", filed 2019-08-09 (priority date 2019-08-09), published as CN110781717A on 2020-02-11. Family ID: 69383990. Legal status: Pending.


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
CB02: Change of applicant information. Applicant changed from ZHEJIANG LEAPMOTOR TECHNOLOGY Co.,Ltd. to Zhejiang Zero run Technology Co.,Ltd.; address (before and after): 1st and 6th floors, no.451 Internet of things street, Binjiang District, Hangzhou City, Zhejiang Province, 310051.
RJ01: Rejection of invention patent application after publication (application publication date: 2020-02-11)