CN112668473A - Vehicle state accurate sensing method based on multi-feature deep fusion neural network - Google Patents

Vehicle state accurate sensing method based on multi-feature deep fusion neural network

Info

Publication number
CN112668473A
Authority
CN
China
Prior art keywords
vehicle
network
layer
neural network
convolutional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011583142.7A
Other languages
Chinese (zh)
Other versions
CN112668473B (en)
Inventor
徐启敏
常彬
李旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202011583142.7A priority Critical patent/CN112668473B/en
Publication of CN112668473A publication Critical patent/CN112668473A/en
Application granted granted Critical
Publication of CN112668473B publication Critical patent/CN112668473B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an accurate vehicle state perception method based on a multi-feature deep fusion neural network. The method first establishes a parallel deep convolutional-recurrent neural network architecture, using parallel convolutional neural networks to extract the rotation and translation geometric features in the input tensor and a recurrent neural network to learn the time-correlation characteristics of the motion state among the extracted features; network parameters are then optimized with a weight-balanced mean square error iterative method; finally, the trained network estimates the three-dimensional speed and three-dimensional angular velocity of the vehicle. The method uses only a monocular camera and is therefore low-cost; the designed network generalizes well and perceives the vehicle state parameters accurately.

Description

Vehicle state accurate sensing method based on multi-feature deep fusion neural network
Technical Field
The invention relates to a monocular camera-based vehicle state sensing method, in particular to a multi-feature deep fusion neural network-based vehicle state accurate sensing method, and belongs to the field of vehicle state sensing.
Background
The automobile industry holds a strategic, pillar position in national economic development, and China's automobile industry now ranks first in the world in scale, making China a genuine automotive power. With the arrival of a new round of technological revolution, the intelligent development of automobiles has become an inevitable trend in the automobile industry. The development of intelligent vehicles can effectively alleviate problems such as traffic safety, road congestion, energy consumption and environmental pollution, and is of great significance for meeting people's growing demand for a better life. Accurate acquisition of vehicle state information plays a fundamental and critical role in the complex environment perception, intelligent decision and control, information security, and test and evaluation of intelligent vehicles.
At present, published methods for vehicle state perception mainly involve the following two aspects:
First, inertial sensors are used to directly measure vehicle state information, including angular velocity, attitude angle, speed and acceleration. However, the measurement accuracy is limited by sensor performance: high-accuracy inertial sensors are expensive and cannot be popularized on a large scale, while low-accuracy inertial sensors have large errors and particularly severe error accumulation, which degrades the results.
Second, vision sensors are used for vehicle state perception. Simultaneous localization and mapping (SLAM) technology from the robotics field can estimate the motion state of the carrier, and with the popularization of vehicle-mounted cameras this technology has been transplanted to the vehicle domain.
The rapid development and wide application of deep learning provide a new approach to vehicle state perception. Addressing the shortcomings of traditional vehicle state perception methods, the invention discloses a deep-learning-based vehicle state information perception method. The method adopts an end-to-end neural network structure that senses vehicle state parameters directly from the raw RGB images acquired by a monocular camera: the input is the raw image, and the final output is the vehicle state (vehicle speed and angular velocity). The invention designs a multi-feature deep fusion neural network architecture, which extracts rotation and translation geometric features in the image with parallel convolutional neural networks and learns the dynamic relations and time-correlation characteristics among the extracted features with a recurrent neural network. By fully exploiting the effective information in the image, the method can perceive the multidimensional state of the vehicle, namely its three-dimensional speed and three-dimensional angular velocity, overcoming the low perception accuracy of traditional computer vision methods while retaining good generalization.
Disclosure of Invention
The invention aims to provide a vehicle state accurate sensing method based on a multi-feature deep fusion neural network.
The technical scheme adopted by the invention is as follows: an accurate vehicle state perception method based on a multi-feature deep fusion neural network, characterized in that parallel convolutional neural networks are first used to extract the rotation and translation geometric features in the input tensor, a recurrent neural network then learns the dynamic relations and time-correlation characteristics among the extracted features, and finally the three-dimensional speed of the vehicle (longitudinal speed v_x, lateral speed v_y, vertical speed v_z) and its three-dimensional angular velocity (roll angular velocity ω_x, pitch angular velocity ω_y, yaw angular velocity ω_z) are estimated.
The method specifically comprises the following steps:
the method comprises the following steps: determining network input and output, and establishing training data set
The change of vehicle state information is continuous, so the input to the deep neural network for vehicle state perception is a sequence of images captured by a camera. To ensure that the features in the images are sufficient and recognizable, a monocular camera is used to acquire image sequences of the vehicle in motion in a non-open environment (such as urban roads and mountain roads), and the RGB image sequence acquired by the monocular camera is then annotated, i.e., labeled with the vehicle state information corresponding to each image acquisition time. Existing data samples, such as the KITTI data set, may also be used with the invention.
To simplify the data fed to the network while ensuring that key feature information in the images is not lost, the average RGB values of the training set are subtracted from each frame, and the image dimensions are then resized to multiples of 64 to fit the network structure. The images of the training set are uniformly resized to 1280 × 384 × 3, and two consecutive frames are stacked into a tensor that is fed to the deep neural network, so the input dimension is 1280 × 384 × 6. This preprocessing of the image sequence reduces the amount of data the network must process without losing effective information, thereby saving training time. The output of the network is the three-dimensional speed and three-dimensional angular velocity information of the vehicle to be perceived. The input and output quantities at the same time are paired to form a training data set, denoted D_T.
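As a concrete illustration of this preprocessing (a minimal sketch, not part of the original disclosure; the function name preprocess_pair, the use of OpenCV for resizing, and the height-width-channel axis order are assumptions), the 1280 × 384 × 6 input tensor could be built from two consecutive RGB frames as follows:

```python
import numpy as np
import cv2  # OpenCV assumed for resizing

def preprocess_pair(frame_t, frame_t1, mean_rgb):
    """Build one network input tensor from two consecutive RGB frames.

    frame_t, frame_t1 : H x W x 3 uint8 RGB images at times t and t+1
    mean_rgb          : per-channel mean RGB of the training set, shape (3,)
    """
    def prep(img):
        img = img.astype(np.float32) - mean_rgb   # subtract the training-set average RGB values
        return cv2.resize(img, (1280, 384))       # resize to width 1280, height 384 (multiples of 64)
    # Stack the two 384 x 1280 x 3 frames along the channel axis -> 384 x 1280 x 6
    return np.concatenate([prep(frame_t), prep(frame_t1)], axis=-1)
```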
Step two: designing parallel deep convolutional-recurrent neural network architecture
The convolutional neural network has strong image feature extraction capability, can directly take image pixel information as input, and performs high-level feature abstraction through convolution operation. The recurrent neural network has good processing and prediction capabilities on sequence data, including time series and space series. In the invention, a convolutional neural network is required to be used for image feature extraction, and a recurrent neural network is also required to process the time-dependent characteristic of the vehicle motion state, so that a parallel deep convolutional-recurrent neural network architecture is designed for perceiving the multidimensional state information of the vehicle.
The motion of the vehicle in space can be decomposed into translation and rotation, so that the framework comprises a rotation sensitive module and a translation sensitive module which are parallel. For each frame of input tensor, the rotation sensitive module extracts rotation characteristic information of the vehicle, and the translation sensitive module extracts translation characteristic information of the vehicle. The characteristics learned by the deep convolutional neural network can compress the original high-dimensional RGB image into a compact information description, and the efficiency of continuous sequence training is improved. And then, the feature information extracted by the parallel deep convolution is sent to a recurrent neural network module to learn the time correlation features among the continuous frame features so as to accurately sense the vehicle state information. The parallel deep convolution-recurrent neural network architecture has the advantage that the learning of various vehicle motion state characteristics can be simultaneously carried out through the combination of the convolutional neural network and the recurrent neural network. The method specifically comprises the following substeps:
substep 1: translation-sensitive module designed for extracting vehicle translation characteristic information
In order to accurately extract the vehicle translation features in the input tensor, a translation-sensitive module is designed. It consists of 6 convolutional layers with gradually decreasing receptive fields; the receptive field size of each layer is F_T × F_T, where F_T takes the values 7, 5, 5, 3, 3, 3. To adapt each convolutional layer to its receptive field configuration while preserving the spatial dimensions of the post-convolution tensor and reducing feature loss, zero padding of size P_T is introduced for each convolutional layer, with per-layer values 3, 2, 2, 1, 1, 1. As the number of convolutional layers increases, the number of filters used for feature extraction in each layer, i.e., the number of channels, also increases; the corresponding channel numbers C_T are 64, 128, 256, 512, 512, 1024. Each convolutional layer uses a rectified linear unit as its activation function, whose non-saturating form alleviates the vanishing-gradient problem caused by stacking many convolutional layers. The specific structure of the translation-sensitive module is as follows (a code sketch is given after the layer list):
Convolutional layer 1_1: convolve the 1280 × 384 × 6 input tensor with a 7 × 7 receptive field, stride 2 and zero padding 3, then activate with a rectified linear unit to obtain a feature map of dimension 640 × 192 × 64;
Convolutional layer 1_2: convolve the feature map output by convolutional layer 1_1 with a 5 × 5 receptive field, stride 2 and zero padding 2, then activate with a rectified linear unit to obtain a feature map of dimension 320 × 96 × 128;
Convolutional layer 1_3: convolve the feature map output by convolutional layer 1_2 with a 5 × 5 receptive field, stride 2 and zero padding 2, then activate with a rectified linear unit to obtain a feature map of dimension 160 × 48 × 256;
Convolutional layer 1_4: convolve the feature map output by convolutional layer 1_3 with a 3 × 3 receptive field, stride 2 and zero padding 1, then activate with a rectified linear unit to obtain a feature map of dimension 80 × 24 × 512;
Convolutional layer 1_5: convolve the feature map output by convolutional layer 1_4 with a 3 × 3 receptive field, stride 2 and zero padding 1, then activate with a rectified linear unit to obtain a feature map of dimension 40 × 12 × 512;
Convolutional layer 1_6: convolve the feature map output by convolutional layer 1_5 with a 3 × 3 receptive field, stride 2 and zero padding 1 to obtain a feature map of dimension 20 × 6 × 1024.
Substep 2: rotation-sensitive module designed for extracting vehicle rotation characteristic information
In order to accurately extract the vehicle rotation features in the input tensor, a rotation-sensitive module is designed. It comprises 5 convolutional layers; the receptive field size of each layer is n_f × n_f, where n_f takes the values 7, 5, 3, 3, 3. Zero padding is likewise used so that each convolutional layer adapts to its receptive field configuration. The channel numbers C_R of the convolutional layers are 64, 128, 256, 512 and 1024, and rectified linear units are used as activation functions. To better extract image rotation features, a max pooling operation is adopted; the number of pooling layers n_P is 1, with a 2 × 2 kernel and stride 2. The specific structure of the rotation-sensitive module is as follows (a code sketch is given after the layer list):
Convolutional layer 2_1: convolve the 1280 × 384 × 6 input tensor with a 7 × 7 receptive field, stride 4, with 64 channels, then activate with a rectified linear unit to obtain a feature map of dimension 320 × 96 × 64;
Convolutional layer 2_2: convolve the feature map output by convolutional layer 2_1 with a 5 × 5 receptive field, stride 2, with 128 channels, then activate with a rectified linear unit to obtain a feature map of dimension 160 × 48 × 128;
Pooling layer: apply 2 × 2 max pooling to the feature map output by convolutional layer 2_2 to obtain a feature map of dimension 160 × 48 × 128;
Convolutional layer 2_3: convolve the feature map output by the pooling layer with a 3 × 3 receptive field, stride 2, with 256 channels, then activate with a rectified linear unit to obtain a feature map of dimension 80 × 24 × 256;
Convolutional layer 2_4: convolve the feature map output by convolutional layer 2_3 with a 3 × 3 receptive field, stride 2, with 512 channels, then activate with a rectified linear unit to obtain a feature map of dimension 40 × 12 × 512;
Convolutional layer 2_5: convolve the feature map output by convolutional layer 2_4 with a 3 × 3 receptive field, stride 2, with 1024 channels, then activate with a rectified linear unit to obtain a feature map of dimension 20 × 6 × 1024.
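A matching sketch of the rotation-sensitive module is given below, reusing the conv_relu helper from the previous sketch (again an assumption, not the patent's code). Note that a 2 × 2 max pooling with stride 1 is used here so that the stated downstream feature-map sizes (80 × 24, 40 × 12, 20 × 6) are reproduced; a stride-2 pooling would halve the map and the final 20 × 6 × 1024 dimension would not be obtained:

```python
import torch.nn as nn

class RotationBranch(nn.Module):
    """Rotation-sensitive module: 5 convolutional layers plus one max-pooling layer."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            conv_relu(6,    64,  7, 4, 3),             # conv2_1: 320 x 96 x 64 (stride 4)
            conv_relu(64,  128,  5, 2, 2),             # conv2_2: 160 x 48 x 128
            nn.MaxPool2d(kernel_size=2, stride=1),     # pooling layer (stride 1 assumed, see note above)
            conv_relu(128, 256,  3, 2, 1),             # conv2_3: 80 x 24 x 256
            conv_relu(256, 512,  3, 2, 1),             # conv2_4: 40 x 12 x 512
            conv_relu(512, 1024, 3, 2, 1),             # conv2_5: 20 x 6 x 1024
        )

    def forward(self, x):        # x: (batch, 6, 384, 1280), channels first
        return self.net(x)
```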
Substep 3: recurrent neural network module designed for extracting vehicle motion state time-related features
The motion process of the vehicle has time-dependent characteristics in addition to spatial translation and rotation. A recurrent neural network is well suited to processing serialized data but not high-dimensional raw data such as images, so the invention takes the features extracted by the parallel deep convolutional network (i.e., the combination of the features extracted by the translation-sensitive and rotation-sensitive modules) as the input of the recurrent neural network. Because long short-term memory networks can model deep temporal and dynamic relations, the recurrent neural network in the invention is constructed by cascading two long short-term memory layers, each with 1000 hidden states. From the per-frame tensor features extracted by the parallel deep convolutional network, the recurrent neural network estimates the 6-dimensional vehicle state, namely the three-dimensional speed (v_x, v_y, v_z) and the three-dimensional angular velocity (ω_x, ω_y, ω_z) of the vehicle. The specific structure of the recurrent neural network module is as follows (a sketch of the complete network follows the layer list):
Long short-term memory layer 1: using its 1000 hidden states, learn from the feature map output by the parallel deep convolutional network and the vehicle state output at the previous frame to obtain the state output for the current frame;
Long short-term memory layer 2: using its 1000 hidden states, learn from the state output by long short-term memory layer 1 and the vehicle state output at the previous frame, and output the 6-dimensional vehicle state information.
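The two convolutional branches and the recurrent module can be assembled as sketched below (a hedged illustration, not the patent's code; the channel-wise concatenation of the two 20 × 6 × 1024 maps and the use of the LSTM hidden state to carry the previous-frame context are assumptions, and the class name VehicleStateNet is hypothetical):

```python
import torch
import torch.nn as nn

class VehicleStateNet(nn.Module):
    """Parallel deep convolutional-recurrent network: per frame pair, the translation and
    rotation branches each produce a 20 x 6 x 1024 map; the maps are concatenated, flattened,
    and fed to a two-layer LSTM with 1000 hidden states, followed by a linear head that
    outputs the 6-dimensional state (v_x, v_y, v_z, w_x, w_y, w_z)."""

    def __init__(self):
        super().__init__()
        self.trans = TranslationBranch()              # from the earlier sketches
        self.rot = RotationBranch()
        feat_dim = 2 * 1024 * 6 * 20                  # concatenated and flattened CNN features
        self.rnn = nn.LSTM(input_size=feat_dim, hidden_size=1000,
                           num_layers=2, batch_first=True)
        self.head = nn.Linear(1000, 6)

    def forward(self, x):                             # x: (batch, seq_len, 6, 384, 1280)
        b, s = x.shape[:2]
        frames = x.reshape(b * s, *x.shape[2:])
        feats = torch.cat([self.trans(frames), self.rot(frames)], dim=1)  # (b*s, 2048, 6, 20)
        feats = feats.flatten(1).reshape(b, s, -1)    # one feature vector per frame pair
        out, _ = self.rnn(feats)                      # hidden state carries previous-frame context
        return self.head(out)                         # (batch, seq_len, 6)
```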
Step three: optimization of network parameters using weight balance based mean square error iterative method
The optimization objective of the parallel deep convolutional-recurrent neural network is the probability of the vehicle state parameters up to time t conditioned on the image sequence, i.e.
p(Y_t | X_t) = p(y_1, ..., y_t | x_1, ..., x_t)    (1)
In equation (1), Y_t = (y_1, ..., y_t) denotes all vehicle state parameters up to a given time t, and X_t = (x_1, ..., x_t) denotes the image sequence up to time t.
In order to obtain an optimal estimate of the vehicle state parameters, the conditional probability in equation (1) should be maximized; introducing the network hyper-parameters θ, the optimum is
θ* = argmax_θ p(Y_t | X_t; θ)    (2)
The true value of the vehicle state at time k is Y_k = [v_k, ω_k], and the estimate computed by the network is Ŷ_k = [v̂_k, ω̂_k].
The hyper-parameters θ are updated using the weight-balanced mean square error of the vehicle state over all time steps, so that θ* is obtained and the estimated outputs come as close as possible to the true values; the process is
θ* = argmin_θ (1/N) Σ_{k=1}^{N} ( ρ_1‖v̂_k − v_k‖² + ρ_2‖ω̂_k − ω_k‖² )    (3)
In equation (3), N is the number of time steps, v_k = [v_x v_y v_z]_k is the three-dimensional speed of the vehicle at time k, and ω_k = [ω_x ω_y ω_z]_k is the three-dimensional angular velocity of the vehicle at time k; ‖·‖ is the two-norm of the vehicle state parameter; ρ_1 and ρ_2 are scale factors that balance the weights of the speed and angular velocity state quantities, and these parameters need to be tuned manually according to the training effect during network training.
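A minimal sketch of this weight-balanced mean square error is given below (assuming the reconstruction of equation (3) above; the default values of rho1 and rho2 are placeholders, since the patent tunes them manually during training):

```python
import torch

def weighted_state_loss(pred, target, rho1=1.0, rho2=100.0):
    """Weight-balanced MSE of equation (3).

    pred, target : (batch, seq_len, 6) tensors; columns 0-2 are v_x, v_y, v_z,
                   columns 3-5 are the angular velocities w_x, w_y, w_z
    rho1, rho2   : scale factors balancing the speed and angular-velocity terms
    """
    v_err = (pred[..., :3] - target[..., :3]).pow(2).sum(dim=-1)  # ||v_hat - v||^2 per time step
    w_err = (pred[..., 3:] - target[..., 3:]).pow(2).sum(dim=-1)  # ||w_hat - w||^2 per time step
    return (rho1 * v_err + rho2 * w_err).mean()                   # average over all time steps
```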
With the above network parameter optimization algorithm, the designed parallel deep convolutional-recurrent neural network is trained using samples from the training data set. To improve the accuracy of the training result and the generalization of the network, the network is pre-trained before formal training, and the parameters obtained from pre-training are then fine-tuned, in the following sub-steps (a training-loop sketch is given after sub-step 2):
substep 1: selecting an image sequence dataset to pre-train a network
Select vehicle-motion image sequence data acquired by a monocular camera with a small sample size, or the KITTI data set, adjust the images according to the method in step one, and denote the processed data set D_p. Then use D_p for network pre-training, with the maximum number of iterations set to I_p, the learning rate to α_p, the weight to λ_p, and the scale factors to ρ_p1 and ρ_p2, and save the network parameters obtained from pre-training;
substep 2: fine tuning of network parameters using established training data sets
Using the data set D_T established in step one, fine-tune the network parameters obtained from pre-training in sub-step 1 of step three, with the maximum number of iterations set to I_T, the learning rate to α_T, the weight to λ_T, and the scale factors to ρ_T1 and ρ_T2; then adjust the network parameters according to the evolution of the training and validation loss curves until the network parameters are optimal, i.e., the optimum of equation (3) is reached.
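The pre-training and fine-tuning procedure could look like the following sketch (hypothetical: the Adam optimizer, the interpretation of the weight λ as a weight-decay coefficient, and names such as train_stage, loader_Dp and loader_DT are assumptions; the commented hyper-parameter values I_p, α_p, etc. are placeholders to be chosen as described above; weighted_state_loss and VehicleStateNet are from the earlier sketches):

```python
import torch

def train_stage(model, loader, epochs, lr, weight_decay, rho1, rho2, device="cuda"):
    """One training stage, used for both pre-training on D_p and fine-tuning on D_T."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    model.train()
    for _ in range(epochs):                          # maximum number of iterations (I_p or I_T)
        for seq, state in loader:                    # seq: (b, s, 6, 384, 1280), state: (b, s, 6)
            seq, state = seq.to(device), state.to(device)
            loss = weighted_state_loss(model(seq), state, rho1, rho2)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Sketch of the two stages:
# model = VehicleStateNet().to("cuda")
# train_stage(model, loader_Dp, epochs=I_p, lr=alpha_p, weight_decay=lambda_p, rho1=rho_p1, rho2=rho_p2)
# torch.save(model.state_dict(), "pretrained.pt")    # save the pre-trained parameters
# train_stage(model, loader_DT, epochs=I_T, lr=alpha_T, weight_decay=lambda_T, rho1=rho_T1, rho2=rho_T2)
```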
Step four: vehicle state parameter sensing using trained networks
The vehicle-motion image sequence acquired by the monocular camera is preprocessed according to the method in step one and fed as input to the trained parallel deep convolutional-recurrent neural network, which outputs the 6-dimensional vehicle state information, including the three-dimensional speed and three-dimensional angular velocity of the vehicle.
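At inference time, the trained network can be applied to a preprocessed sequence roughly as follows (a sketch; the function name perceive_vehicle_state is hypothetical):

```python
import torch

@torch.no_grad()
def perceive_vehicle_state(model, image_pairs, device="cuda"):
    """Run the trained network on a preprocessed image-pair sequence.

    image_pairs : (1, seq_len, 6, 384, 1280) tensor built with the step-one preprocessing
    returns     : (seq_len, 6) array of [v_x, v_y, v_z, w_x, w_y, w_z] per frame pair
    """
    model.eval()
    states = model(image_pairs.to(device))   # (1, seq_len, 6)
    return states.squeeze(0).cpu().numpy()
```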
The advantages and notable effects of the invention are as follows:
The proposed vehicle state sensing method uses only a monocular camera and therefore has low cost; the proposed parallel deep convolutional-recurrent neural network architecture fully accounts for the various characteristics of vehicle motion, including the translational motion features, rotational motion features and motion-state time-correlation features of the vehicle, and the estimated vehicle state information is multidimensional, generalizes well, and is highly accurate.
Drawings
FIG. 1 is a flow chart of a vehicle state accurate perception method based on a multi-feature deep fusion neural network;
FIG. 2 is a diagram of a parallel deep convolution-recurrent neural network architecture;
fig. 3 is a flow chart of iterative optimization of network parameters.
Detailed Description
The automobile industry holds a strategic, pillar position in national economic development, and China's automobile industry now ranks first in the world in scale, making China a genuine automotive power. With the arrival of a new round of technological revolution, the intelligent development of automobiles has become an inevitable trend in the automobile industry. The development of intelligent vehicles can effectively alleviate problems such as traffic safety, road congestion, energy consumption and environmental pollution, and is of great significance for meeting people's growing demand for a better life. Accurate acquisition of vehicle state information plays a fundamental and critical role in the complex environment perception, intelligent decision and control, information security, and test and evaluation of intelligent vehicles.
At present, published methods for vehicle state perception mainly involve the following two aspects:
First, inertial sensors are used to directly measure vehicle state information, including angular velocity, attitude angle, speed and acceleration. However, the measurement accuracy is limited by sensor performance: high-accuracy inertial sensors are expensive and cannot be popularized on a large scale, while low-accuracy inertial sensors have large errors and particularly severe error accumulation, which degrades the results.
Second, vision sensors are used for vehicle state perception. Simultaneous localization and mapping (SLAM) technology from the robotics field can estimate the motion state of the carrier, and with the popularization of vehicle-mounted cameras this technology has been transplanted to the vehicle domain.
The rapid development and wide application of deep learning provide a new approach to vehicle state perception. Addressing the shortcomings of traditional vehicle state perception methods, the invention discloses a deep-learning-based vehicle state information perception method. The method adopts an end-to-end neural network structure that senses vehicle state parameters directly from the raw RGB images acquired by a monocular camera: the input is the raw image, and the final output is the vehicle state (vehicle speed and angular velocity). The invention designs a multi-feature deep fusion neural network architecture, which extracts rotation and translation geometric features in the image with parallel convolutional neural networks and learns the dynamic relations and time-correlation characteristics among the extracted features with a recurrent neural network. By fully exploiting the effective information in the image, the method can perceive the multidimensional state of the vehicle, namely its three-dimensional speed and three-dimensional angular velocity, overcoming the low perception accuracy of traditional computer vision methods while retaining good generalization.
The invention aims to provide a vehicle state accurate sensing method based on a multi-feature deep fusion neural network.
The technical scheme adopted by the invention is as follows: an accurate vehicle state perception method based on a multi-feature deep fusion neural network, characterized in that parallel convolutional neural networks are first used to extract the rotation and translation geometric features in the input tensor, a recurrent neural network then learns the dynamic relations and time-correlation characteristics among the extracted features, and finally the three-dimensional speed of the vehicle (longitudinal speed v_x, lateral speed v_y, vertical speed v_z) and its three-dimensional angular velocity (roll angular velocity ω_x, pitch angular velocity ω_y, yaw angular velocity ω_z) are estimated. The flow of the method is shown in fig. 1 and specifically comprises the following steps:
the method comprises the following steps: determining network input and output, and establishing training data set
The change of vehicle state information is continuous, so the input to the deep neural network for vehicle state perception is a sequence of images captured by a camera. To ensure that the features in the images are sufficient and recognizable, a monocular camera is used to acquire image sequences of the vehicle in motion in a non-open environment (such as urban roads and mountain roads), and the RGB image sequence acquired by the monocular camera is then annotated, i.e., labeled with the vehicle state information corresponding to each image acquisition time. Existing data samples, such as the KITTI data set, may also be used with the invention.
To simplify the data fed to the network while ensuring that key feature information in the images is not lost, the average RGB values of the training set are subtracted from each frame, and the image dimensions are then resized to multiples of 64 to fit the network structure. The images of the training set are uniformly resized to 1280 × 384 × 3, and two consecutive frames are stacked into a tensor that is fed to the deep neural network, so the input dimension is 1280 × 384 × 6. This preprocessing of the image sequence reduces the amount of data the network must process without losing effective information, thereby saving training time. The output of the network is the three-dimensional speed and three-dimensional angular velocity information of the vehicle to be perceived. The input and output quantities at the same time are paired to form a training data set, denoted D_T.
Step two: designing parallel deep convolutional-recurrent neural network architecture
The convolutional neural network has strong image feature extraction capability, can directly take image pixel information as input, and performs high-level feature abstraction through convolution operation. The recurrent neural network has good processing and prediction capabilities on sequence data, including time series and space series. In the invention, a convolutional neural network is required to be used for image feature extraction, and a recurrent neural network is also required to process the time-dependent characteristic of the vehicle motion state, so that a parallel deep convolutional-recurrent neural network architecture is designed for perceiving the multidimensional state information of the vehicle.
The motion of the vehicle in space can be decomposed into translation and rotation, so that the framework comprises a rotation sensitive module and a translation sensitive module which are parallel. For each frame of input tensor, the rotation sensitive module extracts rotation characteristic information of the vehicle, and the translation sensitive module extracts translation characteristic information of the vehicle. The characteristics learned by the deep convolutional neural network can compress the original high-dimensional RGB image into a compact information description, and the efficiency of continuous sequence training is improved. And then, the feature information extracted by the parallel deep convolution is sent to a recurrent neural network module to learn the time correlation features among the continuous frame features so as to accurately sense the vehicle state information. The parallel deep convolution-recurrent neural network architecture has the advantage that the learning of various vehicle motion state characteristics can be simultaneously carried out through the combination of the convolutional neural network and the recurrent neural network. The designed network architecture is shown in fig. 2, and specifically includes the following sub-steps:
substep 1: translation-sensitive module designed for extracting vehicle translation characteristic information
In order to accurately extract the vehicle translation features in the input tensor, a translation-sensitive module is designed. It consists of 6 convolutional layers with gradually decreasing receptive fields; the receptive field size of each layer is F_T × F_T, where F_T takes the values 7, 5, 5, 3, 3, 3. To adapt each convolutional layer to its receptive field configuration while preserving the spatial dimensions of the post-convolution tensor and reducing feature loss, zero padding of size P_T is introduced for each convolutional layer, with per-layer values 3, 2, 2, 1, 1, 1. As the number of convolutional layers increases, the number of filters used for feature extraction in each layer, i.e., the number of channels, also increases; the corresponding channel numbers C_T are 64, 128, 256, 512, 512, 1024. Each convolutional layer uses a rectified linear unit as its activation function, whose non-saturating form alleviates the vanishing-gradient problem caused by stacking many convolutional layers. The specific structure of the translation-sensitive module is as follows:
Convolutional layer 1_1: convolve the 1280 × 384 × 6 input tensor with a 7 × 7 receptive field, stride 2 and zero padding 3, then activate with a rectified linear unit to obtain a feature map of dimension 640 × 192 × 64;
Convolutional layer 1_2: convolve the feature map output by convolutional layer 1_1 with a 5 × 5 receptive field, stride 2 and zero padding 2, then activate with a rectified linear unit to obtain a feature map of dimension 320 × 96 × 128;
Convolutional layer 1_3: convolve the feature map output by convolutional layer 1_2 with a 5 × 5 receptive field, stride 2 and zero padding 2, then activate with a rectified linear unit to obtain a feature map of dimension 160 × 48 × 256;
Convolutional layer 1_4: convolve the feature map output by convolutional layer 1_3 with a 3 × 3 receptive field, stride 2 and zero padding 1, then activate with a rectified linear unit to obtain a feature map of dimension 80 × 24 × 512;
Convolutional layer 1_5: convolve the feature map output by convolutional layer 1_4 with a 3 × 3 receptive field, stride 2 and zero padding 1, then activate with a rectified linear unit to obtain a feature map of dimension 40 × 12 × 512;
Convolutional layer 1_6: convolve the feature map output by convolutional layer 1_5 with a 3 × 3 receptive field, stride 2 and zero padding 1 to obtain a feature map of dimension 20 × 6 × 1024.
Substep 2: rotation-sensitive module designed for extracting vehicle rotation characteristic information
To accurately extract the vehicle rotation features in the input tensor, a rotation-sensitive module is designed. It comprises 5 convolutional layers; the receptive field size of each layer is n_f × n_f, where n_f takes the values 7, 5, 3, 3, 3. Zero padding is likewise used so that each convolutional layer adapts to its receptive field configuration. The channel numbers C_R of the convolutional layers are 64, 128, 256, 512 and 1024, and rectified linear units are used as activation functions. To better extract image rotation features, a max pooling operation is adopted; the number of pooling layers n_P is 1, with a 2 × 2 kernel and stride 2. The specific structure of the rotation-sensitive module is as follows:
Convolutional layer 2_1: convolve the 1280 × 384 × 6 input tensor with a 7 × 7 receptive field, stride 4, with 64 channels, then activate with a rectified linear unit to obtain a feature map of dimension 320 × 96 × 64;
Convolutional layer 2_2: convolve the feature map output by convolutional layer 2_1 with a 5 × 5 receptive field, stride 2, with 128 channels, then activate with a rectified linear unit to obtain a feature map of dimension 160 × 48 × 128;
Pooling layer: apply 2 × 2 max pooling to the feature map output by convolutional layer 2_2 to obtain a feature map of dimension 160 × 48 × 128;
Convolutional layer 2_3: convolve the feature map output by the pooling layer with a 3 × 3 receptive field, stride 2, with 256 channels, then activate with a rectified linear unit to obtain a feature map of dimension 80 × 24 × 256;
Convolutional layer 2_4: convolve the feature map output by convolutional layer 2_3 with a 3 × 3 receptive field, stride 2, with 512 channels, then activate with a rectified linear unit to obtain a feature map of dimension 40 × 12 × 512;
Convolutional layer 2_5: convolve the feature map output by convolutional layer 2_4 with a 3 × 3 receptive field, stride 2, with 1024 channels, then activate with a rectified linear unit to obtain a feature map of dimension 20 × 6 × 1024.
Substep 3: recurrent neural network module designed for extracting vehicle motion state time-related features
The motion process of the vehicle has time-dependent characteristics in addition to spatial translation and rotation. A recurrent neural network is well suited to processing serialized data but not high-dimensional raw data such as images, so the invention takes the features extracted by the parallel deep convolutional network (i.e., the combination of the features extracted by the translation-sensitive and rotation-sensitive modules) as the input of the recurrent neural network. Because long short-term memory networks can model deep temporal and dynamic relations, the recurrent neural network in the invention is constructed by cascading two long short-term memory layers, each with 1000 hidden states. From the per-frame tensor features extracted by the parallel deep convolutional network, the recurrent neural network estimates the 6-dimensional vehicle state, namely the three-dimensional speed (v_x, v_y, v_z) and the three-dimensional angular velocity (ω_x, ω_y, ω_z) of the vehicle. The specific structure of the recurrent neural network module is as follows:
Long short-term memory layer 1: using its 1000 hidden states, learn from the feature map output by the parallel deep convolutional network and the vehicle state output at the previous frame to obtain the state output for the current frame;
Long short-term memory layer 2: using its 1000 hidden states, learn from the state output by long short-term memory layer 1 and the vehicle state output at the previous frame, and output the 6-dimensional vehicle state information.
Step three: optimization of network parameters using weight balance based mean square error iterative method
The optimization objective of the parallel deep convolutional-recurrent neural network is the probability of the vehicle state parameters up to time t conditioned on the image sequence, i.e.
p(Y_t | X_t) = p(y_1, ..., y_t | x_1, ..., x_t)    (1)
In equation (1), Y_t = (y_1, ..., y_t) denotes all vehicle state parameters up to a given time t, and X_t = (x_1, ..., x_t) denotes the image sequence up to time t.
In order to obtain an optimal estimate of the vehicle state parameters, the conditional probability in equation (1) should be maximized; introducing the network hyper-parameters θ, the optimum is
θ* = argmax_θ p(Y_t | X_t; θ)    (2)
The true value of the vehicle state at time k is Y_k = [v_k, ω_k], and the estimate computed by the network is Ŷ_k = [v̂_k, ω̂_k].
The hyper-parameters θ are updated using the weight-balanced mean square error of the vehicle state over all time steps, so that θ* is obtained and the estimated outputs come as close as possible to the true values; the process is
θ* = argmin_θ (1/N) Σ_{k=1}^{N} ( ρ_1‖v̂_k − v_k‖² + ρ_2‖ω̂_k − ω_k‖² )    (3)
In equation (3), N is the number of time steps, v_k = [v_x v_y v_z]_k is the three-dimensional speed of the vehicle at time k, and ω_k = [ω_x ω_y ω_z]_k is the three-dimensional angular velocity of the vehicle at time k; ‖·‖ is the two-norm of the vehicle state parameter; ρ_1 and ρ_2 are scale factors that balance the weights of the speed and angular velocity state quantities, and these parameters need to be tuned manually according to the training effect during network training.
With the above network parameter optimization algorithm, the designed parallel deep convolutional-recurrent neural network is trained using samples from the training data set. To improve the accuracy of the training result and the generalization of the network, the network is pre-trained before formal training, and the parameters obtained from pre-training are then fine-tuned; the training process is shown in fig. 3 and specifically comprises the following sub-steps:
substep 1: selecting an image sequence dataset to pre-train a network
Select vehicle-motion image sequence data acquired by a monocular camera with a small sample size, or the KITTI data set, adjust the images according to the method in step one, and denote the processed data set D_p. Then use D_p for network pre-training, with the maximum number of iterations set to I_p, the learning rate to α_p, the weight to λ_p, and the scale factors to ρ_p1 and ρ_p2, and save the network parameters obtained from pre-training;
substep 2: fine tuning of network parameters using established training data sets
Using the data set D_T established in step one, fine-tune the network parameters obtained from pre-training in sub-step 1 of step three, with the maximum number of iterations set to I_T, the learning rate to α_T, the weight to λ_T, and the scale factors to ρ_T1 and ρ_T2; then adjust the network parameters according to the evolution of the training and validation loss curves until the network parameters are optimal, i.e., the optimum of equation (3) is reached.
Step four: vehicle state parameter sensing using trained networks
The vehicle-motion image sequence acquired by the monocular camera is preprocessed according to the method in step one and fed as input to the trained parallel deep convolutional-recurrent neural network, which outputs the 6-dimensional vehicle state information, including the three-dimensional speed and three-dimensional angular velocity of the vehicle.
The proposed vehicle state sensing method uses only a monocular camera and therefore has low cost; the proposed parallel deep convolutional-recurrent neural network architecture fully accounts for the various characteristics of vehicle motion, including the translational motion features, rotational motion features and motion-state time-correlation features of the vehicle, and the estimated vehicle state information is multidimensional, generalizes well, and is highly accurate.

Claims (1)

1. An accurate vehicle state perception method based on a multi-feature deep fusion neural network, characterized in that: parallel convolutional neural networks are first used to extract the rotation geometric features and translation geometric features in the input tensor respectively, a recurrent neural network is then used to learn the dynamic relations and time-correlation characteristics between the extracted features, and finally the three-dimensional speed and three-dimensional angular velocity information of the vehicle are estimated, wherein the three-dimensional speed comprises a longitudinal speed v_x, a lateral speed v_y and a vertical speed v_z, and the three-dimensional angular velocity comprises a roll angular velocity ω_x, a pitch angular velocity ω_y and a yaw angular velocity ω_z; the method specifically comprises the following steps:
the method comprises the following steps: determining network input and output, and establishing training data set
Acquiring sequence images of the vehicle in motion with a monocular camera in a non-open environment, and then annotating the RGB image sequence acquired by the monocular camera, i.e., labeling the vehicle state information corresponding to each image acquisition time;
Subtracting the average RGB values of the training set from each frame of image data, resizing the images to 1280 × 384 × 3, and stacking two consecutive frames into a tensor that is fed to the deep neural network, so that the dimension of the input quantity is 1280 × 384 × 6; the output of the network is the three-dimensional speed and three-dimensional angular velocity information of the vehicle to be perceived; and the input and output quantities at the same time are paired to form a training data set, denoted D_T;
Step two: designing parallel deep convolutional-recurrent neural network architecture
The motion of the vehicle in space is decomposed into translation and rotation, so the architecture comprises a parallel rotation-sensitive module and translation-sensitive module: for each frame of the input tensor, the rotation-sensitive module extracts the rotation feature information of the vehicle and the translation-sensitive module extracts the translation feature information of the vehicle, and the features learned by the deep convolutional neural network compress the original high-dimensional RGB image into a compact description, improving the efficiency of continuous-sequence training; the feature information extracted by the parallel deep convolution is then fed to the recurrent neural network module to learn the time-correlation features among consecutive frames, so as to accurately perceive the vehicle state information; this specifically comprises the following sub-steps:
substep 1: translation-sensitive module designed for extracting vehicle translation characteristic information
The translation-sensitive module consists of 6 convolutional layers with gradually decreasing receptive fields; the receptive field size of each layer is F_T × F_T, where F_T takes the values 7, 5, 5, 3, 3, 3; to adapt each convolutional layer to its receptive field configuration while preserving the spatial dimensions of the post-convolution tensor and reducing feature loss, zero padding of size P_T is introduced for each convolutional layer, with per-layer values 3, 2, 2, 1, 1, 1; as the number of convolutional layers increases, the number of filters used for feature extraction in each layer, i.e., the number of channels, also increases, the corresponding channel numbers C_T being 64, 128, 256, 512, 512, 1024; each convolutional layer uses a rectified linear unit as its activation function, and the specific structure of the translation-sensitive module is as follows:
Convolutional layer 1_1: convolve the 1280 × 384 × 6 input tensor with a 7 × 7 receptive field, stride 2 and zero padding 3, then activate with a rectified linear unit to obtain a feature map of dimension 640 × 192 × 64;
Convolutional layer 1_2: convolve the feature map output by convolutional layer 1_1 with a 5 × 5 receptive field, stride 2 and zero padding 2, then activate with a rectified linear unit to obtain a feature map of dimension 320 × 96 × 128;
Convolutional layer 1_3: convolve the feature map output by convolutional layer 1_2 with a 5 × 5 receptive field, stride 2 and zero padding 2, then activate with a rectified linear unit to obtain a feature map of dimension 160 × 48 × 256;
Convolutional layer 1_4: convolve the feature map output by convolutional layer 1_3 with a 3 × 3 receptive field, stride 2 and zero padding 1, then activate with a rectified linear unit to obtain a feature map of dimension 80 × 24 × 512;
Convolutional layer 1_5: convolve the feature map output by convolutional layer 1_4 with a 3 × 3 receptive field, stride 2 and zero padding 1, then activate with a rectified linear unit to obtain a feature map of dimension 40 × 12 × 512;
Convolutional layer 1_6: convolve the feature map output by convolutional layer 1_5 with a 3 × 3 receptive field, stride 2 and zero padding 1 to obtain a feature map of dimension 20 × 6 × 1024;
substep 2: rotation-sensitive module designed for extracting vehicle rotation characteristic information
The rotation-sensitive module comprises 5 convolutional layers; the receptive field size of each layer is n_f × n_f, where n_f takes the values 7, 5, 3, 3, 3; zero padding is likewise used so that each convolutional layer adapts to its receptive field configuration; the channel numbers C_R of the convolutional layers are 64, 128, 256, 512 and 1024, and rectified linear units are used as activation functions; the number of pooling layers n_P is 1, with a 2 × 2 kernel and stride 2; the specific structure of the rotation-sensitive module is as follows:
Convolutional layer 2_1: convolve the 1280 × 384 × 6 input tensor with a 7 × 7 receptive field, stride 4, with 64 channels, then activate with a rectified linear unit to obtain a feature map of dimension 320 × 96 × 64;
Convolutional layer 2_2: convolve the feature map output by convolutional layer 2_1 with a 5 × 5 receptive field, stride 2, with 128 channels, then activate with a rectified linear unit to obtain a feature map of dimension 160 × 48 × 128;
Pooling layer: apply 2 × 2 max pooling to the feature map output by convolutional layer 2_2 to obtain a feature map of dimension 160 × 48 × 128;
Convolutional layer 2_3: convolve the feature map output by the pooling layer with a 3 × 3 receptive field, stride 2, with 256 channels, then activate with a rectified linear unit to obtain a feature map of dimension 80 × 24 × 256;
Convolutional layer 2_4: convolve the feature map output by convolutional layer 2_3 with a 3 × 3 receptive field, stride 2, with 512 channels, then activate with a rectified linear unit to obtain a feature map of dimension 40 × 12 × 512;
Convolutional layer 2_5: convolve the feature map output by convolutional layer 2_4 with a 3 × 3 receptive field, stride 2, with 1024 channels, then activate with a rectified linear unit to obtain a feature map of dimension 20 × 6 × 1024;
substep 3: recurrent neural network module designed for extracting vehicle motion state time-related features
The features extracted by the parallel deep convolutional network, i.e., the combination of the features extracted by the translation-sensitive module and the rotation-sensitive module, are taken as the input of the recurrent neural network; the recurrent neural network is constructed by cascading two long short-term memory layers, each with 1000 hidden states; from the per-frame tensor features extracted by the parallel deep convolutional network, the recurrent neural network estimates the 6-dimensional vehicle state information, including the three-dimensional speed and three-dimensional angular velocity of the vehicle, and the specific structure of the recurrent neural network module is as follows:
Long short-term memory layer 1: using its 1000 hidden states, learn from the feature map output by the parallel deep convolutional network and the vehicle state output at the previous frame to obtain the state output for the current frame;
Long short-term memory layer 2: using its 1000 hidden states, learn from the state output by long short-term memory layer 1 and the vehicle state output at the previous frame, and output the 6-dimensional vehicle state information;
step three: optimization of network parameters using weight balance based mean square error iterative method
The optimization objective of the parallel deep convolutional-recurrent neural network is the probability of the vehicle state parameters up to time t conditioned on the image sequence, i.e.
p(Y_t | X_t) = p(y_1, ..., y_t | x_1, ..., x_t)    (1)
In equation (1), Y_t = (y_1, ..., y_t) denotes all vehicle state parameters up to a given time t, and X_t = (x_1, ..., x_t) denotes the image sequence up to time t;
in order to obtain an optimal estimation of the vehicle state parameters, the conditional probability of equation (1) should be maximized, introducing parameters:
Figure RE-FDA0002973437330000031
vehicle state at time kTrue value of (a) is Yk=[vkk]The estimated value obtained by the network calculation is
Figure RE-FDA0002973437330000032
The hyper-parameters θ are updated using the weight-balanced mean square error of the vehicle state over all time steps, so that θ* is obtained and the estimated outputs come as close as possible to the true values; the process is
θ* = argmin_θ (1/N) Σ_{k=1}^{N} ( ρ_1‖v̂_k − v_k‖² + ρ_2‖ω̂_k − ω_k‖² )    (3)
In equation (3), N is the number of time steps, v_k = [v_x v_y v_z]_k is the three-dimensional speed of the vehicle at time k, and ω_k = [ω_x ω_y ω_z]_k is the three-dimensional angular velocity of the vehicle at time k; ‖·‖ is the two-norm of the vehicle state parameter; ρ_1 and ρ_2 are scale factors that balance the weights of the speed and angular velocity state quantities, and these parameters need to be tuned manually according to the training effect during network training;
in order to improve the accuracy of the training result and the generalization of the network, the network is pre-trained before formal training, and then parameters obtained by pre-training are finely adjusted, which specifically comprises the following substeps:
substep 1: selecting an image sequence dataset to pre-train a network
Selecting vehicle-motion image sequence data acquired by a monocular camera with a small sample size, or the KITTI data set, adjusting the images according to the method in step one, and denoting the processed data set D_p; then using D_p for network pre-training, with the maximum number of iterations set to I_p, the learning rate to α_p, the weight to λ_p, and the scale factors to ρ_p1 and ρ_p2, and saving the network parameters obtained from pre-training;
substep 2: fine tuning of network parameters using established training data sets
Using the data set D_T established in step one, fine-tuning the network parameters obtained from pre-training in sub-step 1 of step three, with the maximum number of iterations set to I_T, the learning rate to α_T, the weight to λ_T, and the scale factors to ρ_T1 and ρ_T2; then adjusting the network parameters according to the evolution of the training and validation loss curves until the network parameters are optimal, i.e., the optimum of equation (3) is reached;
step four: vehicle state parameter sensing using trained networks
Preprocessing the vehicle-motion image sequence acquired by the monocular camera according to the method in step one, and feeding the preprocessed image sequence as input to the trained parallel deep convolutional-recurrent neural network to obtain the 6-dimensional vehicle state information, including the three-dimensional speed and three-dimensional angular velocity of the vehicle.
CN202011583142.7A 2020-12-28 2020-12-28 Vehicle state accurate sensing method based on multi-feature deep fusion neural network Active CN112668473B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011583142.7A CN112668473B (en) 2020-12-28 2020-12-28 Vehicle state accurate sensing method based on multi-feature deep fusion neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011583142.7A CN112668473B (en) 2020-12-28 2020-12-28 Vehicle state accurate sensing method based on multi-feature deep fusion neural network

Publications (2)

Publication Number Publication Date
CN112668473A true CN112668473A (en) 2021-04-16
CN112668473B CN112668473B (en) 2022-04-08

Family

ID=75411091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011583142.7A Active CN112668473B (en) 2020-12-28 2020-12-28 Vehicle state accurate sensing method based on multi-feature deep fusion neural network

Country Status (1)

Country Link
CN (1) CN112668473B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334949A (en) * 2018-02-11 2018-07-27 浙江工业大学 A kind of tachytelic evolution method of optimization depth convolutional neural networks structure
CN110009095A (en) * 2019-03-04 2019-07-12 东南大学 Road driving area efficient dividing method based on depth characteristic compression convolutional network
CN110009648A (en) * 2019-03-04 2019-07-12 东南大学 Trackside image Method of Vehicle Segmentation based on depth Fusion Features convolutional neural networks
CN111829793A (en) * 2020-08-03 2020-10-27 广州导远电子科技有限公司 Driving process comfort evaluation method, device and system based on combined positioning

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114652326A (en) * 2022-01-30 2022-06-24 天津大学 Real-time brain fatigue monitoring device based on deep learning and data processing method
CN115083038A (en) * 2022-06-29 2022-09-20 南京理工大学泰州科技学院 Device and method for detecting vehicle driving abnormity
CN115083038B (en) * 2022-06-29 2024-02-27 南京理工大学泰州科技学院 Device and method for detecting abnormal running of vehicle
CN117877043A (en) * 2024-03-11 2024-04-12 深圳市壹倍科技有限公司 Model training method, text recognition method, device, equipment and medium

Also Published As

Publication number Publication date
CN112668473B (en) 2022-04-08

Similar Documents

Publication Publication Date Title
CN112668473B (en) Vehicle state accurate sensing method based on multi-feature deep fusion neural network
CN110738697B (en) Monocular depth estimation method based on deep learning
CN110111335B (en) Urban traffic scene semantic segmentation method and system for adaptive countermeasure learning
WO2020177217A1 (en) Method of segmenting pedestrians in roadside image by using convolutional network fusing features at different scales
CN107369166B (en) Target tracking method and system based on multi-resolution neural network
CN108549866B (en) Remote sensing airplane identification method based on dense convolutional neural network
CN112364931A (en) Low-sample target detection method based on meta-feature and weight adjustment and network model
CN111882620A (en) Road drivable area segmentation method based on multi-scale information
CN112101117A (en) Expressway congestion identification model construction method and device and identification method
WO2022115987A1 (en) Method and system for automatic driving data collection and closed-loop management
CN117557922B (en) Unmanned aerial vehicle aerial photographing target detection method with improved YOLOv8
CN113901897A (en) Parking lot vehicle detection method based on DARFNet model
CN111079543B (en) Efficient vehicle color identification method based on deep learning
CN113962281A (en) Unmanned aerial vehicle target tracking method based on Siamese-RFB
CN116630932A (en) Road shielding target detection method based on improved YOLOV5
CN116258940A (en) Small target detection method for multi-scale features and self-adaptive weights
CN111695447A (en) Road travelable area detection method based on twin feature enhancement network
Ding LENet: Lightweight and efficient LiDAR semantic segmentation using multi-scale convolution attention
CN113076988B (en) Mobile robot vision SLAM key frame self-adaptive screening method based on neural network
CN110674676B (en) Road confidence estimation fuzzy frame method based on semantic segmentation
CN116630702A (en) Pavement adhesion coefficient prediction method based on semantic segmentation network
CN116363610A (en) Improved YOLOv 5-based aerial vehicle rotating target detection method
CN111160089A (en) Trajectory prediction system and method based on different vehicle types
CN112131996B (en) Road side image multi-scale pedestrian rapid detection method based on channel separation convolution
CN113392695B (en) Highway truck and wheel axle identification method thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant