CN113838135A - Pose estimation method, system and medium based on an LSTM dual-stream convolutional neural network - Google Patents

Pose estimation method, system and medium based on an LSTM dual-stream convolutional neural network

Info

Publication number
CN113838135A
CN113838135A (application CN202111181525.6A)
Authority
CN
China
Prior art keywords
depth
feature map
neural network
image
color
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111181525.6A
Other languages
Chinese (zh)
Other versions
CN113838135B (en)
Inventor
罗元
曾勇超
胡章芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202111181525.6A
Publication of CN113838135A
Application granted
Publication of CN113838135B
Legal status: Active
Anticipated expiration

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention seeks to protect a pose estimation method, system and medium based on an LSTM dual-stream convolutional neural network. The method comprises the following steps: S1, preprocess the color image and the depth image: concatenate the two adjacent frames of the color image and of the depth image respectively, further preprocess the depth image with MND encoding, and finally normalize the color image and the depth image; S2, input the preprocessed color image and depth image into the color stream and the depth stream of a dual-stream convolutional neural network, respectively, for feature extraction; S3, fuse the RGB feature map output by the color stream with the depth feature map output by the depth stream to generate a new fused feature map; S4, perform global mean pooling on the newly generated fused feature map; and S5, predict the current pose with a trained LSTM neural network. The results show that the proposed pose estimation model has higher accuracy and robustness under motion blur and insufficient light.

Description

Pose estimation method, system and medium based on an LSTM dual-stream convolutional neural network
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a pose estimation method based on an LSTM dual-stream convolutional neural network.
Background
The intelligent manufacturing proposed by Industry 4.0 is oriented to the whole life cycle of a product and realizes information-based manufacturing under ubiquitous sensing. Intelligent manufacturing technology builds on modern sensing technology, network technology, automation technology and artificial intelligence; through perception, human-machine interaction, decision, execution and feedback it makes product design, manufacturing processes and enterprise management and service intelligent, and it represents the deep integration of information technology and manufacturing technology. The indoor mobile robot is one of the representative products proposed by Industry 4.0 that fuses modern sensing, network and automation technology.
Mobile robot technology is widely applied in resource exploration and development, medical services, home entertainment, military and aerospace fields; for example, Automated Guided Vehicles (AGVs) and cleaning robots are used in logistics transportation and household cleaning. In an intelligent mobile robot, Simultaneous Localization and Mapping (SLAM) is the core technology. The navigation process of a mobile robot can be broken down into three modules: localization, mapping and path planning. Localization determines the pose of the robot in the environment at the current moment; mapping integrates local, continuous observations of the surrounding environment into a globally consistent model; and path planning determines the optimal navigation path in the map.
Artificial intelligence techniques that can simulate human reasoning, judgment and memory are widely applied, for example in face recognition and object classification. Similar to the application of deep learning in face recognition, visual odometry based on the feature point method also requires feature point detection, matching and screening. It is therefore feasible to apply deep learning to the visual odometry component of SLAM; a deep-learning-based visual odometer is closer to the human perception mode and has broad research potential and value. Most existing visual odometry methods go through steps such as feature extraction and matching, motion estimation and local optimization, and are strongly affected by camera parameters, motion blur and insufficient light.
The prior art includes: a target positioning method based on dual-stream convolutional neural network regression learning from depth images (patent application No. 201910624713.8, publication No. CN110443849A). That method uses a binocular camera to capture two photographs simultaneously, recovers depth through image preprocessing to obtain a depth image, and converts the color image into a grayscale image during preprocessing. After preprocessing, the two images are input into convolutional neural networks to extract features, the two sets of features are fused by convolution, and the result is fed into a fully connected layer for regression. The present invention instead uses an RGB-D camera as the sensor, which directly provides the RGB image and the corresponding depth image, so the RGB image does not need to be converted to grayscale. The RGB image and the depth image are preprocessed and input into a dual-stream convolutional neural network to obtain color features from the RGB image and depth features from the depth image; the two feature maps are fed into a feature fusion unit for concatenation-based fusion, and finally the fused features are input into a long short-term memory recurrent neural network (LSTM) for temporal modeling to obtain pose information. Compared with 201910624713.8, the present method differs in sensor, preprocessing method, convolutional neural network structure, feature fusion method and pose estimation method.
After retrieval, the closest prior art is: 201910624713.8, a target positioning method based on dual-stream convolutional neural network regression learning from depth images, characterized by: S1, at each reference position, a binocular camera collects a grayscale image and the corresponding depth image; S2, the grayscale image and the depth image are converted into three-channel images using image preprocessing; S3, a dual-stream CNN with shared weights performs offline regression learning to obtain a distance-based regression model; S4, after preprocessing of the grayscale and depth images, the final distance can be estimated by the distance-based regression model. This is a useful attempt, but it still suffers from poor robustness and large estimation errors.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art by providing a pose estimation method, system and medium based on an LSTM dual-stream convolutional neural network. The technical scheme of the invention is as follows:
A pose estimation method based on an LSTM dual-stream convolutional neural network comprises the following steps:
S1, preprocess the color image and the depth image acquired by an RGB-D camera: concatenate the two adjacent frames of the color image and of the depth image respectively, preprocess the depth image with minimum normal plus depth (MND) encoding, and finally normalize the color image and the depth image; S2, input the preprocessed color image and depth image into the color stream and the depth stream of a dual-stream convolutional neural network, respectively, for feature extraction; S3, fuse the color (RGB) feature map output by the color stream with the depth feature map output by the depth stream to generate a new fused feature map; S4, perform global mean pooling on the newly generated fused feature map; and S5, predict the current pose with the trained LSTM neural network.
Further, the color image preprocessing specifically comprises concatenating adjacent frames of the color image to generate a color image of size 640 × 960. The depth image preprocessing specifically comprises applying MND encoding to the depth image, scaling to obtain the surface-normal components n_x and n_y and using the depth d as the third channel of the image, so that the scaled surface normal [n_x, n_y, d] satisfies the normalization constraint of the MND encoding (equation given as an image in the original). Adjacent frames of the depth image are then concatenated to generate a depth image of size 640 × 960.
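For illustration only, the sketch below shows one possible form of this preprocessing in Python/NumPy. The exact MND scaling constraint appears only as an equation image in the original, so the surface-normal computation (central differences on the depth map followed by unit normalization) and the normalization of the depth channel are assumptions; stacking two adjacent 640 × 480 frames into a 640 × 960 input and the final normalization follow the text above.

import numpy as np

def mnd_encode(depth):
    """Assumed MND (minimum normal + depth) encoding: two surface-normal
    components plus the depth value as a third channel. The exact scaling in
    the patent is shown only as an equation image, so this is illustrative."""
    dz_dy, dz_dx = np.gradient(depth)                       # depth gradients
    normal = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    normal /= np.linalg.norm(normal, axis=-1, keepdims=True) + 1e-8
    d = depth / (depth.max() + 1e-8)                        # assumed depth normalization
    return np.stack([normal[..., 0], normal[..., 1], d], axis=-1).astype(np.float32)

def preprocess_pair(rgb_t, rgb_t1, depth_t, depth_t1):
    """Concatenate two adjacent 640x480 frames into one 640x960 input and normalize."""
    rgb = np.concatenate([rgb_t, rgb_t1], axis=0).astype(np.float32) / 255.0
    dep = np.concatenate([mnd_encode(depth_t), mnd_encode(depth_t1)], axis=0)
    return rgb, dep                                          # both of shape (960, 640, 3)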
Further, step S2 inputs the preprocessed color image and depth image into the color stream and the depth stream of the dual-stream convolutional neural network, respectively, for feature extraction, specifically: a dual-stream convolutional neural network architecture is adopted, in which the color stream and the depth stream have identical structures, each composed of 5 convolutional layers that extract features at different levels of the image, with a ReLU activation unit after each of the first four layers; the preprocessed color image I_rgb serves as the input of the color stream and the preprocessed depth image I_depth as the input of the depth stream, and a color feature map and a depth structure feature map are obtained through the convolution operations, respectively.
Further, the dual-stream convolutional neural network adopts a parallel structure in which each branch consists of five convolutional layers, and the first four convolutional layers of each branch are followed by a ReLU activation unit, given by:
f(x) = max(0, x)   (1)
where x is the input and f(x) is the output of the ReLU unit.
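As a concrete illustration, a PyTorch sketch of one such branch is given below: five convolutional layers with a ReLU unit after each of the first four, as described. The channel widths, kernel sizes and strides are not stated in the text and are assumptions.

import torch.nn as nn

class FeatureStream(nn.Module):
    """One branch (color or depth) of the dual-stream CNN: five conv layers,
    ReLU after the first four only. Layer sizes are assumed for illustration."""
    def __init__(self, in_channels=3):
        super().__init__()
        widths = [64, 128, 256, 256, 512]                   # assumed channel widths
        layers, c_in = [], in_channels
        for i, c_out in enumerate(widths):
            layers.append(nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1))
            if i < 4:                                        # no ReLU after conv5
                layers.append(nn.ReLU(inplace=True))
            c_in = c_out
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        # x: (B, 3, 960, 640) stacked adjacent-frame input; returns the conv5 feature map.
        return self.conv(x)

The color stream and the depth stream would then simply be two separate instances of this module, since the text states that the two streams have identical structures.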
Further, in step S3 the color (RGB) feature map output by the color stream and the depth feature map output by the depth stream are fused to generate a new fused feature map, specifically: the feature maps output by conv5 of the two stream networks are combined to form a new fused feature map, which is passed through batch normalization and a ReLU nonlinear activation unit before global mean pooling; the generated fused feature is expressed as:
X_k = [X_k^{rgb}, X_k^{depth}]   (2)
where X_k is the fused feature map, X_k^{rgb} is the RGB feature map and X_k^{depth} is the depth feature map.
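Reading the fusion as channel-wise concatenation of the two conv5 feature maps (equation (2) is only available as an image, so the concatenation form is an assumption), followed by batch normalization, ReLU and global mean pooling as described, a PyTorch sketch could look as follows.

import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Fuse the conv5 outputs of the two streams, then BatchNorm -> ReLU ->
    global mean pooling. Channel concatenation is an assumed reading of eq. (2)."""
    def __init__(self, rgb_channels=512, depth_channels=512):
        super().__init__()
        self.bn = nn.BatchNorm2d(rgb_channels + depth_channels)
        self.relu = nn.ReLU(inplace=True)
        self.gap = nn.AdaptiveAvgPool2d(1)                   # global mean pooling

    def forward(self, x_rgb, x_depth):
        x_k = torch.cat([x_rgb, x_depth], dim=1)             # fused feature map X_k
        x_k = self.relu(self.bn(x_k))
        return self.gap(x_k).flatten(1)                      # (B, C) vector fed to the LSTM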
Further, in step S5 the current pose is predicted by the trained LSTM neural network, which specifically includes:
performing temporal modeling of the image sequence with the LSTM neural network and predicting the current pose information; the LSTM neural network consists of a forget gate, an input gate and an output gate, and through learning it memorizes information that is useful for estimating the current pose and forgets information that is not; the forget gate controls how much useless information from the previous state is forgotten, according to the formula:
f_k = σ(W_f · [h_{k-1}, x_k] + b_f)   (3)
where f_k is the output of the forget gate, σ is the sigmoid function, W_f is the forget-gate parameter, h_{k-1} is the hidden state at the previous time step, x_k is the input at the current time step, and b_f is the bias of the forget gate;
the input gate decides what information to add to the current state and consists of a selection layer i_k and a candidate layer C̃_k, given by:
i_k = σ(W_i · [h_{k-1}, x_k] + b_i)   (4)
C̃_k = tanh(W_C · [h_{k-1}, x_k] + b_C)   (5)
where W_i is the input-gate parameter, tanh is the hyperbolic tangent function, W_C is the candidate parameter, b_i is the selection-layer bias and b_C is the candidate-layer bias;
the output gate decides what prediction to make, with formula:
o_k = σ(W_o · [h_{k-1}, x_k] + b_o)   (6)
where W_o is the output-gate parameter and b_o is the bias of the output gate;
and finally, the loss function is designed by minimizing the Euclidean distance between the real pose and the estimated pose, the loss function being:
L = (1/N) Σ_{k=1}^{N} ( ||p̂_k - p_k||² + w ||φ̂_k - φ_k||² )   (7)
where N is the number of samples, w is the weight coefficient between position and attitude, (p̂_k, φ̂_k) is the estimated pose and (p_k, φ_k) is the actual pose.
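Equations (3) to (6) are the standard LSTM gate equations, so a cell that writes them out explicitly can serve as a reference; in practice torch.nn.LSTM implements the same computation. The hidden width, the input width of 1024 (matching the two concatenated 512-channel streams in the sketches above) and the 6-DoF pose head are assumptions.

import torch
import torch.nn as nn

class PoseLSTMCell(nn.Module):
    """LSTM cell mirroring equations (3)-(6): forget gate f_k, input gate i_k,
    candidate state C~_k and output gate o_k acting on [h_{k-1}, x_k]."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_f = nn.Linear(input_size + hidden_size, hidden_size)  # forget gate
        self.W_i = nn.Linear(input_size + hidden_size, hidden_size)  # input gate
        self.W_C = nn.Linear(input_size + hidden_size, hidden_size)  # candidate layer
        self.W_o = nn.Linear(input_size + hidden_size, hidden_size)  # output gate

    def forward(self, x_k, h_prev, c_prev):
        z = torch.cat([h_prev, x_k], dim=1)        # [h_{k-1}, x_k]
        f_k = torch.sigmoid(self.W_f(z))           # eq. (3)
        i_k = torch.sigmoid(self.W_i(z))           # eq. (4)
        c_tilde = torch.tanh(self.W_C(z))          # eq. (5)
        c_k = f_k * c_prev + i_k * c_tilde         # cell state update
        o_k = torch.sigmoid(self.W_o(z))           # eq. (6)
        h_k = o_k * torch.tanh(c_k)                # hidden state used for pose regression
        return h_k, c_k

# Assumed usage: regress a 6-DoF pose (translation + rotation) from the hidden state.
hidden_size = 1000
cell = PoseLSTMCell(input_size=1024, hidden_size=hidden_size)
pose_head = nn.Linear(hidden_size, 6)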
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the pose estimation method based on an LSTM dual-stream convolutional neural network described above.
A pose estimation system based on the LSTM dual-stream convolutional neural network and the above method comprises:
a preprocessing module: used for preprocessing the color image and the depth image acquired by an RGB-D camera, concatenating the two adjacent frames of the color image and of the depth image respectively, preprocessing the depth image with minimum normal plus depth (MND) encoding, and finally normalizing the color image and the depth image;
a feature extraction module: used for inputting the preprocessed color image and depth image into the color stream and the depth stream of the dual-stream convolutional neural network, respectively, for feature extraction;
a fusion module: used for fusing the color (RGB) feature map output by the color stream with the depth feature map output by the depth stream to generate a new fused feature map, and performing global mean pooling on the newly generated fused feature map;
a prediction module: used for predicting the current pose with the trained LSTM neural network.
The invention has the following advantages and beneficial effects:
Aiming at the problems that visual odometry is sensitive to camera parameters and strongly affected by motion blur and insufficient light, the invention proposes an LSTM-based dual-stream convolutional neural network in which the contour features extracted by the depth stream complement the color features extracted by the color stream, improving the robustness of the pose estimation system in environments with motion blur and insufficient light.
Testing on the public TUM dataset shows that fusing the contour features extracted from the depth image into the pose estimation gives better robustness in motion-blurred and poorly lit environments. Compared with other pose estimation methods based on convolutional neural networks, the model has smaller estimation error and superior performance.
The method proposes a new pose estimation method based on an LSTM dual-stream convolutional neural network in which, as in step S2 of claim 1, the preprocessed color image and depth image are input into the color stream and the depth stream of the dual-stream convolutional neural network, respectively, for feature extraction. The method provides a new dual-stream convolutional neural network structure: the color stream and the depth stream are identical, each composed of 5 convolutional layers with ReLU activation units after the first four. By introducing depth features through the dual-stream architecture, the system achieves higher accuracy and robustness than other pose regression systems based on convolutional neural networks, especially in challenging environments.
The method further proposes, as in claims 4-6, first using the dual-stream convolutional neural network to extract color features and depth features, then fusing them, and finally performing temporal modeling with the fused features as input to a long short-term memory recurrent neural network (LSTM) to estimate the current pose. Pose estimation amounts to finding the temporal regularities in the image stream, and the long short-term memory network can memorize previous states and discover the association between the current and past moments, which suits the pose regression problem. Common methods instead predict pose information with fully connected layers, which are better suited to object recognition and classification. A sketch of the full pipeline is given below.
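For orientation only, the sketch below wires the described stages together end to end, reusing the FeatureStream and FusionHead sketches given earlier and an nn.LSTM in place of the hand-written cell; the sequence handling, layer sizes and 6-DoF output are assumptions, not the patent's specification.

import torch
import torch.nn as nn

class PoseEstimationNet(nn.Module):
    """End-to-end sketch: dual-stream feature extraction -> fusion with global
    mean pooling -> LSTM temporal modeling -> per-step 6-DoF pose (sizes assumed)."""
    def __init__(self, feat_dim=1024, hidden_size=1000):
        super().__init__()
        self.color_stream = FeatureStream(in_channels=3)
        self.depth_stream = FeatureStream(in_channels=3)     # MND-encoded depth input
        self.fusion = FusionHead(512, 512)
        self.lstm = nn.LSTM(feat_dim, hidden_size, num_layers=2, batch_first=True)
        self.pose_head = nn.Linear(hidden_size, 6)

    def forward(self, rgb_seq, depth_seq):
        # rgb_seq, depth_seq: (B, T, 3, 960, 640) sequences of concatenated frame pairs
        feats = [self.fusion(self.color_stream(rgb_seq[:, t]),
                             self.depth_stream(depth_seq[:, t]))
                 for t in range(rgb_seq.shape[1])]
        h, _ = self.lstm(torch.stack(feats, dim=1))          # (B, T, hidden_size)
        return self.pose_head(h)                             # (B, T, 6) pose estimates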
Drawings
FIG. 1 is a pose estimation framework diagram based on the LSTM dual-stream convolutional neural network according to a preferred embodiment of the present invention;
FIG. 2 is a diagram of the LSTM dual-stream convolutional neural network.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
S1, the color image and the depth image are preprocessed: the two adjacent frames of the color image and of the depth image are concatenated respectively, the depth image is further preprocessed with MND encoding, and finally the color image and the depth image are normalized.
S2, the preprocessed color image and depth image are input into the color stream and the depth stream of the dual-stream convolutional neural network, respectively, for feature extraction. The RGB feature map output by the color stream and the depth feature map output by the depth stream are fused to generate a new fused feature map. The dual-stream convolutional neural network adopts a parallel structure in which each branch consists of five convolutional layers, and the first four convolutional layers of each branch are followed by a ReLU activation unit, expressed as:
f(x) = max(0, x)   (1)
where x is the input and f(x) is the output of the ReLU unit.
The resulting fused feature is expressed as:
X_k = [X_k^{rgb}, X_k^{depth}]   (2)
where X_k is the fused feature map, X_k^{rgb} is the RGB feature map and X_k^{depth} is the depth feature map.
S3, the current pose is predicted with the trained LSTM neural network. The LSTM neural network consists of a forget gate, an input gate and an output gate; through learning it memorizes information that is useful for estimating the current pose and forgets information that is not. The forget gate controls how much useless information from the previous state is forgotten, according to the formula:
f_k = σ(W_f · [h_{k-1}, x_k] + b_f)   (3)
where f_k is the output of the forget gate, σ is the sigmoid function, W_f is the forget-gate parameter, h_{k-1} is the hidden state at the previous time step, x_k is the input at the current time step, and b_f is the bias of the forget gate.
The input gate decides what information to add to the current state and consists of a selection layer i_k and a candidate layer C̃_k, given by:
i_k = σ(W_i · [h_{k-1}, x_k] + b_i)   (4)
C̃_k = tanh(W_C · [h_{k-1}, x_k] + b_C)   (5)
where W_i is the input-gate parameter, tanh is the hyperbolic tangent function, W_C is the candidate parameter, b_i is the selection-layer bias and b_C is the candidate-layer bias.
The output gate decides what prediction to make:
o_k = σ(W_o · [h_{k-1}, x_k] + b_o)   (6)
where W_o is the output-gate parameter and b_o is the bias of the output gate.
Finally, the loss function is designed by minimizing the Euclidean distance between the real pose and the estimated pose; the loss function is:
L = (1/N) Σ_{k=1}^{N} ( ||p̂_k - p_k||² + w ||φ̂_k - φ_k||² )   (7)
where N is the number of samples, w is the weight coefficient between position and attitude, (p̂_k, φ̂_k) is the estimated pose and (p_k, φ_k) is the actual pose.
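Reading equation (7) as a mean of squared Euclidean distances in which w balances the position and attitude terms (the exact formula appears only as an image, and the split of the 6-DoF vector into three translation and three rotation components is an assumption), a minimal sketch of the loss is:

import torch

def pose_loss(pred, target, w=100.0):
    """Assumed form of eq. (7): squared position error plus w times the squared
    attitude error, averaged over N samples. pred/target: (N, 6) pose vectors,
    columns 0-2 translation and 3-5 rotation (assumed layout); w is illustrative."""
    pos_err = torch.sum((pred[:, :3] - target[:, :3]) ** 2, dim=1)
    att_err = torch.sum((pred[:, 3:] - target[:, 3:]) ** 2, dim=1)
    return (pos_err + w * att_err).mean()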
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article or apparatus that comprises the element.
The above examples should be construed as merely illustrative and not limiting the present disclosure in any way. After reading the description of the invention, a person skilled in the art can make various changes or modifications to the invention, and such equivalent changes and modifications likewise fall within the scope of the invention defined by the claims.

Claims (8)

1. A pose estimation method based on an LSTM dual-stream convolutional neural network, characterized by comprising the following steps:
S1, preprocessing the color image and the depth image acquired by an RGB-D camera: concatenating the two adjacent frames of the color image and of the depth image respectively, preprocessing the depth image with minimum normal plus depth (MND) encoding, and finally normalizing the color image and the depth image; S2, inputting the preprocessed color image and depth image into the color stream and the depth stream of a dual-stream convolutional neural network, respectively, for feature extraction; S3, fusing the color (RGB) feature map output by the color stream with the depth feature map output by the depth stream to generate a new fused feature map; S4, performing global mean pooling on the newly generated fused feature map; and S5, predicting the current pose with the trained LSTM neural network.
2. The pose estimation method based on the LSTM dual-stream convolutional neural network according to claim 1, wherein the color image preprocessing specifically comprises concatenating adjacent frames of the color image to generate a color image of size 640 × 960; the depth image preprocessing specifically comprises applying MND encoding to the depth image, scaling to obtain the surface-normal components n_x and n_y and using the depth d as the third channel of the image, so that the scaled surface normal [n_x, n_y, d] satisfies the normalization constraint of the MND encoding (equation given as an image in the original); adjacent frames of the depth image are then concatenated to generate a depth image of size 640 × 960.
3. The pose estimation method based on the LSTM dual-stream convolutional neural network according to claim 1, wherein step S2 inputs the preprocessed color image and depth image into the color stream and the depth stream of the dual-stream convolutional neural network, respectively, for feature extraction, specifically: a dual-stream convolutional neural network architecture is adopted, in which the color stream and the depth stream have identical structures, each composed of 5 convolutional layers that extract features at different levels of the image, with a ReLU activation unit after each of the first four layers; the preprocessed color image I_rgb serves as the input of the color stream and the preprocessed depth image I_depth as the input of the depth stream, and a color feature map and a depth structure feature map are obtained through the convolution operations, respectively.
4. The pose estimation method based on the LSTM dual-stream convolutional neural network according to any one of claims 1-3, wherein the dual-stream convolutional neural network adopts a parallel structure in which each branch consists of five convolutional layers, and the first four convolutional layers of each branch are followed by a ReLU activation unit, given by:
f(x) = max(0, x)   (1)
where x is the input and f(x) is the output of the ReLU unit.
5. The pose estimation method based on the LSTM dual-stream convolutional neural network according to claim 4, wherein in step S3 the color (RGB) feature map output by the color stream and the depth feature map output by the depth stream are fused to generate a new fused feature map, specifically: the feature maps output by conv5 of the two stream networks are combined to form a new fused feature map, which is passed through batch normalization and a ReLU nonlinear activation unit before global mean pooling; the generated fused feature is expressed as:
X_k = [X_k^{rgb}, X_k^{depth}]   (2)
where X_k is the fused feature map, X_k^{rgb} is the RGB feature map and X_k^{depth} is the depth feature map.
6. The pose estimation method based on the LSTM dual-stream convolutional neural network according to claim 5, wherein step S5 predicts the current pose with the trained LSTM neural network, which specifically includes:
performing temporal modeling of the image sequence with the LSTM neural network and predicting the current pose information; the LSTM neural network consists of a forget gate, an input gate and an output gate, and through learning it memorizes information that is useful for estimating the current pose and forgets information that is not; the forget gate controls how much useless information from the previous state is forgotten, according to the formula:
f_k = σ(W_f · [h_{k-1}, x_k] + b_f)   (3)
where f_k is the output of the forget gate, σ is the sigmoid function, W_f is the forget-gate parameter, h_{k-1} is the hidden state at the previous time step, x_k is the input at the current time step, and b_f is the bias of the forget gate;
the input gate decides what information to add to the current state and consists of a selection layer i_k and a candidate layer C̃_k, given by:
i_k = σ(W_i · [h_{k-1}, x_k] + b_i)   (4)
C̃_k = tanh(W_C · [h_{k-1}, x_k] + b_C)   (5)
where W_i is the input-gate parameter, tanh is the hyperbolic tangent function, W_C is the candidate parameter, b_i is the selection-layer bias and b_C is the candidate-layer bias;
the output gate decides what prediction to make, with formula:
o_k = σ(W_o · [h_{k-1}, x_k] + b_o)   (6)
where W_o is the output-gate parameter and b_o is the bias of the output gate;
and finally, the loss function is designed by minimizing the Euclidean distance between the real pose and the estimated pose, the loss function being:
L = (1/N) Σ_{k=1}^{N} ( ||p̂_k - p_k||² + w ||φ̂_k - φ_k||² )   (7)
where N is the number of samples, w is the weight coefficient between position and attitude, (p̂_k, φ̂_k) is the estimated pose and (p_k, φ_k) is the actual pose.
7. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the LSTM dual-stream convolutional neural network-based pose estimation method of any of claims 1 to 6.
8. A pose estimation system based on the LSTM dual-stream convolutional neural network and the method of claims 1-6, characterized by comprising:
a preprocessing module: used for preprocessing the color image and the depth image acquired by an RGB-D camera, concatenating the two adjacent frames of the color image and of the depth image respectively, preprocessing the depth image with minimum normal plus depth (MND) encoding, and finally normalizing the color image and the depth image;
a feature extraction module: used for inputting the preprocessed color image and depth image into the color stream and the depth stream of the dual-stream convolutional neural network, respectively, for feature extraction;
a fusion module: used for fusing the color (RGB) feature map output by the color stream with the depth feature map output by the depth stream to generate a new fused feature map, and performing global mean pooling on the newly generated fused feature map;
a prediction module: used for predicting the current pose with the trained LSTM neural network.
CN202111181525.6A 2021-10-11 2021-10-11 Pose estimation method, system and medium based on LSTM double-flow convolutional neural network Active CN113838135B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111181525.6A CN113838135B (en) 2021-10-11 2021-10-11 Pose estimation method, system and medium based on LSTM double-flow convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111181525.6A CN113838135B (en) 2021-10-11 2021-10-11 Pose estimation method, system and medium based on LSTM double-flow convolutional neural network

Publications (2)

Publication Number Publication Date
CN113838135A true CN113838135A (en) 2021-12-24
CN113838135B CN113838135B (en) 2024-03-19

Family

ID=78968495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111181525.6A Active CN113838135B (en) 2021-10-11 2021-10-11 Pose estimation method, system and medium based on LSTM double-flow convolutional neural network

Country Status (1)

Country Link
CN (1) CN113838135B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114615183A (en) * 2022-03-14 2022-06-10 广东技术师范大学 Routing method and device based on resource prediction, computer equipment and storage medium
CN115577755A (en) * 2022-11-28 2023-01-06 中环服(成都)科技有限公司 Robot posture correction method, apparatus, computer device, and storage medium
CN116704026A (en) * 2023-05-24 2023-09-05 国网江苏省电力有限公司南京供电分公司 Positioning method, positioning device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163909A (en) * 2018-02-12 2019-08-23 北京三星通信技术研究有限公司 For obtaining the method, apparatus and storage medium of equipment pose
CN110473254A (en) * 2019-08-20 2019-11-19 北京邮电大学 A kind of position and orientation estimation method and device based on deep neural network
CN111796681A (en) * 2020-07-07 2020-10-20 重庆邮电大学 Self-adaptive sight estimation method and medium based on differential convolution in man-machine interaction
CN111833400A (en) * 2020-06-10 2020-10-27 广东工业大学 Camera position and posture positioning method
CN112819853A (en) * 2021-02-01 2021-05-18 太原理工大学 Semantic prior-based visual odometer method
WO2021098766A1 (en) * 2019-11-20 2021-05-27 北京影谱科技股份有限公司 Orb feature visual odometer learning method and device based on image sequence

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163909A (en) * 2018-02-12 2019-08-23 北京三星通信技术研究有限公司 For obtaining the method, apparatus and storage medium of equipment pose
CN110473254A (en) * 2019-08-20 2019-11-19 北京邮电大学 A kind of position and orientation estimation method and device based on deep neural network
WO2021098766A1 (en) * 2019-11-20 2021-05-27 北京影谱科技股份有限公司 Orb feature visual odometer learning method and device based on image sequence
CN111833400A (en) * 2020-06-10 2020-10-27 广东工业大学 Camera position and posture positioning method
CN111796681A (en) * 2020-07-07 2020-10-20 重庆邮电大学 Self-adaptive sight estimation method and medium based on differential convolution in man-machine interaction
CN112819853A (en) * 2021-02-01 2021-05-18 太原理工大学 Semantic prior-based visual odometer method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114615183A (en) * 2022-03-14 2022-06-10 广东技术师范大学 Routing method and device based on resource prediction, computer equipment and storage medium
CN114615183B (en) * 2022-03-14 2023-09-05 广东技术师范大学 Routing method, device, computer equipment and storage medium based on resource prediction
CN115577755A (en) * 2022-11-28 2023-01-06 中环服(成都)科技有限公司 Robot posture correction method, apparatus, computer device, and storage medium
CN116704026A (en) * 2023-05-24 2023-09-05 国网江苏省电力有限公司南京供电分公司 Positioning method, positioning device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113838135B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
CN111127513B (en) Multi-target tracking method
US11823429B2 (en) Method, system and device for difference automatic calibration in cross modal target detection
CN113838135B (en) Pose estimation method, system and medium based on LSTM double-flow convolutional neural network
CN107369166B (en) Target tracking method and system based on multi-resolution neural network
CN108764107B (en) Behavior and identity combined identification method and device based on human body skeleton sequence
CN110781262B (en) Semantic map construction method based on visual SLAM
Akai et al. Simultaneous pose and reliability estimation using convolutional neural network and Rao–Blackwellized particle filter
Wozniak et al. Scene recognition for indoor localization of mobile robots using deep CNN
CN112749726B (en) Training method and device for target detection model, computer equipment and storage medium
CN114708435B (en) Obstacle size prediction and uncertainty analysis method based on semantic segmentation
CN110969648A (en) 3D target tracking method and system based on point cloud sequence data
CN111709471A (en) Object detection model training method and object detection method and device
Ait Abdelali et al. An adaptive object tracking using Kalman filter and probability product kernel
Kadim et al. Deep-learning based single object tracker for night surveillance.
Sharjeel et al. Real time drone detection by moving camera using COROLA and CNN algorithm
Hwang et al. Interactions between specific human and omnidirectional mobile robot using deep learning approach: SSD-FN-KCF
Salimpour et al. Self-calibrating anomaly and change detection for autonomous inspection robots
CN110992404A (en) Target tracking method, device and system and storage medium
CN112529025A (en) Data processing method and device
Omidshafiei et al. Hierarchical bayesian noise inference for robust real-time probabilistic object classification
Dong et al. Combination of modified U‐Net and domain adaptation for road detection
Alvar et al. Mixture of merged gaussian algorithm using RTDENN
CN111008992B (en) Target tracking method, device and system and storage medium
Kim et al. Adaptive surveillance algorithms based on the situation analysis
Qayyum et al. Deep convolutional neural network processing of aerial stereo imagery to monitor vulnerable zones near power lines

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant