CN113838135B - Pose estimation method, system and medium based on LSTM double-flow convolutional neural network - Google Patents


Info

Publication number
CN113838135B
Authority
CN
China
Prior art keywords
depth
flow
color
feature map
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111181525.6A
Other languages
Chinese (zh)
Other versions
CN113838135A (en)
Inventor
罗元
曾勇超
胡章芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202111181525.6A
Publication of CN113838135A
Application granted
Publication of CN113838135B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Abstract

The invention discloses a pose estimation method, system and medium based on an LSTM double-flow convolutional neural network, wherein the method comprises the following steps: S1, preprocessing a color image and a depth image: cascading two adjacent frames of the color image and of the depth image respectively, further preprocessing the depth image with MND (minimum normal + depth) encoding, and finally normalizing the color image and the depth image; S2, inputting the preprocessed color image and depth image into the color stream and depth stream of a double-flow convolutional neural network, respectively, for feature extraction; S3, fusing the RGB feature map output by the color stream and the depth feature map output by the depth stream to generate a new fusion feature map; S4, applying global average pooling to the newly generated fusion feature map; S5, predicting the current pose through training with an LSTM neural network. Results show that the pose estimation model provided by the method achieves higher accuracy and robustness under motion blur and insufficient light.

Description

Pose estimation method, system and medium based on LSTM double-flow convolutional neural network
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a pose estimation method based on an LSTM double-flow convolutional neural network.
Background
Intelligent manufacturing, as proposed by Industry 4.0, is oriented to the full product life cycle and realizes informatized manufacturing under ubiquitous sensing conditions. Intelligent manufacturing technology builds on modern sensing technology, network technology, automation technology and artificial intelligence; through perception, human-machine interaction, decision-making, execution and feedback, it makes the product design process, the manufacturing process, and enterprise management and service intelligent, representing the deep fusion and integration of information technology and manufacturing technology. Indoor mobile robots are one of the representative products that integrate the modern sensing, network and automation technologies promoted by Industry 4.0.
Mobile robots are widely used in fields such as resource exploration and development, medical services, home entertainment, the military and aerospace; for example, Automated Guided Vehicles (AGVs) and cleaning robots have already been deployed in logistics transportation and household cleaning. In intelligent mobile robots, simultaneous localization and mapping (Simultaneous Localization and Mapping, SLAM) is the core technology. The navigation process of a mobile robot can be decomposed into three modules: localization, mapping and path planning. Localization determines the pose of the robot in the environment at the current moment; mapping integrates locally continuous observations of the surrounding environment into a globally consistent model; path planning determines the optimal navigation path within the map.
Artificial intelligence techniques, which can now simulate human reasoning, judgment and memory, are widely applied in areas such as face recognition and object classification. Similar to the application of deep learning in face recognition, visual odometry based on the feature-point method also needs to detect, match and screen feature points. Applying deep learning to the visual odometry component of SLAM is therefore feasible; deep-learning-based visual odometry is closer to the human mode of perception and has broad research potential and value. Most existing visual odometry methods go through stages such as feature extraction and matching, motion estimation and local optimization, and are strongly affected by camera parameters, motion blur and insufficient light.
The prior art includes: a target positioning method based on depth-image double-flow convolutional neural network regression learning (patent application number: 201910624713.8, patent publication number: CN 110443849A). In that method, a binocular camera captures two pictures simultaneously, a depth image is obtained through depth restoration in image preprocessing, and the color image is converted into a grayscale image during preprocessing. After preprocessing, the two types of images are input into separate convolutional neural networks for feature extraction, the two sets of features are fused by convolutional feature fusion, and the result is finally fed into a fully connected layer for regression. The present invention instead adopts an RGB-D camera as the sensor, which directly acquires the RGB image and the corresponding depth image without converting the RGB image into a grayscale image. After preprocessing, the RGB image and the depth image are input into a double-flow convolutional neural network to obtain the color features of the RGB image and the depth features of the depth image respectively; the two feature maps are input into a feature fusion unit for concatenation-based feature fusion, and the fused features are finally input into a long short-term memory recurrent neural network (LSTM) for time-sequence modeling to obtain pose information. Compared with 201910624713.8, the method differs in sensor, preprocessing method, convolutional neural network structure, feature fusion method and pose estimation method.
Through the search, the closest prior art is 201910624713.8, a target positioning method based on double-flow convolutional neural network regression learning from depth images, characterized by: S1, at each reference position, a binocular camera collects grayscale images and their corresponding depth images; S2, the grayscale image and the depth image are converted into three-channel images using image preprocessing techniques; S3, a double-flow CNN with shared weight coefficients performs offline regression learning to obtain a distance-based regression model; S4, after preprocessing of the grayscale image and the depth image, the final distance is estimated by the distance-based regression model. It makes some beneficial attempts, but still suffers from poor robustness and large pose estimation errors.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art by providing a pose estimation method, system and medium based on an LSTM double-flow convolutional neural network. The technical scheme of the invention is as follows:
a pose estimation method based on an LSTM double-flow convolutional neural network comprises the following steps:
s1, preprocessing a color image and a depth image acquired by an RGB-D camera, respectively cascading two adjacent frames of color images and depth images, preprocessing the depth images by adopting a MND (minimum normal+depth) method, and finally normalizing the color images and the depth images; s2, respectively inputting the preprocessed color image and the preprocessed depth image into a color flow and a depth flow of a double-flow convolutional neural network to perform feature extraction; s3, fusing the color feature map rgb feature map output by the color flow and the depth feature map depth feature map output by the depth flow to generate a new fused feature map fusion feature map; s4, carrying out global average pooling treatment on the newly generated fusion feature map; s5, predicting the current pose by training through the LSTM neural network.
Further, the color image preprocessing specifically consists of cascading adjacent frames of the color image to generate a color image of size 640 × 960. The depth image preprocessing first applies MND encoding to the depth image: the horizontal and vertical components of the surface normal are scaled to n_x and n_y, and the depth d is taken as the third channel of the image, so that the scaled surface normal [n_x, n_y, d] satisfies n_x^2 + n_y^2 + d^2 = 1. Adjacent frames of the depth image are then cascaded to generate a depth image of size 640 × 960.
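For illustration only, the following Python sketch shows one possible realization of this preprocessing; it is a minimal sketch under stated assumptions. The patent does not specify how the surface normal is computed, so the depth-gradient construction below (np.gradient and its sign convention) is hypothetical; only the unit-norm constraint on [n_x, n_y, d] and the cascading of two adjacent frames into a 640 × 960 image come from the text.

```python
# Hypothetical sketch of the S1 preprocessing; the normal computation is an
# assumption, the unit-norm constraint and frame cascading follow the patent.
import numpy as np

def mnd_encode(depth: np.ndarray) -> np.ndarray:
    """Encode an HxW depth map as 3 channels [n_x, n_y, d] of unit norm."""
    gy, gx = np.gradient(depth)                # assumed surface-slope estimate
    v = np.stack([-gx, -gy, depth], axis=-1)   # assumed construction of [n_x, n_y, d]
    norm = np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8
    return v / norm                            # enforces n_x^2 + n_y^2 + d^2 = 1

def cascade(frame_a: np.ndarray, frame_b: np.ndarray) -> np.ndarray:
    """Stack two adjacent 480x640 frames vertically into one 960x640 image."""
    return np.concatenate([frame_a, frame_b], axis=0)

# Example: two adjacent depth frames become one MND-encoded 960x640x3 input.
d0, d1 = np.random.rand(480, 640), np.random.rand(480, 640)
depth_input = cascade(mnd_encode(d0), mnd_encode(d1))   # shape (960, 640, 3)
```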
Further, in step S2, the preprocessed color image and depth image are input into the color stream and depth stream of the double-flow convolutional neural network, respectively, for feature extraction, specifically: a double-flow convolutional neural network architecture is adopted in which the color stream and the depth stream have identical structures, each composed of 5 convolutional layers that extract features at different levels of the image, with ReLU activation units after the first four layers; the preprocessed color image I_rgb serves as the input of the color stream and the preprocessed depth image I_depth as the input of the depth stream, and a color feature map and a depth structure feature map are obtained through the respective convolution operations.
Further, the double-flow convolutional neural network adopts a parallel structure in which each branch consists of five convolutional layers, and the first four convolutional layers of each branch are followed by a ReLU activation unit, expressed as:

f(x) = max(0, x)    (1)

where x is the input and f(x) is the output after passing through the ReLU unit.
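As a non-authoritative illustration, a parallel two-branch network of this shape can be sketched in PyTorch as follows; the channel widths, kernel sizes and strides are assumptions, since the patent specifies only five convolutional layers per branch with ReLU units after the first four.

```python
# Hypothetical sketch of the double-flow CNN: two identical parallel branches,
# five conv layers each, ReLU f(x) = max(0, x) after the first four only.
import torch
import torch.nn as nn

def make_stream() -> nn.Sequential:
    layers, chs = [], [3, 64, 128, 256, 256, 512]   # channel widths are assumptions
    for i in range(5):
        layers.append(nn.Conv2d(chs[i], chs[i + 1], kernel_size=3, stride=2, padding=1))
        if i < 4:                                   # no activation after conv5
            layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)

color_stream, depth_stream = make_stream(), make_stream()

x_rgb = torch.randn(1, 3, 960, 640)     # cascaded color frame pair I_rgb
x_depth = torch.randn(1, 3, 960, 640)   # cascaded MND-encoded pair I_depth
rgb_map, depth_map = color_stream(x_rgb), depth_stream(x_depth)   # conv5 outputs
```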
Further, in step S3, fusing the RGB feature map output by the color stream and the depth feature map output by the depth stream to generate a new fusion feature map specifically includes: the feature maps output by conv5 in the two data-stream networks are combined to form a new fusion feature map, which is passed through batch normalization and a ReLU nonlinear activation unit before global average pooling. The generated fusion feature is expressed as:

X_k = [X_k^rgb, X_k^depth]    (2)

where X_k is the fusion feature map, X_k^rgb is the RGB feature map, and X_k^depth is the depth feature map.
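Below is a minimal sketch of this fusion step, assuming the "combining" of the conv5 outputs is channel-wise concatenation (consistent with the splicing-based fusion described in the abstract); the channel counts follow the hypothetical stream definition above.

```python
# Hypothetical fusion module: concatenate the two conv5 feature maps, then
# batch normalization, ReLU, and global average pooling (steps S3-S4).
import torch
import torch.nn as nn

class Fusion(nn.Module):
    def __init__(self, rgb_ch: int = 512, depth_ch: int = 512):
        super().__init__()
        self.bn = nn.BatchNorm2d(rgb_ch + depth_ch)
        self.relu = nn.ReLU(inplace=True)
        self.gap = nn.AdaptiveAvgPool2d(1)          # global average pooling

    def forward(self, x_rgb: torch.Tensor, x_depth: torch.Tensor) -> torch.Tensor:
        x = torch.cat([x_rgb, x_depth], dim=1)      # X_k = [X_k^rgb, X_k^depth]
        x = self.relu(self.bn(x))
        return self.gap(x).flatten(1)               # one fused vector per frame pair

fusion = Fusion()
fused = fusion(rgb_map, depth_map)                  # shape (1, 1024)
```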
Further, the step S5 of predicting the current pose through training with the LSTM neural network specifically includes:

performing time-sequence modeling of the image sequence with an LSTM neural network and predicting the current pose information. The LSTM neural network consists of a forget gate, an input gate and an output gate; through learning, it memorizes information useful for estimating the current pose and forgets information that is useless for this estimate. The forget gate controls the forgetting of useless information from the previous state, with the formula:

f_k = σ(W_f · [h_{k-1}, x_k] + b_f)    (3)

where f_k is the output of the forget gate, σ is the sigmoid function, W_f is the forget-gate weight, h_{k-1} is the hidden state at the previous moment, x_k is the input at the current moment, and b_f is the bias of the forget gate.

The input gate determines what information to add to the current state; it consists of an input selection layer i_k and a candidate layer C̃_k, with the formulas:

i_k = σ(W_i · [h_{k-1}, x_k] + b_i)    (4)

C̃_k = tanh(W_C · [h_{k-1}, x_k] + b_C)    (5)

where W_i is the input-gate weight, tanh is the hyperbolic tangent function, W_C is the candidate-layer weight, b_i is the bias of the selection layer, and b_C is the bias of the candidate layer.

The output gate decides what prediction to make, with the formula:

o_k = σ(W_o · [h_{k-1}, x_k] + b_o)    (6)

where W_o is the output-gate weight and b_o is the bias of the output gate.

Finally, a loss function is designed by minimizing the Euclidean distance between the real pose and the estimated pose:

L = (1/N) Σ_{i=1}^{N} ( ||p̂_i − p_i||² + w ||φ̂_i − φ_i||² )    (7)

where N is the number of samples, w is the weight coefficient balancing position and orientation, p̂_i and φ̂_i are the estimated position and orientation, and p_i and φ_i are the actual position and orientation.
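To make step S5 concrete, here is a hedged sketch of an LSTM pose-regression head together with the loss of equation (7); the hidden size, number of layers and the value of w are assumptions, and the 6-D pose layout [position p, orientation φ] is one common convention rather than a detail fixed by the patent.

```python
# Hypothetical LSTM head over the per-pair fused features, regressing a 6-DoF
# pose, plus the weighted Euclidean loss of equation (7).
import torch
import torch.nn as nn

class PoseLSTM(nn.Module):
    def __init__(self, feat_dim: int = 1024, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, 6)        # [p (3), phi (3)]

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(seq)               # seq: (batch, time, feat_dim)
        return self.fc(out)                   # pose estimate at every time step

def pose_loss(pred: torch.Tensor, target: torch.Tensor, w: float = 100.0) -> torch.Tensor:
    """Equation (7): mean of ||p_hat - p||^2 + w * ||phi_hat - phi||^2."""
    p_err = (pred[..., :3] - target[..., :3]).pow(2).sum(-1)
    phi_err = (pred[..., 3:] - target[..., 3:]).pow(2).sum(-1)
    return (p_err + w * phi_err).mean()
```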
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the pose estimation method based on the LSTM double-flow convolutional neural network described above.
An LSTM double-flow convolutional neural network pose estimation system based on the above method, comprising:

a preprocessing module: for preprocessing the color image and depth image acquired by an RGB-D camera, cascading two adjacent frames of the color image and of the depth image respectively, preprocessing the depth image with the minimum normal + depth method, and finally normalizing the color image and the depth image;

a feature extraction module: for inputting the preprocessed color image and depth image into the color stream and depth stream of the double-flow convolutional neural network, respectively, for feature extraction;

a fusion module: for fusing the RGB feature map output by the color stream and the depth feature map output by the depth stream to generate a new fusion feature map, and applying global average pooling to the newly generated fusion feature map;

a prediction module: for predicting the current pose through training with the LSTM neural network.
The invention has the advantages and beneficial effects as follows:
aiming at the problems that a visual odometer is sensitive to camera parameters and is greatly influenced by motion blur and insufficient light, the invention provides a convolutional neural network based on LSTM double flow, and the contour features extracted by depth flow supplement the color features extracted by color flow so as to improve the robustness of a pose estimation system in the motion blur and insufficient light environment.
Experiments on the public TUM dataset show that pose estimation is more robust in motion-blurred and poorly lit environments when it fuses the contour features extracted from the depth image. Compared with other pose estimation methods based on convolutional neural networks, the proposed model achieves smaller pose estimation errors and superior performance.
In the pose estimation method based on the LSTM double-flow convolutional neural network, step S2 inputs the preprocessed color image and depth image into the color stream and depth stream of the double-flow convolutional neural network, respectively, for feature extraction. The method provides a new double-flow convolutional neural network structure in which the color stream and the depth stream share the same structure of 5 convolutional layers, with ReLU activation units after the first four layers. By introducing depth features through this double-flow architecture, the method achieves higher accuracy and robustness than other pose regression systems based on convolutional neural networks, and performs particularly well in challenging environments.
According to claims 4-6, the method uses the double-flow convolutional neural network to extract color features and depth features, fuses them, and finally feeds the fused features into a long short-term memory recurrent neural network (LSTM) for time-sequence modeling to estimate the current pose. Pose estimation seeks temporal regularities in the image stream; the long short-term memory recurrent neural network can memorize previous states and find the association between the current moment and past moments, making it well suited to the pose regression problem. Common methods instead use a fully connected layer to predict pose information, which is better suited to object recognition and classification problems.
Drawings
FIG. 1 is a diagram of a pose estimation framework based on an LSTM dual-flow convolutional neural network in accordance with a preferred embodiment of the present invention;
FIG. 2 is a block diagram of an LSTM dual-flow convolutional neural network.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and specifically described below with reference to the drawings in the embodiments of the present invention. The described embodiments are only a few embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
s1, preprocessing a color image and a depth image, respectively cascading two adjacent frames of color images and depth images, further preprocessing the depth image by MND coding, and finally normalizing the color image and the depth image.
S2, inputting the preprocessed color image and depth image into the color stream and depth stream of the double-flow convolutional neural network, respectively, for feature extraction, then fusing the RGB feature map output by the color stream and the depth feature map output by the depth stream to generate a new fusion feature map. The double-flow convolutional neural network adopts a parallel structure in which each branch consists of five convolutional layers, and the first four convolutional layers of each branch are followed by a ReLU activation unit. The formula is as follows:
f(x)=max(0,x) (1)
where x is the input and f(x) is the output after passing through the ReLU unit.
The generated fusion feature is expressed as:

X_k = [X_k^rgb, X_k^depth]    (2)

where X_k is the fusion feature map, X_k^rgb is the RGB feature map, and X_k^depth is the depth feature map.
S3, predicting the current pose through training with an LSTM neural network. The LSTM neural network consists of a forget gate, an input gate and an output gate; through learning, it memorizes information useful for estimating the current pose and forgets information that is useless for this estimate. The forget gate controls the forgetting of useless information from the previous state, with the formula:

f_k = σ(W_f · [h_{k-1}, x_k] + b_f)    (3)

where f_k is the output of the forget gate, σ is the sigmoid function, W_f is the forget-gate weight, h_{k-1} is the hidden state at the previous moment, x_k is the input at the current moment, and b_f is the bias of the forget gate.
The input gate determines what information to add to the current state; it consists of an input selection layer i_k and a candidate layer C̃_k, with the formulas:

i_k = σ(W_i · [h_{k-1}, x_k] + b_i)    (4)

C̃_k = tanh(W_C · [h_{k-1}, x_k] + b_C)    (5)

where W_i is the input-gate weight, tanh is the hyperbolic tangent function, W_C is the candidate-layer weight, b_i is the bias of the selection layer, and b_C is the bias of the candidate layer.
The output gate decides what prediction to make, with the formula:

o_k = σ(W_o · [h_{k-1}, x_k] + b_o)    (6)

where W_o is the output-gate weight and b_o is the bias of the output gate.
Finally, a loss function is designed by minimizing the Euclidean distance between the real pose and the estimated pose:

L = (1/N) Σ_{i=1}^{N} ( ||p̂_i − p_i||² + w ||φ̂_i − φ_i||² )    (7)

where N is the number of samples, w is the weight coefficient balancing position and orientation, p̂_i and φ̂_i are the estimated position and orientation, and p_i and φ_i are the actual position and orientation.
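Tying the sketches above together, one hypothetical training step could look as follows; the Adam optimizer and learning rate are assumptions not stated in the patent, and color_stream, depth_stream, Fusion, PoseLSTM and pose_loss refer to the illustrative definitions given earlier, not to any implementation disclosed by the inventors.

```python
# Illustrative end-to-end training step; assumes the hypothetical modules
# sketched earlier (color_stream, depth_stream, Fusion, PoseLSTM, pose_loss).
import torch

fusion, pose_lstm = Fusion(), PoseLSTM(feat_dim=1024)
params = (list(color_stream.parameters()) + list(depth_stream.parameters())
          + list(fusion.parameters()) + list(pose_lstm.parameters()))
opt = torch.optim.Adam(params, lr=1e-4)       # optimizer choice is an assumption

def train_step(rgb_seq, depth_seq, gt_pose):
    # rgb_seq, depth_seq: (batch, time, 3, 960, 640) cascaded frame pairs;
    # gt_pose: (batch, time, 6) ground-truth [position, orientation].
    t = rgb_seq.shape[1]
    feats = [fusion(color_stream(rgb_seq[:, k]), depth_stream(depth_seq[:, k]))
             for k in range(t)]
    pred = pose_lstm(torch.stack(feats, dim=1))   # (batch, time, 6)
    loss = pose_loss(pred, gt_pose)               # equation (7)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```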
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include both persistent and non-persistent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element.
The above examples should be understood as illustrative only and not limiting the scope of the invention. Various changes and modifications to the present invention may be made by one skilled in the art after reading the teachings herein, and such equivalent changes and modifications are intended to fall within the scope of the invention as defined in the appended claims.

Claims (7)

1. The pose estimation method based on the LSTM double-flow convolutional neural network is characterized by comprising the following steps of:
s1, preprocessing a color image and a depth image acquired by an RGB-D camera, respectively cascading two adjacent frames of color images and depth images, preprocessing the depth images by adopting a MND (minimum normal+depth) method, and finally normalizing the color images and the depth images; s2, respectively inputting the preprocessed color image and the preprocessed depth image into a color flow and a depth flow of a double-flow convolutional neural network to perform feature extraction; s3, fusing the color feature map rgb feature map output by the color flow and the depth feature map depth feature map output by the depth flow to generate a new fused feature map fusion feature map; s4, carrying out global average pooling treatment on the newly generated fusion feature map; s5, predicting the current pose by training through an LSTM neural network;
the color image preprocessing specifically comprises the steps of cascading adjacent frames of color images to generate color images with 640 x 960 size; the depth image preprocessing is specifically that firstly, MND encoding processing is carried out on the depth image, and the width and the height of the depth image are scaled to be n x And n y Taking depth d as the third channel of the image, for scaled surface normal [ n ] x ,n y ,d]Satisfy the following requirementsAdjacent frames of the depth image are then concatenated to generate a 640 x 960 size depth image.
2. The pose estimation method based on the LSTM double-flow convolutional neural network according to claim 1, wherein step S2 inputs the preprocessed color image and depth image into the color stream and depth stream of the double-flow convolutional neural network, respectively, for feature extraction, specifically: a double-flow convolutional neural network architecture is adopted in which the color stream and the depth stream have identical structures, each composed of 5 convolutional layers that extract features at different levels of the image, with ReLU activation units after the first four layers; the preprocessed color image I_rgb serves as the input of the color stream and the preprocessed depth image I_depth as the input of the depth stream, and a color feature map and a depth structure feature map are obtained through the respective convolution operations.
3. The pose estimation method based on the LSTM double-flow convolutional neural network according to any one of claims 1-2, wherein the double-flow convolutional neural network adopts a parallel structure, each branch of which consists of five convolutional layers, the first four convolutional layers of each branch being followed by a ReLU activation unit, expressed as:
f(x)=max(0,x) (1)
where x is the input and f(x) is the output after passing through the ReLU unit.
4. The pose estimation method based on the LSTM double-flow convolutional neural network according to claim 3, wherein step S3 of fusing the RGB feature map output by the color stream and the depth feature map output by the depth stream to generate a new fusion feature map specifically includes: the feature maps output by conv5 in the two data-stream networks are combined to form a new fusion feature map, which is passed through batch normalization and a ReLU nonlinear activation unit before global average pooling; the generated fusion feature is expressed as:

X_k = [X_k^rgb, X_k^depth]    (2)

where X_k is the fusion feature map, X_k^rgb is the RGB feature map, and X_k^depth is the depth feature map.
5. The pose estimation method based on the LSTM double-flow convolutional neural network according to claim 4, wherein step S5 of predicting the current pose through training with the LSTM neural network specifically includes:

performing time-sequence modeling of the image sequence with an LSTM neural network and predicting the current pose information; the LSTM neural network consists of a forget gate, an input gate and an output gate, memorizing through learning the information useful for estimating the current pose and forgetting the information useless for this estimate; the forget gate controls the forgetting of useless information from the previous state, with the formula:

f_k = σ(W_f · [h_{k-1}, x_k] + b_f)    (3)

where f_k is the output of the forget gate, σ is the sigmoid function, W_f is the forget-gate weight, h_{k-1} is the hidden state at the previous moment, x_k is the input at the current moment, and b_f is the bias of the forget gate;

the input gate determines what information to add to the current state and consists of an input selection layer i_k and a candidate layer C̃_k, with the formulas:

i_k = σ(W_i · [h_{k-1}, x_k] + b_i)    (4)

C̃_k = tanh(W_C · [h_{k-1}, x_k] + b_C)    (5)

where W_i is the input-gate weight, tanh is the hyperbolic tangent function, W_C is the candidate-layer weight, b_i is the bias of the selection layer, and b_C is the bias of the candidate layer;

the output gate decides what prediction to make, with the formula:

o_k = σ(W_o · [h_{k-1}, x_k] + b_o)    (6)

where W_o is the output-gate weight and b_o is the bias of the output gate;

finally, a loss function is designed by minimizing the Euclidean distance between the real pose and the estimated pose:

L = (1/N) Σ_{i=1}^{N} ( ||p̂_i − p_i||² + w ||φ̂_i − φ_i||² )    (7)

where N is the number of samples, w is the weight coefficient balancing position and orientation, p̂_i and φ̂_i are the estimated position and orientation, and p_i and φ_i are the actual position and orientation.
6. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the pose estimation method based on the LSTM double-flow convolutional neural network according to any one of claims 1-5.
7. An LSTM double-flow convolutional neural network pose estimation system based on the method of any one of claims 1-5, comprising:

a preprocessing module: for preprocessing the color image and depth image acquired by an RGB-D camera, cascading two adjacent frames of the color image and of the depth image respectively, preprocessing the depth image with the minimum normal + depth method, and finally normalizing the color image and the depth image;

a feature extraction module: for inputting the preprocessed color image and depth image into the color stream and depth stream of the double-flow convolutional neural network, respectively, for feature extraction;

a fusion module: for fusing the RGB feature map output by the color stream and the depth feature map output by the depth stream to generate a new fusion feature map, and applying global average pooling to the newly generated fusion feature map;

a prediction module: for predicting the current pose through training with the LSTM neural network;

wherein the color image preprocessing specifically consists of cascading adjacent frames of the color image to generate a color image of size 640 × 960; the depth image preprocessing first applies MND encoding to the depth image: the horizontal and vertical components of the surface normal are scaled to n_x and n_y, and the depth d is taken as the third channel of the image, so that the scaled surface normal [n_x, n_y, d] satisfies n_x^2 + n_y^2 + d^2 = 1; adjacent frames of the depth image are then cascaded to generate a depth image of size 640 × 960.
CN202111181525.6A 2021-10-11 2021-10-11 Pose estimation method, system and medium based on LSTM double-flow convolutional neural network Active CN113838135B (en)

Priority Applications (1)

Application Number: CN202111181525.6A
Priority Date: 2021-10-11
Filing Date: 2021-10-11
Title: Pose estimation method, system and medium based on LSTM double-flow convolutional neural network

Applications Claiming Priority (1)

Application Number: CN202111181525.6A
Priority Date: 2021-10-11
Filing Date: 2021-10-11
Title: Pose estimation method, system and medium based on LSTM double-flow convolutional neural network

Publications (2)

Publication Number: CN113838135A (en), published 2021-12-24
Publication Number: CN113838135B (en), published 2024-03-19

Family

Family ID: 78968495

Family Applications (1)

Application Number: CN202111181525.6A (granted as CN113838135B)
Priority Date: 2021-10-11
Filing Date: 2021-10-11
Title: Pose estimation method, system and medium based on LSTM double-flow convolutional neural network
Status: Active

Country Status (1)

Country Link
CN (1) CN113838135B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114615183B (en) * 2022-03-14 2023-09-05 广东技术师范大学 Routing method, device, computer equipment and storage medium based on resource prediction
CN115577755A (en) * 2022-11-28 2023-01-06 中环服(成都)科技有限公司 Robot posture correction method, apparatus, computer device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163909A (en) * 2018-02-12 2019-08-23 北京三星通信技术研究有限公司 For obtaining the method, apparatus and storage medium of equipment pose
CN110473254A (en) * 2019-08-20 2019-11-19 北京邮电大学 A kind of position and orientation estimation method and device based on deep neural network
CN111796681A (en) * 2020-07-07 2020-10-20 重庆邮电大学 Self-adaptive sight estimation method and medium based on differential convolution in man-machine interaction
CN111833400A (en) * 2020-06-10 2020-10-27 广东工业大学 Camera position and posture positioning method
CN112819853A (en) * 2021-02-01 2021-05-18 太原理工大学 Semantic prior-based visual odometer method
WO2021098766A1 (en) * 2019-11-20 2021-05-27 北京影谱科技股份有限公司 Orb feature visual odometer learning method and device based on image sequence


Also Published As

Publication number Publication date
CN113838135A (en) 2021-12-24

Similar Documents

Publication Publication Date Title
CN111127513B (en) Multi-target tracking method
CN107369166B (en) Target tracking method and system based on multi-resolution neural network
Mao et al. Fire recognition based on multi-channel convolutional neural network
CN110781262B (en) Semantic map construction method based on visual SLAM
CN113838135B (en) Pose estimation method, system and medium based on LSTM double-flow convolutional neural network
CN111079780B (en) Training method for space diagram convolution network, electronic equipment and storage medium
Akai et al. Simultaneous pose and reliability estimation using convolutional neural network and Rao–Blackwellized particle filter
CN112489081B (en) Visual target tracking method and device
CN112749726B (en) Training method and device for target detection model, computer equipment and storage medium
CN110443849B (en) Target positioning method for double-current convolution neural network regression learning based on depth image
CN113570029A (en) Method for obtaining neural network model, image processing method and device
CN111754546A (en) Target tracking method, system and storage medium based on multi-feature map fusion
CN110969648A (en) 3D target tracking method and system based on point cloud sequence data
CN114091554A (en) Training set processing method and device
CN112258565B (en) Image processing method and device
Sharjeel et al. Real time drone detection by moving camera using COROLA and CNN algorithm
Yu et al. LiDAR-based localization using universal encoding and memory-aware regression
Panda et al. Kernel density estimation and correntropy based background modeling and camera model parameter estimation for underwater video object detection
Alvar et al. Mixture of merged gaussian algorithm using RTDENN
CN117058235A (en) Visual positioning method crossing various indoor scenes
CN111578956A (en) Visual SLAM positioning method based on deep learning
CN113128285A (en) Method and device for processing video
CN112529025A (en) Data processing method and device
Xia et al. Hybrid feature adaptive fusion network for multivariate time series classification with application in AUV fault detection
CN114372999A (en) Object detection method and device, electronic equipment and storage medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant