CN111062311A

CN111062311A - Pedestrian gesture recognition and interaction method based on depth-level separable convolutional network

Info

Publication number: CN111062311A
Application number: CN201911281009.3A
Authority: CN
Inventors: 秦文虎; 张仕超; 孙立博; 张哲�; 平鹏
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2019-12-13
Filing date: 2019-12-13
Publication date: 2020-04-24
Anticipated expiration: 2039-12-13
Also published as: CN111062311B

Abstract

The invention relates to a pedestrian gesture recognition and interaction method based on a depth-level separable convolutional network, comprising: collecting images containing pedestrians through a front-view camera system installed on a vehicle; inputting the images into a depth-level separable convolutional network to detect pedestrian bounding boxes; and inputting the image of each bounding-box region into a gesture recognition network, which outputs a feature map of the pedestrian region. The gesture recognition network extracts features through depth-level separable convolutional layers, predicts the information of 12 human body joint points and the 12 corresponding offset vectors at each point of the output feature map, and finally recognizes the pedestrian gesture by classifying the joint points. Based on the recognized pedestrian gestures, combined with gesture priorities, the vehicle adopts the most conservative strategy to make decisions. Because the model is implemented with depth-level separable convolution, its scale is reduced severalfold, and detection can be realized on low-power mobile terminals such as smartphones.

Description

Pedestrian gesture recognition and interaction method based on depth-level separable convolutional network

Technical Field

The invention relates to a pedestrian gesture recognition and interaction technology based on a depth-level separable convolutional network, and belongs to the technical field of advanced automobile driver assistance.

Background

Driving environment perception is an important function of advanced driver assistance systems (ADAS). Pedestrians, as an important component of public traffic scenarios, have a significant impact on vehicle driving decisions. Currently, most research focuses on how to make autonomous vehicles drive efficiently and safely, and there is a lack of research on interaction with pedestrians. Therefore, as an important part of driving environment perception, recognizing pedestrian gestures and interacting with pedestrians is urgently needed.

Currently, there are two main approaches to recognizing pedestrian gestures. The first is based on traditional statistical learning and depends on complicated feature engineering to obtain the gesture information of the pedestrian. The second uses deep learning: image information is extracted by a convolutional network, and a suitable loss function is designed on the output feature map to train the model, so that the pedestrian gesture is finally recognized. Although the traditional statistical learning approach requires little computation and is simple to implement, its recognition accuracy is poor because the feature engineering is too complex; although models based on deep convolutional networks achieve high recognition accuracy, most of them need a high-performance GPU to run in real time.

Chinese patent application publication No. CN107423679A proposes a pedestrian intention detection method and system, the method comprising: arranging a distance sensor to collect target form data in an observation area; acquiring track information of each target based on the existing state information of the target; and judging the action intention of each target according to its movement track and spatial information. That method only predicts the walking track of the pedestrian and does not achieve interaction between the pedestrian and the vehicle. In addition, Chinese patent application publication No. CN104915628A proposes a pedestrian intention detection model for automated vehicles, the method including: acquiring the basic scene elements of the traffic scene around a pedestrian that are related to the pedestrian's movement intention; analyzing, based on the basic scene elements and three-dimensional (3D) distance information of the pedestrian over time, the relationship between the state changes of the walking pedestrian and each surrounding basic scene element; establishing a context correlation model between the pedestrian and all the surrounding basic scene elements by using the obtained relationship; and predicting the next motion state of the pedestrian with the established context correlation model, based on the current scene elements related to the pedestrian that are obtained in real time, so as to generate a prediction of the pedestrian's next motion. That method also involves no interaction between pedestrians and vehicles, needs to identify a large amount of additional scene information and 3D information, requires a very large amount of computation, and does not indicate how to handle multiple pedestrians present at the same time.

Disclosure of Invention

The technical problem to be solved by the invention is as follows:

The invention provides a pedestrian gesture recognition and interaction method based on a depth-level separable convolutional network, and aims to solve the problems of a large model computation load, low recognition speed, and poor pedestrian-vehicle interaction when an autonomous vehicle recognizes and responds to pedestrian gestures.

The invention adopts the following technical scheme for solving the technical problems:

The invention provides a pedestrian gesture recognition and interaction method based on a depth-level separable convolutional network, comprising the following steps:

step one, collecting an image containing a pedestrian;

step two, inputting the image into a depth-level separable convolutional network, detecting a pedestrian bounding box, inputting the image of the bounding box region into a gesture recognition network, and outputting a feature map of the pedestrian region;

step three, calculating joint point coordinates and classifying the joint point coordinates to obtain a gesture recognition result;

step four, sorting the gestures by priority;

step five, obtaining the final interaction decision of the moving vehicle according to the meaning of the gesture with the highest priority.

In the pedestrian gesture recognition and interaction method based on the depth-level separable convolutional network as described above, further, the depth-level separable convolutional neural network in step two specifically comprises:

step 2.1, depthwise convolution;

step 2.2, batch normalization;

step 2.3, ReLU activation;

step 2.4, pointwise convolution;

step 2.5, batch normalization;

step 2.6, ReLU activation.
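The six sub-steps above can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the function names, the "valid" padding with stride 1, and a batch normalization that uses the statistics of the single input without learned scale and shift are all assumptions introduced here.

```python
import math

def depthwise_conv(x, kernels):
    """x: [C][H][W], kernels: [C][k][k]. One kernel per input channel,
    'valid' padding and stride 1 (both assumptions of this sketch)."""
    out = []
    for ch, kern in zip(x, kernels):
        k = len(kern)
        h_out, w_out = len(ch) - k + 1, len(ch[0]) - k + 1
        out.append([[sum(ch[i + a][j + b] * kern[a][b]
                         for a in range(k) for b in range(k))
                     for j in range(w_out)] for i in range(h_out)])
    return out

def pointwise_conv(x, weights):
    """weights: [C_out][C_in]. A 1 x 1 convolution that mixes channels
    independently at every spatial position."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wgt * x[c][i][j] for c, wgt in enumerate(row))
              for j in range(w)] for i in range(h)] for row in weights]

def batch_norm(x, eps=1e-5):
    """Per-channel normalization from this input's own statistics.
    Assumption: no learned scale/shift and no running averages."""
    out = []
    for ch in x:
        vals = [v for row in ch for v in row]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        out.append([[(v - mean) / math.sqrt(var + eps) for v in row]
                    for row in ch])
    return out

def relu(x):
    return [[[max(0.0, v) for v in row] for row in ch] for ch in x]

def separable_block(x, dw_kernels, pw_weights):
    x = relu(batch_norm(depthwise_conv(x, dw_kernels)))      # steps 2.1 to 2.3
    return relu(batch_norm(pointwise_conv(x, pw_weights)))   # steps 2.4 to 2.6
```

With a 2-channel 4 × 4 input, 3 × 3 depthwise kernels, and a 3 × 2 pointwise weight matrix, the block returns a 3-channel 2 × 2 output whose values are all non-negative because of the final ReLU.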

In the pedestrian gesture recognition and interaction method based on the depth-level separable convolutional network as described above, further, each feature point in the feature map of step two comprises the probabilities that each of the 12 human body joint points is present at that feature point, together with the offset vector of each joint point at that point.

In the pedestrian gesture recognition and interaction method based on the depth-level separable convolutional network as described above, further, a depth-level separable convolutional structure is adopted in step two to simplify the model used for joint point classification.

In the pedestrian gesture recognition and interaction method based on the depth-level separable convolutional network as described above, further, the specific steps of classifying the joint points in step three comprise:

step 3.1, calculating the coordinates of the joint points: using the confidences in the human body joint point distribution feature maps contained in the feature points obtained in step two, together with the offset vector feature maps of the corresponding points, the point with the highest confidence in each feature map is found to determine the joint point type, and the joint point position is then obtained from the offset vector, so that the complete information of the human body joint points is obtained;

step 3.2, normalization: after the coordinates of the human body joint points are obtained, the midpoint of the line connecting the left shoulder and the right shoulder is taken as the center, the coordinates of this center point are subtracted from all the joint points, and normalization is then performed;

step 3.3, classification: the normalized data are classified by a support vector machine or a single-layer fully connected network to obtain the final pedestrian gesture recognition result.
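Step 3.1 can be illustrated with the following sketch. The source does not specify how a feature-map cell maps back to image coordinates, so the `stride` parameter, the (dx, dy) pixel convention for offsets, and the function name `decode_joints` are assumptions made purely for illustration.

```python
def decode_joints(heatmaps, offsets, stride=8):
    """heatmaps: [K][H][W] confidences, one map per joint type.
    offsets:  [K][H][W] of (dx, dy) pairs in pixels (assumed convention).
    Returns one (x, y, confidence) tuple per joint type."""
    joints = []
    for k, hm in enumerate(heatmaps):
        # Find the cell with the highest confidence in this joint's map.
        conf, i, j = max(
            ((hm[i][j], i, j)
             for i in range(len(hm)) for j in range(len(hm[0]))),
            key=lambda t: t[0],
        )
        # Refine the coarse cell position with the offset stored at that cell.
        dx, dy = offsets[k][i][j]
        joints.append((j * stride + dx, i * stride + dy, conf))
    return joints
```

The index of each heat map plays the role of the joint point type, and the offset vector supplies the sub-cell correction that a coarse feature map cannot express on its own.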

In the pedestrian gesture recognition and interaction method based on the depth-level separable convolutional network as described above, further, in step five, when a plurality of pedestrians around the vehicle are detected making different gestures at the same time, action decisions are made by adopting the most conservative strategy according to the different priorities of the pedestrians' gestures. When a plurality of pedestrians appear in front of the vehicle at the same time, the model needs to recognize the gestures of all of them simultaneously; after the gesture information of the pedestrians is obtained, the gestures are sorted by priority, and the most conservative strategy is then adopted in response. For example, if some pedestrians require the vehicle to decelerate and others require it to stop, the stopping strategy is executed preferentially. This ensures traffic safety to the greatest possible extent.

The model updates the pedestrian state in the visual field in time, and when no pedestrian exists in the visual field or the gestures of all pedestrians do not require the vehicle to give way, the vehicle enters a normal driving state.
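The priority-based decision can be sketched as follows. The gesture names, the numeric priorities, and the action labels are hypothetical; the source states only that a stop request takes precedence over a deceleration request and that the vehicle drives normally when no pedestrian asks it to give way.

```python
# Hypothetical gesture labels and priorities; higher means more conservative.
GESTURE_PRIORITY = {"no_request": 0, "slow_down": 1, "stop": 2}

# Hypothetical mapping from the winning gesture to a vehicle action.
ACTION = {
    "no_request": "drive_normally",
    "slow_down": "decelerate",
    "stop": "stop",
}

def decide(gestures):
    """gestures: one recognized gesture label per pedestrian in view."""
    if not gestures:
        # No pedestrian in the field of view: normal driving state.
        return "drive_normally"
    # Sort by priority and act on the highest-priority (most conservative) gesture.
    ranked = sorted(gestures, key=GESTURE_PRIORITY.__getitem__, reverse=True)
    return ACTION[ranked[0]]
```

For example, `decide(["slow_down", "stop", "no_request"])` returns `"stop"`, while `decide([])` and `decide(["no_request"])` both return `"drive_normally"`.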

Compared with the prior art, the invention adopting the technical scheme has the following technical effects:

Because the method is implemented with a depth-level separable convolution model, its scale is reduced severalfold compared with a traditional deep learning model, it does not require special hardware or GPU devices, and the application cost is reduced. At the same time, the recognition accuracy is maintained, which greatly widens the range of application scenarios. The technical scheme provided by the invention can recognize pedestrian gesture information in real time on low-power mobile devices such as mobile phones. After the information is recognized, the vehicle and the pedestrian interact effectively. In addition, for a scene with a plurality of pedestrians in front of the vehicle, the model adopts the most conservative strategy to make a decision according to the priority of the pedestrian gestures, ensuring traffic safety to the greatest possible extent.

Drawings

FIG. 1 is a schematic diagram of a depth-level separable convolutional network;

FIG. 2 is a schematic of the process of the present invention.

Detailed Description

The technical scheme of the invention is further explained in detail by combining the attached drawings:

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

The invention provides a pedestrian gesture recognition and interaction method based on a depth-level separable convolutional network. FIG. 2 is a schematic of the process of the present invention. As shown in FIG. 2, the method comprises the following steps:

A front image is first captured by a camera mounted at the front of the vehicle. The video data collected by the forward-looking camera used in the invention has a resolution of 1280 × 720 at 60 FPS. The video frames are color images containing RGB three-channel color information, represented by a tensor of dimensions (1280, 720, 3); each element of the tensor is an integer in the range [0, 255].

The image is then input into a depth-level separable convolutional neural network to detect pedestrian bounding boxes. The invention uses the depth-level separable convolution structure to split the traditional convolution into two steps, a depthwise convolution and a pointwise convolution; this split reduces the size of the model severalfold while preserving its recognition performance. FIG. 1 is a schematic diagram of a depth-level separable convolutional network. As shown in FIG. 1, this structure divides the ordinary convolution operation into a depthwise convolution and a pointwise convolution. The depthwise convolution applies a different convolution kernel to each input channel, that is, one convolution kernel corresponds to one input channel; the pointwise convolution is an ordinary convolution, except that it uses a 1 × 1 convolution kernel. A feature map is extracted by cascading a plurality of depth-level separable convolution modules, and the pedestrian bounding box is obtained from the feature map.
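The size reduction from splitting a convolution into depthwise and pointwise steps can be checked with simple arithmetic. The sketch below counts weights only (bias terms are ignored, which is an assumption of this illustration), and the function names are introduced here and do not come from the source.

```python
def standard_conv_params(k, c_in, c_out):
    # Every output channel has its own k x k kernel over all input channels.
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    depthwise = k * k * c_in   # one k x k kernel per input channel
    pointwise = c_in * c_out   # 1 x 1 kernels that mix the channels
    return depthwise + pointwise

def reduction_ratio(k, c_in, c_out):
    # Algebraically equal to 1 / c_out + 1 / (k * k).
    return separable_conv_params(k, c_in, c_out) / standard_conv_params(k, c_in, c_out)
```

For a 3 × 3 kernel mapping 64 input channels to 128 output channels, the standard convolution needs 73,728 weights and the separable version 8,768, roughly 12% of the original. The layer sizes in this example are illustrative and are not taken from the source.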

The obtained pedestrian region image is then input into the gesture recognition network. A feature extraction network for human body joint points is constructed by cascading a plurality of depth-level separable convolution modules. The feature map output by the pedestrian gesture recognition network comprises S × 36 features, where S represents the size of the output feature map and each feature point is a feature vector of 36 values. These 36 values contain the probabilities that each of the 12 human body joint points is present at the feature point, together with the offset vector of each joint point at that point. The coordinates of the pedestrian's human body joint points are obtained by combining the probability feature maps and the offset vector maps.
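The count of 36 values per feature point is consistent with 12 probabilities plus 12 two-dimensional offset vectors (12 + 12 × 2 = 36). The sketch below splits one such vector; the ordering used here (probabilities first, then interleaved dx, dy pairs) is a hypothetical layout, since the source does not state how the 36 values are arranged.

```python
NUM_JOINTS = 12

def split_feature_vector(vec):
    """Split a 36-value feature vector into 12 joint probabilities and
    12 (dx, dy) offset vectors. The layout is an assumption of this sketch."""
    if len(vec) != 3 * NUM_JOINTS:
        raise ValueError("expected 36 values per feature point")
    probs = list(vec[:NUM_JOINTS])
    flat = vec[NUM_JOINTS:]
    offsets = [(flat[2 * k], flat[2 * k + 1]) for k in range(NUM_JOINTS)]
    return probs, offsets
```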

After the coordinates of the human body joint points are obtained, the midpoint of the line connecting the left shoulder and the right shoulder is taken as the center, the coordinates of this center point are subtracted from all the joint points, and normalization is performed. Finally, the normalized data are classified by a support vector machine or a single-layer fully connected network to obtain the final pedestrian gesture recognition result.
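The shoulder-centered normalization can be sketched as below. The source does not say which scale is used for the normalization step, so dividing by the shoulder width is an assumption, as are the joint indices chosen for the left and right shoulders.

```python
import math

def normalize_joints(joints, left_shoulder=0, right_shoulder=1):
    """joints: list of (x, y) coordinates, one per joint point.
    The shoulder indices are hypothetical defaults."""
    lx, ly = joints[left_shoulder]
    rx, ry = joints[right_shoulder]
    # Center: midpoint of the line connecting the two shoulders.
    cx, cy = (lx + rx) / 2, (ly + ry) / 2
    # Assumed scale: shoulder width (falls back to 1.0 if the shoulders coincide).
    scale = math.hypot(lx - rx, ly - ry) or 1.0
    return [((x - cx) / scale, (y - cy) / scale) for x, y in joints]
```

Centering on the shoulders removes the pedestrian's position in the image, and dividing by a body-relative length removes the dependence on distance from the camera, so the classifier that follows sees only the pose itself.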

In this step, the gesture recognition network uses the depth-level separable convolution structure to simplify the model, and the final gesture classification result is obtained with a support vector machine or a fully connected layer.

When a plurality of pedestrians appear in front of the vehicle at the same time, the model needs to recognize the gestures of all of them simultaneously; after the gesture information of the pedestrians is obtained, the gestures are sorted by priority, and the most conservative strategy is then adopted in response. For example, if some pedestrians require the vehicle to decelerate and others require it to stop, the stopping strategy is executed preferentially. This ensures traffic safety to the greatest possible extent.

When no pedestrian is in front of the vehicle or no extra request is made to the vehicle by the pedestrian gesture in the field of view, the vehicle enters a normal driving state.

The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A pedestrian gesture recognition and interaction method based on a depth-level separable convolutional network, characterized in that it comprises the following steps:

Step 1: Collect images containing pedestrians;

Step 2: Input the image into the depth-level separable convolutional network, detect the pedestrian bounding box, input the image of the bounding box area into the gesture recognition network, and output the feature map of the pedestrian area;

Step 3: Calculate the joint point coordinates and classify the joint point coordinates to obtain a gesture recognition result;

Step 4: Sort the gestures by priority;

Step 5: Obtain the final interaction decision of the moving vehicle according to the meaning of the gesture with the highest priority.

2. The pedestrian gesture recognition and interaction method based on a depth-level separable convolutional network according to claim 1, characterized in that the depth-level separable convolutional neural network in step 2 specifically comprises:

Step 2.1, depthwise convolution;

Step 2.2, batch normalization;

Step 2.3, ReLU activation;

Step 2.4, pointwise convolution;

Step 2.5, batch normalization;

Step 2.6, ReLU activation.

3. The pedestrian gesture recognition and interaction method based on a depth-level separable convolutional network according to claim 1, characterized in that each feature point in the feature map of step 2 comprises the probabilities that each of the 12 human body joint points is present at that feature point and the offset vector of each joint point at that point.

4. The pedestrian gesture recognition and interaction method based on a depth-level separable convolutional network according to claim 1, characterized in that step 2 adopts a depth-level separable convolution structure to simplify the model used for the classification of joint points.

5. The pedestrian gesture recognition and interaction method based on a depth-level separable convolutional network according to claim 4, characterized in that the specific steps of classifying joint points in step 3 comprise:

Step 3.1. Calculate the coordinates of the joint points: using the confidences in the human joint point distribution feature maps contained in the feature points obtained in step 2, combined with the offset vector feature maps of the corresponding points, find the point with the highest confidence in each feature map to determine the joint point category, and then obtain the joint point position from the offset vector, so as to obtain the complete information of the human joint points;

Step 3.2. Normalization: after obtaining the coordinates of the joint points of the human body, take the midpoint of the line connecting the left and right shoulders as the center, subtract the coordinates of the center point from all the joint points, and then normalize them;

Step 3.3. Classification: use a support vector machine or a single-layer fully connected network to classify the normalized data to obtain the final pedestrian gesture recognition result.

6. The pedestrian gesture recognition and interaction method based on a depth-level separable convolutional network according to claim 1, characterized in that, in step 5, when multiple pedestrians around the vehicle are detected making different gestures at the same time, action decisions are made using the most conservative strategy according to the different priorities of the pedestrian gestures.