CN113989495A

CN113989495A - Vision-based pedestrian calling behavior identification method

Info

Publication number: CN113989495A
Application number: CN202111362421.5A
Authority: CN
Inventors: 连静; 王政皓; 李琳辉
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2021-11-17
Filing date: 2021-11-17
Publication date: 2022-01-28
Anticipated expiration: 2041-11-17
Also published as: CN113989495B

Abstract

The invention discloses a vision-based pedestrian calling behavior identification method, which comprises the following steps of: image preprocessing and intent inference. The invention adopts a computer vision method to accurately and efficiently identify the pedestrians with the taxi calling behavior from the image, realizes that the automatic taxi driving finds the passengers more efficiently, improves the use efficiency of the automatic taxi driving, and also improves the trip efficiency of the passengers. The invention adopts the spatial reasoning network to realize the reasoning of the pedestrian car-calling behavior, reduces the dependence on time dimension information, reduces the time characteristic extraction process compared with the traditional behavior recognition algorithm, can simplify the network and improve the real-time performance of behavior reasoning. The invention adopts a set of fusion rules with logical interpretability to realize the fusion of random forests and graph convolution networks, the characteristic of logical interpretability can improve the environmental adaptability and the behavior recognition precision of the algorithm, and the fusion algorithm can realize more stable and accurate reasoning on the pedestrian car-calling intention.

Description

A Vision-Based Pedestrian Car-hailing Behavior Recognition Method

技术领域technical field

本发明属于车辆智能化领域，尤其涉及一种自动驾驶出租车识别行人行为意图的方法。The invention belongs to the field of vehicle intelligence, and in particular relates to a method for recognizing pedestrian behavior intention by an automatic driving taxi.

背景技术Background technique

交通场景中的车辆识别行人的行为属于车辆智能化的范畴。准确有效的识别行人的召车意图可以帮助自动驾驶出租车在道路上快速寻找到有召车意图的行人，这对提高行人的出行效率和提高自动驾驶出租车的使用效率，避免交通拥堵具有重要意义。The behavior of vehicles in traffic scenes to identify pedestrians belongs to the category of vehicle intelligence. Accurate and effective identification of pedestrians' car-hailing intentions can help autonomous taxis quickly find pedestrians with car-hailing intentions on the road, which is important for improving pedestrians' travel efficiency, improving the use efficiency of autonomous taxis, and avoiding traffic congestion. significance.

行人召车行为识别是指利用计算机视觉的方法对交通场景中的行人进行分析，寻找具有召车意图的行人。交通场景具有高度的复杂性，交通参与者(包括行人、车辆、骑行者等)的数目和种类远高于其他应用场景，这增加了行为识别的难度。召车的行为与行人的其他行为(走路、跑步、骑行等)相比具有明显的随机性和瞬时性特点：首先，当前场景中的任何一个行人在任何时间都有可能转化成一个具有召车意图的人；另外，召车行为具有明显的瞬时的特性，司机判断一个人是否具有召车意图仅仅需要单独的一张图像就可以实现，而不需要考虑这张图像的前后连续几帧图像的信息。基于上述两个特点，传统的基于3DCNN(3D Convolutional Neural Network)和LSTM(Long Short Term Memory Network)的行为识别算法不能适用于具有瞬时特性的召车意图推理。行人的手势是表达行人意图的关键信息，而目前的大多数手势识别算法主要应用于室内的场景，且基于视觉的手势识别算法对图像中手部轮廓的分辨率要求较高，但智能车搭载的车载相机无法实现在复杂的交通场景中生成如此高质量的图像。Pedestrian car-hailing behavior recognition refers to the use of computer vision methods to analyze pedestrians in traffic scenes to find pedestrians with the intention of car-hailing. Traffic scenes are highly complex, and the number and types of traffic participants (including pedestrians, vehicles, cyclists, etc.) are much higher than other application scenarios, which increases the difficulty of behavior recognition. Compared with other behaviors of pedestrians (walking, running, cycling, etc.), the car-hailing behavior has obvious randomness and transient characteristics: First, any pedestrian in the current scene may be transformed into a car-hailing vehicle at any time. In addition, the car-hailing behavior has obvious instantaneous characteristics, and the driver only needs a single image to determine whether a person has the car-hailing intention, and does not need to consider the consecutive frames before and after the image. Information. Based on the above two characteristics, traditional behavior recognition algorithms based on 3DCNN (3D Convolutional Neural Network) and LSTM (Long Short Term Memory Network) cannot be applied to the reasoning of car-hailing intent with instantaneous characteristics. Pedestrian gestures are the key information to express the intention of pedestrians, and most of the current gesture recognition algorithms are mainly used in indoor scenes, and vision-based gesture recognition algorithms have high requirements on the resolution of the hand contour in the image, but smart cars are equipped with The in-vehicle cameras of 100% cannot achieve such high-quality images in complex traffic scenes.

发明内容SUMMARY OF THE INVENTION

为解决现有技术存在的上述问题，本发明要设计一种环境适应性强、识别精度高且基于视觉的行人召车行为识别方法，能够通过处理车载相机采集的图像，实时对图像中有召车意图行人的准确识别，从而帮助自动驾驶出租车更高效的发现乘客。In order to solve the above-mentioned problems existing in the prior art, the present invention aims to design a pedestrian car-calling behavior recognition method with strong environmental adaptability, high recognition accuracy and based on vision, which can process the images collected by the vehicle-mounted camera, and real-timely identify the calls in the images in real time. Accurate identification of vehicles and pedestrians to help autonomous taxis find passengers more efficiently.

为了实现上述目的，本发明的技术方案如下：一种基于视觉的行人召车行为识别方法，包括以下步骤：In order to achieve the above object, the technical solution of the present invention is as follows: a vision-based pedestrian car-hailing behavior recognition method, comprising the following steps:

A、图像预处理A. Image preprocessing

采用目标检测算法和人体关键点提取算法实现对图像的预处理，得到行人的检测框D以及每个检测框内所对应的行人的关键点参数K，在召车行为推理的过程中，人体的面部注意力是判断其是否具有召车意图的关键线索，在真实的场景中，行人召车的过程，行人将会对出租车具有高度的注意力。对面部注意力的推理，从两个方面进行，首先利用人体关键点检测中所检测到的面部关键点进行推理，以左耳关键点和右耳关键点的横坐标之差h_p为基准，以σ为放大系数，形成一个边长为σh_p的正方形框S作为面部区域；当左耳关键点和鼻关键点的横向距离h_f大于h_p，意味着行人的面部以相对侧面的角度正对出租车，即行人对车辆的注意力较小；当h_f小于h_p，将面部区域S输入到面部注意力深度网络中计算行人的面部注意力概率；面部注意力深度网络包括前部网络和后部网络，前部网络为特征提取网络，采用Resnet50作为基准网络，提取面部特征；后部网络为由全连接层组成的特征连接网络，实现将前部网络所提取的面部特征连接，得到全局特征，输出为面部注意力概率ρ_f；The target detection algorithm and the human body key point extraction algorithm are used to preprocess the image, and the pedestrian detection frame D and the pedestrian key point parameter K corresponding to each detection frame are obtained. Facial attention is a key clue to determine whether it has the intention of hailing a car. In a real scene, in the process of pedestrian hailing a car, pedestrians will have a high degree of attention to taxis. The reasoning of facial attention is carried out from two aspects. First, the facial key points detected in the human body key point detection are used for inference, and the difference h _p between the abscissas of the left ear key point and the right ear key point is used as the benchmark. Taking σ as the magnification factor, a square frame S with side length σh _p is formed as the face area; when the lateral distance h _f between the left ear key point and the nose key point is greater than h _p , it means that the pedestrian's face is at a positive angle from the opposite side. For taxis, that is, pedestrians pay less attention to vehicles; when h _f is less than h _p , input the face region S into the facial attention depth network to calculate the pedestrian's facial attention probability; the facial attention depth network includes the front network and the rear network, the front network is a feature extraction network, and Resnet50 is used as the benchmark network to extract facial features; the rear network is a feature connection network composed of fully connected layers, and the facial features extracted by the front network are connected. global feature, the output is the facial attention probability ρ _f ;

B、意图推理B. Intentional reasoning

采用随机森林算法和图卷积网络相结合进行行人的意图推理，具体步骤如下：Using the combination of random forest algorithm and graph convolutional network to infer pedestrian intent, the specific steps are as follows:

B1、采用随机森林算法推理人体关键点之间的连接角度和行人意图的关系，随机森林的输入是人体关键点的连接角度，为了防止出现过拟合的现象，选取一些与行人召车关系较强的关键点角度作为随机森林的输入，包括以颈关键点、左肩关键点、右肩关键点、左肘关键点、右肘关键点为顶点的连接角度，随机森林的输出为行人具有召车意图的概率ρ_r。B1. The random forest algorithm is used to infer the relationship between the connection angle between the key points of the human body and the intention of pedestrians. The input of the random forest is the connection angle of the key points of the human body. The strong key point angle is used as the input of random forest, including the connection angle of neck key point, left shoulder key point, right shoulder key point, left elbow key point, and right elbow key point as vertices. The output of random forest is that pedestrians have car-hailing Probability ρ _r of intent.

B2、采用图卷积网络推理人体关键点位置与行人意图的关系，图卷积网络的输入为人体图模型G(v,e)，其中，v为人体图模型的节点，即人体关键点，节点特征为关键点的坐标，e为人体图模型的边，即节点之间的连接。由于目标检测所获取的检测框D的尺寸不固定，为了降低检测框尺寸对意图推理的影响，采用坐标转换实现将人体关键点的图像坐标转化为以人体颈部关键点为原点的关联坐标：B2. The graph convolution network is used to infer the relationship between the position of the human body key points and the pedestrian intent. The input of the graph convolution network is the human body graph model G(v,e), where v is the node of the human body graph model, that is, the human body key point, The node features are the coordinates of the key points, and e is the edge of the human body graph model, that is, the connection between the nodes. Since the size of the detection frame D obtained by the target detection is not fixed, in order to reduce the influence of the size of the detection frame on the intention inference, coordinate transformation is used to convert the image coordinates of the human body key points into the associated coordinates with the human neck key point as the origin:

其中，x_inew和y_inew为第i个人体关键点转换后的横坐标和纵坐标；u_i与v_i为第i个人体关键点的转换前的横坐标和纵坐标；u₁与v₁为颈部关键点的横坐标和纵坐标。Among them, x _inew and y _inew are the abscissa and ordinate after the transformation of the key point of the ith person; u _i and v _i are the abscissa and ordinate before the transformation of the key point of the ith person; u ₁ and v ₁ are the abscissa and ordinate of the neck key point.

图卷积网络的过程为：The process of graph convolutional network is:

其中，

A是人体图模型的邻接矩阵；

是人体图模型的度矩阵；H^(l)是第l层图卷积的输出特征，H^(l+1)为第l+1层图卷积的输出特征；W^(l)为第l层图卷积的参数矩阵；

是激活函数；Z是图卷积网络的输出，即行人具有召车意图的概率ρ_g；H^(z)是最后一层图卷积的特征矩阵；W^(z)是最后一层图卷积的参数矩阵；readout(·)是由全连接层组成的图读出网络，实现将人体图模型中的所有节点特征聚合连接。in,

A is the adjacency matrix of the human figure model;

is the degree matrix of the human graph model; H ^(l) is the output feature of the lth layer graph convolution, H ^(l+1) is the output feature of the l+1th layer graph convolution; W ^(l) is the lth layer The parameter matrix of graph convolution;

is the activation function; Z is the output of the graph convolution network, that is, the probability ρ _g that the pedestrian has the intention to call a car; H ^(z) is the feature matrix of the last layer of graph convolution; W ^(z) is the last layer of graph convolution The parameter matrix of ; readout( ) is a graph readout network composed of fully connected layers, which realizes the aggregation and connection of all node features in the human graph model.

B3、算法融合B3. Algorithm fusion

通过随机森林和图卷积网络，分别得到行人具有召车意图的概率随机森林输出概率ρ_r和图卷积网络输出概率ρ_g，为了得到更稳定准确的意图推理，提出一套具有逻辑上可解释的融合规则实现将随机森林和图卷积网络融合，融合规则如下：Through random forest and graph convolutional network, the probability of the pedestrian's intention to call a car is obtained. The random forest output probability ρ _r and the graph convolution network output probability ρ _g are respectively obtained. In order to obtain more stable and accurate intention reasoning, a set of logically feasible The explained fusion rule realizes the fusion of random forest and graph convolutional network. The fusion rule is as follows:

其中，p是融合后行人具有召车意图的概率。当p_g＞0.5且p_r＞0.5或者p_g＜0.5且p_r＜0.5时，意味着随机森林算法和图卷积网络算法具有相同的推理结果，则融合概率p为

当p_g＞0.5且p_r＜0.5时，则意味着随机森林算法和图卷积网络算法具有不同的推理结果，图卷积网络的推理结果为行人具有召车意图，随机森林的推理结果为行人没有召车意图，为了得到一个更准确的推理结果，面部注意力概率p_f作为动态权重对p_g和p_r实现动态加权平均，即，当p_f＞0.5，意味着行人具有较高的召车概率，则赋予图卷积网络的输出一个更高的权重，而随机森林的输出赋予一个较低的权重；当p_f＜0.5时，则赋予随机森林的输出一个更高的权重，而赋予图卷积网络的输出一个更低的权重；当p_g＜0.5且p_r＞0.5时，则意味着另一种随机森林算法和图卷积网络算法具有不同的推理结果的情况，图卷积网络的推理结果为行人没有召车意图，而随机森林的推理结果为行人具有召车意图，当p_f＞0.5时，意味着随机森林的推理结果有更高的概率为正确的结果，则随机森林的输出赋予更高的权重，而图卷积网络的输出赋予更低的权重；反之，当p_f＜0.5时，则图卷积网络的输出赋予更高的权重，而随机森林的输出赋予更低的权重。Among them, p is the probability that the pedestrian has the intention to call a car after fusion. When p _g >0.5 and p _r >0.5 or p _g <0.5 and p _r <0.5, it means that the random forest algorithm and the graph convolutional network algorithm have the same inference result, then the fusion probability p is

When p _g > 0.5 and p _r < 0.5, it means that the random forest algorithm and the graph convolutional network algorithm have different inference results. The inference result of the graph convolution network is that the pedestrian has the intention to call a car, and the inference result of the random forest is Pedestrians have no intention to call a car. In order to obtain a more accurate inference result, the facial attention probability p _{f is} used as a dynamic weight to achieve a dynamic weighted average of p _g and p _r , that is, when p _f > 0.5, it means that the pedestrian has a higher car-hailing probability, the output of the graph convolutional network is given a higher weight, and the output of the random forest is given a lower weight; when p _f < 0.5, the output of the random forest is given a higher weight, and Give the output of the graph convolutional network a lower weight; when p _g < 0.5 and p _r > 0.5, it means that another random forest algorithm and the graph convolutional network algorithm have different inference results. The reasoning result of the product network is that the pedestrian has no intention to call a car, while the reasoning result of the random forest is that the pedestrian has the intention to call a car. When p _f > 0.5, it means that the reasoning result of the random forest has a higher probability of being the correct result, then The output of the random forest is given a higher weight, while the output of the graph convolutional network is given a lower weight; conversely, when p _f < 0.5, the output of the graph convolutional network is given a higher weight, and the output of the random forest is given a higher weight. assign lower weights.

与现有技术相比，本发明的有益效果和益处如下：Compared with the prior art, the beneficial effects and benefits of the present invention are as follows:

1、本发明采用计算机视觉的方法从图像中准确高效的识别出具有召车行为的行人，实现自动驾驶出租车更高效的发现乘客，提高了自动驾驶出租车的使用效率，也提高了乘客的出行效率。1. The present invention uses the method of computer vision to accurately and efficiently identify pedestrians with car-hailing behaviors from images, so that the autonomous taxi can find passengers more efficiently, improve the use efficiency of the autonomous taxi, and improve the passenger's safety. travel efficiency.

2、本发明采用了空间推理网络实现对行人召车行为的推理，减少了对时间维度信息的依赖，与传统的行为识别算法相比，减少了时间特征提取的过程，能够简化网络，提高行为推理的实时性。2. The present invention adopts a spatial reasoning network to realize the reasoning of pedestrian car-hailing behavior, which reduces the dependence on time dimension information. Compared with the traditional behavior recognition algorithm, the process of time feature extraction is reduced, which can simplify the network and improve the behavior. Real-time inference.

3、本发明采用了一套具有逻辑上可解释的融合规则，实现随机森林和图卷积网络的融合，逻辑上可解释的特性能够提高算法的环境适应性和行为识别的精度，实现融合算法对行人召车意图更稳定准确的推理。3. The present invention adopts a set of logically interpretable fusion rules to realize the fusion of random forest and graph convolutional network. The logically interpretable characteristics can improve the environmental adaptability of the algorithm and the accuracy of behavior recognition, and realize the fusion algorithm. More stable and accurate reasoning about pedestrian car-hailing intentions.

附图说明Description of drawings

图1是本发明的流程示意图。FIG. 1 is a schematic flow chart of the present invention.

图2是OpenPose提取的人体关键点示意图。Figure 2 is a schematic diagram of human key points extracted by OpenPose.

图3是面部注意力深度网络示意图。Figure 3 is a schematic diagram of the facial attention deep network.

图4是随机森林示意图。Figure 4 is a schematic diagram of a random forest.

图5是图卷积网络示意图。Figure 5 is a schematic diagram of a graph convolutional network.

具体实施方式Detailed ways

下面结合附图对本发明进行进一步的描述，如图1所示，一种基于视觉的行人召车行为识别方法，包括以下步骤：The present invention will be further described below in conjunction with the accompanying drawings. As shown in Figure 1, a visual-based pedestrian car-hailing behavior recognition method includes the following steps:

A、图像预处理A. Image preprocessing

采用Yolov5作为目标检测方法和人体关键点提取算法OpenPose实现对图像的预处理，得到行人的检测框D以及每个检测框内所对应的行人的关键点参数K，其中，关键点的参数如图2所示，关键点的序列与人体部位的对应关系为：Using Yolov5 as the target detection method and the human body key point extraction algorithm OpenPose to realize the preprocessing of the image, the pedestrian detection frame D and the pedestrian key point parameter K corresponding to each detection frame are obtained. The parameters of the key points are shown in the figure 2, the correspondence between the sequence of key points and the body parts is:

目标检测所提供的检测框可以提高人体关键点提取的精度。在召车意图推理的过程中，人体的面部注意力是判断其是否具有召车意图的关键线索，在真实的场景中，行人召车的过程，行人将会对出租车具有高度的注意力。对面部注意力的推理，本发明主要从两个方面进行，首先利用人体关键点检测中所检测到的面部关键点进行推理，以关键点16和关键点17的横坐标之差h_p为基准，以σ＝1.2为放大系数，形成一个边长为σh_p的正方形框S作为面部区域，当关键点16和关键点0之间的横向距离h_f大于h_p，意味着行人的面部以相对侧面的角度正对出租车，即行人对车辆的注意力较小，设置面部注意力概率ρ_f＝0.1，当h_f小于h_p，则很难通过上述判断行人是否注意到车辆，因此，将面部区域S输入到面部注意力深度网络中计算其面部注意力概率，面部注意力深度网络示意图如图3所示，主要有两部分组成，前部分为特征提取网络，采用Resnet50作为基准网络，提取面部特征，后部分为由全连接层组成的特征连接网络，将前部分所提取的特征连接，得到全局特征，输出为面部注意力概率ρ_f。The detection frame provided by target detection can improve the accuracy of human key point extraction. In the process of car-hailing intention reasoning, the facial attention of the human body is the key clue to determine whether it has the car-hailing intention. In the real scene, pedestrians will have a high degree of attention to the taxi during the process of hailing a car. For the reasoning of facial attention, the present invention is mainly carried out from two aspects. First, the facial key points detected in the human body key point detection are used for reasoning, and the difference h _p between the abscissas of the key point 16 and the key point 17 is used as the benchmark. , with σ=1.2 as the magnification factor, a square frame S with side length σh _p is formed as the face area. When the horizontal distance h _f between the key point 16 and the key point 0 is greater than h _p , it means that the pedestrian's face is relatively The angle of the side faces the taxi, that is, the pedestrian pays less attention to the vehicle, and the facial attention probability ρ _f = 0.1 is set. When h _f is less than h _p , it is difficult to judge whether the pedestrian pays attention to the vehicle through the above. Therefore, set the The facial area S is input into the facial attention depth network to calculate its facial attention probability. The schematic diagram of the facial attention depth network is shown in Figure 3. It mainly consists of two parts. The first part is the feature extraction network. Resnet50 is used as the benchmark network. Facial features, the latter part is a feature connection network composed of fully connected layers, and the features extracted in the former part are connected to obtain global features, and the output is the facial attention probability ρ _f .

B、意图推理B. Intentional reasoning

通过步骤A可以得到行人的目标检测框D，目标检测框内行人的人体关键点K以及对应行人的面部注意力概率ρ_f。本发明采用随机森林算法和图卷积网络相结合进行行人的意图推理。Through step A, the target detection frame D of the pedestrian, the human key points K of the pedestrian in the target detection frame and the facial attention probability ρ _f of the corresponding pedestrian can be obtained. The present invention adopts the combination of random forest algorithm and graph convolution network to perform pedestrian intention reasoning.

B1、随机森林主要推理人体关键点之间的连接角度和行人意图的关系。因此，随机森林的输入是人体关键点的连接角度，为了防止出现过拟合的现象，本发明中，选取一些与行人召车关系较强的关键点角度作为随机森林的输入，包括以关键点1、关键点2、关键点3、关键点5和关键点6为顶点的连接角度，随机森林的输出为行人具有召车意图的概率为ρ_r，所输入的关键点连接角度为以关键点1为顶点的∠318、∠6111、∠418、∠7111、∠618、∠617；以关键点2为顶点的∠123、∠124；以关键点5为顶点的∠156、∠157；以关键点3为顶点的∠234、∠438、∠134；以关键点6为顶点的∠567、∠7611、∠167。B1. Random forest mainly infers the relationship between the connection angle between human key points and pedestrian intent. Therefore, the input of the random forest is the connection angle of the key points of the human body. In order to prevent the phenomenon of overfitting, in the present invention, some key point angles that have a strong relationship with the pedestrian car-hailing are selected as the input of the random forest, including the key point 1. The key point 2, key point 3, key point 5 and key point 6 are the connection angles of the vertices. The output of the random forest is that the probability that the pedestrian has the intention to call a car is ρ _r , and the input key point connection angle is based on the key point. ∠318, ∠6111, ∠418, ∠7111, ∠618, ∠617 with 1 as the vertex; ∠123, ∠124 with the key point 2 as the vertex; ∠156, ∠157 with the key point 5 as the vertex; Point 3 is ∠234, ∠438, ∠134; the key point 6 is ∠567, ∠7611, ∠167.

随机森林的示意图如图4所示，随机森林是由N个独立的决策树组成，其中，N＝55，使用不同的数据集来训练不同的决策树，得到包含训练参数的相应的模型。每棵决策树都是一个特定的分类器，并根据输入数据做出独立的决策。决策聚合的过程采用多数投票法，输出为决策是召车意图的决策树的数目与决策树总数的比值，即行人具有召车意图的概率ρ_r。The schematic diagram of the random forest is shown in Figure 4. The random forest is composed of N independent decision trees, where N=55. Different data sets are used to train different decision trees to obtain the corresponding model including the training parameters. Each decision tree is a specific classifier and makes independent decisions based on the input data. The process of decision aggregation adopts the majority voting method, and the output is the ratio of the number of decision trees whose decision is the car-hailing intention to the total number of decision trees, that is, the probability ρ _r that the pedestrian has the car-hailing intention.

B2、图卷积网络主要推理人体关键点位置与行人意图的关系，因此，图卷积网络的输入为人体图模型G(v,e)，其中，v为人体图模型的节点，即人体关键点，节点特征为关键点的坐标，e为人体图模型的边，即节点之间的连接。由于目标检测所获取的检测框D的尺寸不固定，为了降低检测框尺寸对意图推理的影响，采用坐标转换实现将人体关键点的图像坐标转化为以关键点1为原点的关联坐标：B2. The graph convolutional network mainly infers the relationship between the position of the human body key points and the pedestrian intent. Therefore, the input of the graph convolutional network is the human body graph model G(v,e), where v is the node of the human body graph model, that is, the human body key point, the node feature is the coordinate of the key point, e is the edge of the human body graph model, that is, the connection between the nodes. Since the size of the detection frame D obtained by the target detection is not fixed, in order to reduce the influence of the size of the detection frame on the intention reasoning, coordinate transformation is used to convert the image coordinates of the key points of the human body into the associated coordinates with the key point 1 as the origin:

其中，x_inew、y_inew为第i个人体关键点的转换后的横坐标和纵坐标，u_i与v_i为第i个人体关键点的转换前的横坐标和纵坐标；u₁与v₁为关键点1的横坐标和纵坐标。Among them, x _inew and y _inew are the abscissa and ordinate after the transformation of the ith person's key point, u _i and v _i are the abscissa and ordinate before the transformation of the ith person's key point; u ₁ and v ₁ is the abscissa and ordinate of key point 1.

图卷积网络的示意图如图5所示，将人体图模型输入到图卷积网络中，人体的每个节点特征沿节点之间的边将节点特征传递到相邻的节点中，而每个节点也聚合来自相邻节点所传递的特征，实现节点特征沿边的传递和聚合，为增强模型的表达能力，在每一层图卷积后，采用激活函数RELU实现节点特征非线性映射，最后，采用全连层组成的图读出网络实现将所有节点特征的聚合连接，得到最终的分类结果。The schematic diagram of the graph convolutional network is shown in Figure 5. The human body graph model is input into the graph convolutional network, and each node feature of the human body transfers the node feature to the adjacent nodes along the edges between the nodes, and each Nodes also aggregate the features transmitted from adjacent nodes to realize the transfer and aggregation of node features along the edge. In order to enhance the expressive ability of the model, after each layer of graph convolution, the activation function RELU is used to realize the nonlinear mapping of node features. Finally, A graph readout network composed of fully connected layers is used to aggregate and connect the features of all nodes to obtain the final classification result.

图卷积网络的过程可以总结为：The process of graph convolutional network can be summarized as:

其中，

A是人体图模型的邻接矩阵；

是激活函数RELU；Z是图卷积网络的输出，即行人具有召车意图的概率ρ_g；H^(z)是最后一层图卷积的特征矩阵；W^(z)是最后一层图卷积的参数矩阵；readout(·)是由全连接层组成的图读出网络，能够实现将人体图模型中的所有节点特征聚合连接。in,

A is the adjacency matrix of the human figure model;

is the activation function RELU; Z is the output of the graph convolution network, that is, the probability ρ _g of the pedestrian with the intention to call a car; H ^(z) is the feature matrix of the last layer of graph convolution; W ^(z) is the last layer of graph volume The parameter matrix of the product; readout( ) is a graph readout network composed of fully connected layers, which can aggregate and connect all node features in the human graph model.

B3、算法融合B3. Algorithm fusion

通过随机森林和图卷积网络，分别得到行人具有召车意图的概率ρ_r和ρ_g，为了能够得到更稳定准确的意图推理，本发明提出一套具有逻辑上可解释的融合规则实现将随机森林和图卷积网络融合，融合规则如下：Through random forest and graph convolutional network, the probability ρ _r and ρ _g of pedestrians with car-calling intention are obtained respectively. In order to obtain more stable and accurate intention reasoning, the present invention proposes a set of logically interpretable fusion rules to realize random Fusion of forest and graph convolutional network, the fusion rules are as follows:

当p_g＞0.5且p_r＜0.5时，则意味着随机森林算法和图卷积网络算法具有不同的推理结果，图卷积网络的推理结果为行人具有召车意图，随机森林的推理结果为行人没有召车意图，为了得到一个更准确的推理结果，面部注意力p_f作为动态权重对p_g和p_r实现动态加权平均，即，当p_f＞0.5，意味着行人具有较高的召车概率，则赋予图卷积网络的输出一个更高的权重，而随机森林的输出赋予一个较低的权重；当p_f＜0.5时，则赋予随机森林的输出一个更高的权重，而赋予图卷积网络的输出一个更低的权重；当p_g＜0.5且p_r＞0.5时，则意味着另一种随机森林算法和图卷积网络算法具有不同的推理结果的情况，图卷积网络的推理结果为行人没有召车意图，而随机森林的推理结果为行人具有召车意图，当p_f＞0.5时，意味着随机森林的推理结果有更高的概率为正确的结果，则随机森林的输出赋予更高的权重，而图卷积网络的输出赋予更低的权重；反之，当p_f＜0.5时，则图卷积网络的输出赋予更高的权重，而随机森林的输出赋予更低的权重。Among them, p is the probability that the pedestrian has the intention to call a car after fusion. When p _g >0.5 and p _r >0.5 or p _g <0.5 and p _r <0.5, it means that the random forest algorithm and the graph convolutional network algorithm have the same inference result, then the fusion probability p is

When p _g > 0.5 and p _r < 0.5, it means that the random forest algorithm and the graph convolutional network algorithm have different inference results. The inference result of the graph convolution network is that the pedestrian has the intention to call a car, and the inference result of the random forest is Pedestrians have no intention to call a car. In order to obtain a more accurate inference result, the facial attention p _{f is} used as a dynamic weight to achieve a dynamic weighted average of p _g and p _r , that is, when p _f > 0.5, it means that the pedestrian has a higher calling car probability, the output of the graph convolutional network is given a higher weight, and the output of the random forest is given a lower weight; when p _f < 0.5, the output of the random forest is given a higher weight, and the output of the random forest is given a lower weight; The output of the graph convolutional network has a lower weight; when p _g <0.5 and p _r >0.5, it means that another random forest algorithm and the graph convolutional network algorithm have different inference results. The inference result of the network is that the pedestrian has no intention to call a car, while the inference result of the random forest is that the pedestrian has the intention to call a car. When p _f > 0.5, it means that the inference result of the random forest has a higher probability of being the correct result, then the random forest The output of the forest is given a higher weight, while the output of the graph convolutional network is given a lower weight; conversely, when p _f < 0.5, the output of the graph convolutional network is given a higher weight, and the output of the random forest is given a higher weight. lower weight.

本发明不局限于本实施例，任何在本发明披露的技术范围内的等同构思或者改变，均列为本发明的保护范围。The present invention is not limited to this embodiment, and any equivalent ideas or changes within the technical scope disclosed in the present invention are included in the protection scope of the present invention.

Claims

1. a vision-based pedestrian car-hailing behavior recognition method, is characterized in that: comprise the following steps:

A. Image preprocessing

The target detection algorithm and the human body key point extraction algorithm are used to preprocess the image, and the pedestrian detection frame D and the pedestrian key point parameter K corresponding to each detection frame are obtained. Facial attention is a key clue to determine whether it has the intention of hailing a car. In a real scene, in the process of pedestrian hailing a car, pedestrians will have a high degree of attention to the taxi; the reasoning of facial attention, from two aspects First, use the facial key points detected in the human body key point detection to infer, take the difference h _p between the abscissas of the left ear key point and the right ear key point as the benchmark, and use σ as the amplification factor to form a side length of The square frame S of σh _p is used as the face area; when the lateral distance h _f between the left ear key point and the nose key point is greater than h _p , it means that the pedestrian's face is facing the taxi at the opposite side angle, that is, the pedestrian's attention to the vehicle smaller; when h _f is less than h _p , input the face region S into the facial attention depth network to calculate the pedestrian's facial attention probability; the facial attention depth network includes the front network and the back network, and the front network is the feature extraction The network uses Resnet50 as the benchmark network to extract facial features; the rear network is a feature connection network composed of fully connected layers, which realizes the connection of the facial features extracted by the front network to obtain global features, and the output is the facial attention probability ρ _f ;

B. Intentional reasoning

Using the combination of random forest algorithm and graph convolutional network to infer pedestrian intent, the specific steps are as follows:

B1. The random forest algorithm is used to infer the relationship between the connection angle between the key points of the human body and the intention of pedestrians. The input of the random forest is the connection angle of the key points of the human body. The strong key point angle is used as the input of random forest, including the connection angle of neck key point, left shoulder key point, right shoulder key point, left elbow key point, and right elbow key point as vertices. The output of random forest is that pedestrians have car-hailing the probability of intent ρ _r ;

B2. The graph convolution network is used to infer the relationship between the position of the human body key points and the pedestrian intent. The input of the graph convolution network is the human body graph model G(v,e), where v is the node of the human body graph model, that is, the human body key point, The node features are the coordinates of the key points, and e is the edge of the human body graph model, that is, the connection between the nodes; since the size of the detection frame D obtained by the target detection is not fixed, in order to reduce the influence of the size of the detection frame on the intention reasoning, the coordinates are used. The transformation realizes the transformation of the image coordinates of the key points of the human body into the associated coordinates with the key points of the neck of the human body as the origin:

Among them, x _inew and y _inew are the abscissa and ordinate after the transformation of the key point of the ith person; u _i and v _i are the abscissa and ordinate before the transformation of the key point of the ith person; u ₁ and v ₁ are the abscissa and ordinate of the neck key point;

The process of graph convolutional network is:

in,

A is the adjacency matrix of the human figure model;

is the activation function; Z is the output of the graph convolution network, that is, the probability ρ _g that the pedestrian has the intention to call a car; H ^(z) is the feature matrix of the last layer of graph convolution; W ^(z) is the last layer of graph convolution The parameter matrix of ; readout( ) is a graph readout network composed of fully connected layers, which realizes the aggregation and connection of all node features in the human graph model;

B3. Algorithm fusion

Through random forest and graph convolutional network, the probability of the pedestrian's intention to call a car is obtained. The random forest output probability ρ _r and the graph convolution network output probability ρ _g are respectively obtained. In order to obtain more stable and accurate intention reasoning, a set of logically feasible The explained fusion rule realizes the fusion of random forest and graph convolutional network. The fusion rule is as follows:

Among them, p is the probability that the pedestrian has the intention to call a car after fusion; when p _g >0.5 and p _r >0.5 or p _g <0.5 and p _r <0.5, it means that the random forest algorithm and the graph convolutional network algorithm have the same Inference result, then the fusion probability p is