CN114332919A - A pedestrian detection method, device and terminal device based on multi-spatial relationship perception - Google Patents
Pedestrian detection method, device and terminal device based on multi-spatial relationship perception
- Publication number: CN114332919A (application CN202111510823.5A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- Y02T10/40 — Engine management systems (Y: general tagging of new technological developments; Y02T: climate change mitigation technologies related to transportation; Y02T10/00: road transport of goods or passengers; Y02T10/10: internal combustion engine [ICE] based vehicles)
Abstract
The invention discloses a pedestrian detection method, device and terminal device based on multi-spatial relationship perception. The method comprises: step 1, collecting a pedestrian image data set and resizing the images to a fixed size for model training; step 2, adopting the YOLOX detection framework, feeding the images into the framework model, and first applying data augmentation; step 3, feeding the augmented images into the Focus module, which slices each image by pixel parity into four sub-images and concatenates them along the channel dimension; step 4, feeding the concatenated result into the backbone network of the YOLOX detection framework, to which three branches are connected; and step 5, each branch comprising two parts, a multi-spatial relationship perception module and a detection head, where the multi-spatial relationship perception module effectively fuses global and local information by combining the relationships between features in different spatial dimensions, yielding a multi-spatial relationship-aware feature map. The method attends to global information while also extracting local information and fuses the two effectively, thereby obtaining more discriminative feature information and improving pedestrian detection performance.
Description
Technical Field
The invention relates to the field of image recognition, in particular to pedestrian detection, and specifically to a pedestrian detection method, device and terminal device based on multi-spatial relationship perception.
Background
With the continuous development of smart city construction, many new artificial intelligence technologies are being applied to intelligent transportation, intelligent government affairs, intelligent factories, and so on. Every such application ultimately serves people, so pedestrian detection is a prerequisite for many of them. However, real scenes are often complex: dense crowds cause bodies to overlap and occlude one another, pedestrians may be blocked by objects, illumination varies strongly, and harsh weather (rain, snow, etc.) blurs the image. These real-world conditions make pedestrian detection considerably harder. A pedestrian detection technique is therefore urgently needed that can mine deeper, more discriminative features from the pedestrian regions of an image, sufficient to characterize pedestrians in a variety of environments.
In the course of realizing the present invention, the inventors found at least the following problems in the prior art. Most popular pedestrian detection techniques are based on convolutional neural networks (CNNs), and most CNN pedestrian detection models use a limited receptive field, which makes it difficult to learn rich structural patterns from global information — for example, using a CNN to detect and segment pedestrians to obtain final location information, or combining a CNN with feature fusion for pedestrian detection. Although some methods consider different receptive fields, they do not combine global and local information well. In addition, some methods enhance a model's learning capacity by stacking network depth, which makes both training and deployment very resource-intensive.
Summary of the Invention
To overcome the deficiencies of the prior art, the present invention provides a pedestrian detection method, device and terminal device based on multi-spatial relationship perception. The method attends to global information while also extracting local information, and fuses the two effectively, thereby obtaining more discriminative feature information and improving pedestrian detection performance. The technical solution is as follows.
The present invention provides a pedestrian detection method based on multi-spatial relationship perception, comprising the following steps.
Step 1: collect a pedestrian image data set and resize the images to a fixed size for model training.
Step 2: adopt the YOLOX detection framework, feed the images into the framework model, and first apply data augmentation to the images.
Step 3: feed the augmented images into the Focus module, which slices each image by pixel parity into four sub-images and concatenates them along the channel dimension.
Step 4: feed the concatenated result into the backbone network of the YOLOX detection framework. Three branches are connected to the backbone; they correspond to different receptive fields, and together the three receptive fields cover targets of different sizes.
Step 5: each branch comprises two parts, a multi-spatial relationship perception module and a detection head. The multi-spatial relationship perception module effectively fuses global and local information by combining the relationships between features in different spatial dimensions, yielding a multi-spatial relationship-aware feature map.
The workflow of the multi-spatial relationship perception module is as follows.
The feature map X input to the multi-spatial relationship perception module has dimensions H×W×C, where H is the height, W the width, and C the number of channels.
(1) Construct the relational feature map of the H×W space.
In the H×W space, the feature map X is decomposed into H×W feature vectors of length C. The relationship information mapping feature vector x_i to feature vector x_j is denoted r_{i,j} and computed as r_{i,j} = f_{H×W}(x_i, x_j), where f_{H×W} is formed from two embedding functions ψ_{H×W} and φ_{H×W}, each consisting of a 1×1 convolutional layer, a BatchNormalization layer and a ReLU activation layer. Correspondingly, the relationship information mapping x_j to x_i is r_{j,i} = f_{H×W}(x_j, x_i), so the pair (r_{i,j}, r_{j,i}) describes the bidirectional relationship between x_i and x_j. For a single direction, computing the relationship information between all pairs of feature vectors and stacking the results yields an affinity matrix with H×W channels; the bidirectional relationship therefore yields two different affinity matrices M_1 and M_2, which deeply mine the local information of the features.
The original global structural information is also retained. Specifically, after applying a 1×1 convolution to the original feature map X, a global average pooling operation is performed along the channel direction to obtain a global structural feature map F. The global structural feature map F is concatenated with the two affinity matrices to obtain a feature matrix Y:
Y = [pool(θ_{H×W}(X)), θ_{H×W}(M_1), θ_{H×W}(M_2)],
where pool denotes global average pooling and θ_{H×W} consists of a 1×1 convolutional layer, a BatchNormalization layer and a ReLU activation layer; compared with ψ_{H×W} and φ_{H×W}, its number of output activation nodes differs. The feature matrix Y is then passed through a 1×1 convolution to fuse all the global and local information it contains, yielding the relational feature map of the H×W space.
(2) Construct the relational feature map of the channel space C.
Similarly, in the channel space, the feature map X is decomposed into C feature vectors of length H×W. The relationship information mapping feature vector x_a to feature vector x_b is r_{a,b} = f_C(x_a, x_b), where the embedding functions ψ_C and φ_C are identical in form to ψ_{H×W} and φ_{H×W}, differing only in their output dimensions. The affinity matrices are obtained with the same calculation as in step 5(1); that is, the bidirectional relationship yields two different affinity matrices M′_1 and M′_2.
After applying a 1×1 convolution to the original feature map X, global average pooling is performed over the H×W dimensions to obtain a structural feature map F′. The structural feature map F′ is concatenated with the two affinity matrices to obtain a feature matrix Y′:
Y′ = [pool(θ_C(X)), θ_C(M′_1), θ_C(M′_2)],
where θ_C is identical in form to θ_{H×W}, differing only in its output dimensions. The feature matrix Y′ is then passed through a 1×1 convolution to fuse all the global and local information it contains, yielding the relational feature map of the channel space C.
The relational feature maps of the H×W space and the channel space C are multiplied element-wise to obtain the multi-spatial relationship-aware feature map.
Step 6: feed the multi-spatial relationship-aware feature map into the detection head. YOLOX decouples classification from coordinate regression: a 1×1 convolution first reduces the channel dimension, followed by two lightweight branches that perform classification and regression respectively.
Preferably, the data augmentation of step 2 includes random horizontal flipping, color jittering, multi-scale augmentation and mosaic augmentation.
Preferably, the receptive fields of the three branches in step 4 correspond to downsampling factors of 8, 16 and 32, respectively.
Preferably, in the training phase, the classification loss is cross-entropy, the regression loss is the GIoU loss, and an L1-norm penalty is applied to the predicted location information.
Compared with the prior art, one of the above technical solutions has the following beneficial effect: through the multi-spatial relationship perception module, the relationships between features in different spatial dimensions are deeply mined. The module attends to global information while also extracting local information, fuses the two effectively, and links the feature information of different spaces with the relationship information between features, so that the features learned by the model are more recognizable and discriminative, thereby improving pedestrian detection accuracy.
Description of Drawings
FIG. 1 is a flowchart of a multi-spatial relationship perception module provided by an embodiment of the present disclosure.
Detailed Description
To clarify the technical solution and working principle of the present invention, the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings. All of the optional technical solutions above may be combined arbitrarily to form optional embodiments of the present disclosure, which are not enumerated here one by one.
The terms "step 1", "step 2", "step 3" and similar designations in the specification, claims and drawings of the present application are used to distinguish similar objects, and are not necessarily used to describe a particular order or sequence. It should be understood that designations so used are interchangeable where appropriate, so that the embodiments of the present application described herein can be practiced in orders other than those described here.
First aspect: an embodiment of the present disclosure provides a pedestrian detection method based on multi-spatial relationship perception, comprising the following steps.
Step 1: collect a pedestrian image data set and resize the images to a fixed size for model training.
Step 2: adopt the YOLOX detection framework. This framework has a simple structure and requires no manually set anchor boxes, which makes training and deployment convenient. Feed the images into the framework model and first apply data augmentation. Preferably, the augmentation of step 2 includes random horizontal flipping, color jittering, multi-scale augmentation, mosaic augmentation and the like, so as to enlarge the training set and improve the generalization ability of the model.
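Purely as an illustration, a minimal augmentation pipeline in this spirit is sketched below (PyTorch/torchvision). The parameter values and the multi-scale size set are assumptions rather than values from the patent; mosaic augmentation, which stitches four images and their boxes together, is omitted for brevity, and in a real detector the flip would also have to be applied to the bounding boxes.

```python
# Illustrative augmentation sketch: random horizontal flip, color jitter and
# multi-scale resizing. All parameter values are assumptions, not from the patent.
import random
import torchvision.transforms as T

def augment(img, sizes=(448, 512, 576, 640)):
    tf = T.Compose([
        T.RandomHorizontalFlip(p=0.5),      # random horizontal flip (boxes not handled here)
        T.ColorJitter(0.4, 0.4, 0.4, 0.1),  # color jitter: brightness/contrast/saturation/hue
        T.Resize(random.choice(sizes)),     # multi-scale augmentation: pick a size per image
    ])
    return tf(img)
```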
Step 3: feed the augmented images into the Focus module, which slices each image by pixel parity into four sub-images and concatenates them along the channel dimension. The Focus module downsamples without adding computation while retaining more complete image information.
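The parity slicing can be written as a strided indexing operation. The sketch below follows the YOLOv5/YOLOX Focus convention; the ordering of the four slices along the channel dimension is a convention choice.

```python
import torch

def focus_slice(x: torch.Tensor) -> torch.Tensor:
    # (B, C, H, W) with even H, W  ->  (B, 4C, H/2, W/2): four parity sub-images
    # concatenated along the channel dimension, i.e. downsampling without losing pixels.
    return torch.cat([x[..., 0::2, 0::2],   # even rows, even cols
                      x[..., 1::2, 0::2],   # odd rows,  even cols
                      x[..., 0::2, 1::2],   # even rows, odd cols
                      x[..., 1::2, 1::2]],  # odd rows,  odd cols
                     dim=1)
```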
Step 4: feed the concatenated result into the backbone network of the YOLOX detection framework. Three branches are connected to the backbone; they correspond to different receptive fields, and together the three receptive fields cover targets of different sizes. Preferably, the receptive fields of the three branches correspond to downsampling factors of 8, 16 and 32, respectively.
Step 5: each branch comprises two parts, a multi-spatial relationship perception module and a detection head. The multi-spatial relationship perception module effectively fuses global and local information by combining the relationships between features in different spatial dimensions, yielding a multi-spatial relationship-aware feature map.
FIG. 1 is a working flowchart of a multi-spatial relationship perception module. With reference to this figure, the workflow of the module is as follows.
The feature map X input to the multi-spatial relationship perception module has dimensions H×W×C, where H is the height, W the width, and C the number of channels.
(1) Construct the relational feature map of the H×W space.
In the H×W space, the feature map X is decomposed into H×W feature vectors of length C. The relationship information mapping feature vector x_i to feature vector x_j is denoted r_{i,j} and computed as r_{i,j} = f_{H×W}(x_i, x_j), where f_{H×W} is formed from two embedding functions ψ_{H×W} and φ_{H×W}, each consisting of a 1×1 convolutional layer, a BatchNormalization layer and a ReLU activation layer. Correspondingly, the relationship information mapping x_j to x_i is r_{j,i} = f_{H×W}(x_j, x_i), so the pair (r_{i,j}, r_{j,i}) describes the bidirectional relationship between x_i and x_j. For a single direction, computing the relationship information between all pairs of feature vectors and stacking the results yields an affinity matrix with H×W channels; the bidirectional relationship therefore yields two different affinity matrices M_1 and M_2, which deeply mine the local information of the features.
To exploit the global information of the features at the same time, the original global structural information must be retained. Specifically, after applying a 1×1 convolution to the original feature map X, a global average pooling operation is performed along the channel direction to obtain a global structural feature map F. The global structural feature map F is concatenated with the two affinity matrices to obtain a feature matrix Y:
Y = [pool(θ_{H×W}(X)), θ_{H×W}(M_1), θ_{H×W}(M_2)],
where pool denotes global average pooling and θ_{H×W} consists of a 1×1 convolutional layer, a BatchNormalization layer and a ReLU activation layer; compared with ψ_{H×W} and φ_{H×W}, its number of output activation nodes differs. The feature matrix Y is then passed through a 1×1 convolution to fuse all the global and local information it contains, yielding the relational feature map of the H×W space.
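As one concrete reading of this construction, the following PyTorch sketch implements the H×W-space branch. The exact form of f_{H×W} appears only as an image in the original document, so the pairwise relation is assumed here to be a dot product of the two embedded vectors; all layer widths are illustrative, and since the relation matrix has (H·W)² entries, the sketch is only practical for small feature maps.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout):
    # Embedding block used throughout: 1x1 convolution + BatchNormalization + ReLU.
    return nn.Sequential(nn.Conv2d(cin, cout, 1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class SpatialRelationBranch(nn.Module):
    """Relational feature map of the H×W space (a sketch; spatial size fixed at build time)."""
    def __init__(self, c, h, w, c_embed=32):
        super().__init__()
        n = h * w
        self.psi = conv_bn_relu(c, c_embed)      # ψ_{H×W}
        self.phi = conv_bn_relu(c, c_embed)      # φ_{H×W}
        self.theta_x = conv_bn_relu(c, c_embed)  # θ_{H×W} on X, before channel pooling
        self.theta_m = conv_bn_relu(n, n // 4)   # θ_{H×W} on the affinity matrices
        self.fuse = nn.Conv2d(1 + 2 * (n // 4), c, 1)  # final 1x1 fusion of Y

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        # r_{i,j} assumed to be <ψ(x)_i, φ(x)_j> over the n = H*W positions.
        p = self.psi(x).flatten(2)               # (b, c_embed, n)
        q = self.phi(x).flatten(2)               # (b, c_embed, n)
        rel = torch.bmm(p.transpose(1, 2), q)    # (b, n, n); rel[:, i, j] = r_{i,j}
        # Affinity matrices with n channels: M1 stacks r_{i,j}, M2 stacks r_{j,i}.
        m1 = rel.reshape(b, n, h, w)
        m2 = rel.transpose(1, 2).reshape(b, n, h, w)
        # Global structural feature map F: channel-direction global average pooling.
        f = self.theta_x(x).mean(dim=1, keepdim=True)            # (b, 1, h, w)
        # Y = [pool(θ(X)), θ(M1), θ(M2)], then a 1x1 convolution fuses everything.
        y = torch.cat([f, self.theta_m(m1), self.theta_m(m2)], dim=1)
        return self.fuse(y)                                      # (b, c, h, w)
```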
(2) Construct the relational feature map of the channel space C.
Similarly, in the channel space, the feature map X is decomposed into C feature vectors of length H×W. The relationship information mapping feature vector x_a to feature vector x_b is r_{a,b} = f_C(x_a, x_b), where the embedding functions ψ_C and φ_C are identical in form to ψ_{H×W} and φ_{H×W}, differing only in their output dimensions. The affinity matrices are obtained with the same calculation as in step 5(1); that is, the bidirectional relationship yields two different affinity matrices M′_1 and M′_2.
Unlike the structural feature map F obtained above, in this step a 1×1 convolution is applied to the original feature map X and global average pooling is performed over the H×W dimensions, yielding a structural feature map F′. The structural feature map F′ is concatenated with the two affinity matrices to obtain a feature matrix Y′:
Y′ = [pool(θ_C(X)), θ_C(M′_1), θ_C(M′_2)],
where θ_C is identical in form to θ_{H×W}, differing only in its output dimensions. The feature matrix Y′ is then passed through a 1×1 convolution to fuse all the global and local information it contains, yielding the relational feature map of the channel space C.
The relational feature maps of the H×W space and the channel space C are multiplied element-wise to obtain the multi-spatial relationship-aware feature map. This feature map contains, and fully fuses, the global and local information of the features in different spatial dimensions, improving the effectiveness and discriminative power of the features.
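Continuing the sketch above (and reusing conv_bn_relu and SpatialRelationBranch), the channel-space branch can be realized by treating each of the C channels as a "position" whose feature vector has length H×W; how the channel-space map is shaped and broadcast for the element-wise product is likewise an assumption, not spelled out in the patent.

```python
class ChannelRelationBranch(nn.Module):
    """Relational feature map of the channel space C (a sketch)."""
    def __init__(self, c, n, n_embed=32):
        super().__init__()
        self.psi = conv_bn_relu(n, n_embed)      # ψ_C, applied on the transposed view
        self.phi = conv_bn_relu(n, n_embed)      # φ_C
        self.theta_x = conv_bn_relu(n, n_embed)  # θ_C on X
        self.theta_m = conv_bn_relu(c, c // 4)   # θ_C on the affinity matrices
        self.fuse = nn.Conv2d(1 + 2 * (c // 4), 1, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        xt = x.flatten(2).transpose(1, 2).unsqueeze(-1)  # (b, n, c, 1): channels as positions
        p = self.psi(xt).flatten(2)                      # (b, n_embed, c)
        q = self.phi(xt).flatten(2)
        rel = torch.bmm(p.transpose(1, 2), q)            # (b, c, c); rel[:, a, b] = r_{a,b}
        m1, m2 = rel.unsqueeze(-1), rel.transpose(1, 2).unsqueeze(-1)
        f = self.theta_x(xt).mean(dim=1, keepdim=True)   # pooling over the H×W content
        y = torch.cat([f, self.theta_m(m1), self.theta_m(m2)], dim=1)
        return self.fuse(y).reshape(b, c, 1, 1)          # per-channel map, broadcastable

class MultiSpatialRelationPerception(nn.Module):
    def __init__(self, c, h, w):
        super().__init__()
        self.spatial = SpatialRelationBranch(c, h, w)
        self.channel = ChannelRelationBranch(c, h * w)

    def forward(self, x):
        # Element-wise product of the two relational feature maps; the channel-space
        # map broadcasts over the H and W dimensions.
        return self.spatial(x) * self.channel(x)
```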
Step 6: feed the multi-spatial relationship-aware feature map into the detection head. Unlike the traditional YOLO-series detection heads, which couple classification and coordinate localization during training, YOLOX decouples them: a 1×1 convolution first reduces the channel dimension, followed by two lightweight branches that perform classification and regression respectively, which effectively speeds up model convergence.
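Continuing the same sketches (reusing conv_bn_relu), a minimal decoupled head in the style described here might look as follows; the channel widths and the box-plus-objectness output split are assumptions in the YOLOX style, not values taken from the patent.

```python
class DecoupledHead(nn.Module):
    def __init__(self, cin, num_classes=1, width=128):
        super().__init__()
        self.stem = conv_bn_relu(cin, width)  # 1x1 conv reduces the channel dimension
        self.cls_branch = nn.Sequential(conv_bn_relu(width, width),
                                        nn.Conv2d(width, num_classes, 1))
        self.reg_branch = nn.Sequential(conv_bn_relu(width, width),
                                        nn.Conv2d(width, 4 + 1, 1))  # box (4) + objectness (1)

    def forward(self, x):
        s = self.stem(x)
        return self.cls_branch(s), self.reg_branch(s)  # classification, regression
```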
Preferably, in the training phase, the classification loss is cross-entropy, the regression loss is the GIoU loss, and an L1-norm penalty is applied to the predicted location information.
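A sketch of the training losses named in this paragraph, assuming (x1, y1, x2, y2) box coordinates and equal weighting of the three terms (the weighting is not specified here):

```python
import torch
import torch.nn.functional as F

def giou_loss(pred, target, eps=1e-7):
    # pred, target: (N, 4) boxes as (x1, y1, x2, y2).
    x1, y1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    x2, y2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # Smallest enclosing box provides the GIoU correction term.
    ex1, ey1 = torch.min(pred[:, 0], target[:, 0]), torch.min(pred[:, 1], target[:, 1])
    ex2, ey2 = torch.max(pred[:, 2], target[:, 2]), torch.max(pred[:, 3], target[:, 3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (enclose - union) / (enclose + eps)
    return (1 - giou).mean()

def detection_loss(cls_logits, cls_targets, boxes, box_targets):
    # Cross-entropy for classification, GIoU for regression, L1 penalty on coordinates.
    return (F.cross_entropy(cls_logits, cls_targets)
            + giou_loss(boxes, box_targets)
            + F.l1_loss(boxes, box_targets))
```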
Second aspect: an embodiment of the present disclosure provides a pedestrian detection device based on multi-spatial relationship perception. Based on the same technical concept, the device can implement or execute the pedestrian detection method based on multi-spatial relationship perception described in any one of the possible implementations.
Preferably, the device comprises a data acquisition unit, a first data processing unit, a second data processing unit and a result acquisition unit.
The data acquisition unit is configured to execute step 1 of the pedestrian detection method based on multi-spatial relationship perception described in any one of the possible implementations.
The first data processing unit is configured to execute steps 2 and 3 of the method described in any one of the possible implementations.
The second data processing unit is configured to execute steps 4 and 5 of the method described in any one of the possible implementations.
The result acquisition unit is configured to execute step 6 of the method described in any one of the possible implementations.
It should be noted that when the pedestrian detection device based on multi-spatial relationship perception provided by the above embodiment executes the pedestrian detection method based on multi-spatial relationship perception, the division into the functional modules above is only an example. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the device embodiment and the method embodiment provided above belong to the same concept; for the specific implementation process, refer to the method embodiment, which is not repeated here.
Third aspect: an embodiment of the present disclosure provides a terminal device comprising the pedestrian detection device based on multi-spatial relationship perception described in any one of the possible implementations.
The present invention has been exemplarily described above with reference to the accompanying drawings. Obviously, the specific implementation of the present invention is not limited to the manners described above. Various insubstantial improvements made using the method concept and technical solution of the present invention, or direct applications of the above concept and technical solution to other occasions without improvement or with equivalent replacement, all fall within the protection scope of the present invention.
Claims (7)
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202111510823.5A | 2021-12-11 | 2021-12-11 | Pedestrian detection method and device based on multi-spatial relationship sensing and terminal equipment
Publications (2)

Publication Number | Publication Date
---|---
CN114332919A | 2022-04-12
CN114332919B | 2024-10-29
Patent Citations (5)

Publication number | Priority date | Publication date | Title
---|---|---|---
CN110796239A | 2019-10-30 | 2020-02-14 | A deep learning target detection method based on channel and space fusion perception
CN111369543A | 2020-03-07 | 2020-07-03 | A fast pollen particle detection algorithm based on dual self-attention modules
CN112733693A | 2021-01-04 | 2021-04-30 | Multi-scale residual road extraction method for globally aware high-resolution remote sensing images
CN113505640A | 2021-05-31 | 2021-10-15 | Small-scale pedestrian detection method based on multi-scale feature fusion
CN113567984A | 2021-07-30 | 2021-10-29 | A method and system for detecting small artificial targets in SAR images
Non-Patent Citations (1)

Title
---
Nie Wei; Cao Yue; Zhu Dongxue; Zhu Yixuan; Huang Linyi: "Action recognition algorithm based on an edge-perception learning network under complex surveillance backgrounds", Computer Applications and Software, no. 08, 12 August 2020 (2020-08-12)
Cited By (5)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN114663861A | 2022-05-17 | 2022-06-24 | 山东交通学院 | Vehicle re-identification method based on dimension decoupling and non-local relations
CN115082855A | 2022-06-20 | 2022-09-20 | 安徽工程大学 | Pedestrian occlusion detection method based on an improved YOLOX algorithm
CN115082855B | 2022-06-20 | 2024-07-12 | 安徽工程大学 | Pedestrian occlusion detection method based on an improved YOLOX algorithm
CN115311690A | 2022-10-08 | 2022-11-08 | 广州英码信息科技有限公司 | End-to-end detection method for pedestrian structural information and its dependency relationships
CN115311690B | 2022-10-08 | 2022-12-23 | 广州英码信息科技有限公司 | End-to-end detection method for pedestrian structural information and its dependency relationships
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112308092B (en) | Light-weight license plate detection and identification method based on multi-scale attention mechanism | |
CN114332919A (en) | A pedestrian detection method, device and terminal device based on multi-spatial relationship perception | |
CN113095152B (en) | Regression-based lane line detection method and system | |
CN116665176B (en) | A multi-task network road target detection method for autonomous vehicle driving | |
CN107358576A (en) | Depth map super resolution ratio reconstruction method based on convolutional neural networks | |
CN110298387A (en) | Incorporate the deep neural network object detection method of Pixel-level attention mechanism | |
CN110348383B (en) | Road center line and double line extraction method based on convolutional neural network regression | |
CN107808376B (en) | A Deep Learning-Based Hand Raised Detection Method | |
CN115205264A (en) | A high-resolution remote sensing ship detection method based on improved YOLOv4 | |
CN114529982B (en) | Lightweight human body posture estimation method and system based on streaming attention | |
CN111599007B (en) | Smart city CIM road mapping method based on unmanned aerial vehicle aerial photography | |
CN112862690A (en) | Transformers-based low-resolution image super-resolution method and system | |
CN105488777A (en) | System and method for generating panoramic picture in real time based on moving foreground | |
CN106372630A (en) | Face direction detection method based on deep learning | |
CN103077538B (en) | Adaptive tracking method of biomimetic-pattern recognized targets | |
CN116935486A (en) | Sign language identification method and system based on skeleton node and image mode fusion | |
CN116612427A (en) | Intensive pedestrian detection system based on improved lightweight YOLOv7 | |
CN116363526A (en) | MROCNet model construction and multi-source remote sensing image change detection method and system | |
CN107944437A (en) | A kind of Face detection method based on neutral net and integral image | |
CN115909488A (en) | An occluded person re-identification method based on pose guidance and dynamic feature extraction | |
CN109272450B (en) | Image super-resolution method based on convolutional neural network | |
CN114399728A (en) | A crowd counting method in foggy scene | |
CN117612029B (en) | A remote sensing image target detection method based on progressive feature smoothing and scale-adaptive dilated convolution | |
Li et al. | Monocular 3-D Object Detection Based on Depth-Guided Local Convolution for Smart Payment in D2D Systems | |
CN118430011A (en) | Robust 2D human pose estimation method |
Legal Events

- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant