CN110569757B

CN110569757B - Multi-posture pedestrian detection method based on deep learning and computer storage medium

Info

Publication number: CN110569757B
Application number: CN201910792451.6A
Authority: CN
Inventors: 毛亮; 赵丽旦; 冶继民; 朱婷婷; 王祥雪; 谭焕新; 黄仝宇; 汪刚
Original assignee: Xidian University; Gosuncn Technology Group Co Ltd
Current assignee: Xidian University; Gosuncn Technology Group Co Ltd
Priority date: 2019-08-26
Filing date: 2019-08-26
Publication date: 2022-05-06
Anticipated expiration: 2039-08-26
Also published as: CN110569757A

Abstract

The invention provides a multi-pose pedestrian detection method based on deep learning and a computer storage medium. The method comprises the following steps: S1, defining multiple pedestrian postures and generating a data set of multi-posture pedestrian targets; S2, classifying the data set by pedestrian posture, and dividing the data of each posture into a training set and a test set; S3, combining the training sets of all pedestrian postures into a total training set and training on it to obtain a trained model; S4, testing the test set of each pedestrian posture separately using the trained model; and S5, detecting pedestrians according to the test results. By classifying pedestrian postures, the method according to embodiments of the invention can effectively detect pedestrians in different postures and improves, to a certain extent, the detection accuracy for pedestrians in different postures in complex environments.

Description

Multi-pose pedestrian detection method based on deep learning and computer storage medium

Technical Field

The invention relates to the field of target detection, in particular to a multi-pose pedestrian detection method based on deep learning and a computer storage medium.

Background

Pedestrians are non-rigid targets. In many complex real-world scenes, pedestrian targets appear in multiple postures, such as sitting, standing, lying and squatting. Even the same target takes different postures at different moments, and a pedestrian's shape also changes with the viewing angle. Pedestrian detection, as a branch of target detection, is the precondition and basis for pedestrian re-identification and pedestrian tracking. Its main task is to detect pedestrians in the input data and determine their positions. It is widely applied in fields such as intelligent video surveillance, human-computer interaction and driver assistance, and has good development potential and practical value. At present, academia has carried out a great deal of research on target detection, and existing pedestrian detection algorithms mainly fall into the following categories:

1. Pedestrian detection based on machine learning: image features such as Haar and HOG are first extracted, and a classifier is then used for classification;

2. Pedestrian detection based on deep learning: by constructing a deep neural network loosely modeled on the structure of the human brain and analyzing input data such as images in depth, deep learning has achieved remarkable results in image recognition and speech recognition. Deep learning target detection mainly comprises candidate-region classification methods and regression-based detection algorithms.

To apply target detection to real-time video, the detection speed on a single picture must be increased continuously while accuracy is maintained. The one-stage target detection algorithms YOLO and SSD were therefore proposed; they convert target detection into a regression problem and greatly improve detection speed. YOLO (You Only Look Once) combines target localization and target recognition into one step and greatly improves speed, reaching 45 frames per second. However, YOLO misses many small targets, and its loss function does not distinguish bounding boxes (bbox) of different sizes, so its accuracy is slightly lower than that of Faster R-CNN.

SSD improves on YOLO by combining the regression idea of YOLO with the anchor mechanism of Faster R-CNN. It performs regression using multi-scale regional features at each position of the whole image, which retains the high speed of YOLO while making window prediction as accurate as that of Faster R-CNN. Unlike YOLO, which uses only the last layer for classification and regression, SSD performs classification and regression prediction on the outputs of several layers. On VOC2007 its mAP reaches 72.1%, and its speed reaches 58 frames per second on a GPU. This performance shows that target detection is genuinely feasible in practical applications, so SSD is widely used.

Pedestrian detection is a branch of target detection, and most deep learning-based pedestrian detection algorithms are obtained by improving the target detection algorithms above so that they perform the pedestrian detection task.

Existing detection methods do not design the network specifically around the posture diversity and scale diversity of the target. Because pedestrian targets are active and move unpredictably, the pedestrians in a single picture of a real scene take many different postures and shapes. The extracted features therefore differ greatly, which makes it harder for the classifier to judge whether the current target is a pedestrian and lowers the detection accuracy of pedestrian detection algorithms. The main difficulties of pedestrian detection are:

1. Low resolution: a pedestrian target may be far from the observation point, so the target occupies few pixels.

2. Unstable picture quality: owing to illumination, weather and shooting quality, pictures may suffer from blur, overexposure, dimness, noise and similar problems, which makes the detection performance of the algorithm unstable.

3. Non-rigid deformation: a human being is a non-rigid object whose body parts can rotate about the joints. For example, between standing and walking, the appearance in the image changes considerably because of local limb motion, while the basic features of the torso and head remain unchanged. Detection must therefore consider both the appearance of each body part and the positional relations between the parts, which increases the complexity of the detection model.

4. Multiple viewing angles: besides non-rigid deformation, the appearance of a target changes with the shooting angle, which can also be understood as differences in the 2D projection caused by rotation in the 3D scene. Detection must therefore consider both the common features of targets of the same class and the differences caused by three-dimensional rotation.

5. Occlusion: in simple scenes, targets are sparsely positioned, rarely affect each other and can be detected independently. In practical application scenes, however, targets very commonly overlap in space. Because occlusion can occur in different local regions of a target, predicting possible occlusion, or using context information to compensate for the loss caused by the occluded region, is one of the difficulties of current research.

Disclosure of Invention

In view of this, the invention provides a deep learning-based multi-pose pedestrian detection method and a computer storage medium, which can effectively improve the detection accuracy of pedestrians in different poses.

To solve the above technical problem, in one aspect the invention provides a deep learning-based multi-pose pedestrian detection method comprising the following steps: S1, defining multiple pedestrian postures and generating a data set of multi-posture pedestrian targets; S2, classifying the data set by pedestrian posture, and dividing the data of each posture into a training set and a test set; S3, combining the training sets of all pedestrian postures into a total training set and training on it to obtain a trained model; S4, testing the test set of each pedestrian posture separately using the trained model; and S5, detecting pedestrians according to the test results.

According to the multi-pose pedestrian detection method based on deep learning, various pedestrian poses are defined in advance, the data set formed by multi-pose pedestrian targets is classified, then training and testing are carried out, and finally pedestrian detection can be carried out according to the test result.

According to some embodiments of the invention, in step S1 the pedestrian postures include a bending posture, a kneeling posture, a lying posture, a sitting posture, a standing posture and an occluded posture. The bending posture is defined as at least one part of the pedestrian's body being bent; the kneeling posture is defined as at least one knee of the pedestrian touching another object; the lying posture is defined as the pedestrian being horizontal; the sitting posture is defined as the pedestrian's buttocks touching another object; the standing posture is defined as the pedestrian standing with the body unbent; and the occluded posture is defined as only part of the pedestrian's body being visible.

According to some embodiments of the present invention, in step S3, the total training set is fed into a network for training to obtain the trained model.

According to some embodiments of the invention, training the total training set in the network comprises: S31, acquiring multi-pose pedestrian samples from the total training set; S32, performing feature extraction on the multi-pose pedestrian samples using a fully convolutional network; S33, re-extracting the features extracted in step S32; S34, predicting target boxes using the anchor-based feature selection branch and the anchor-free feature selection branch simultaneously; and S35, removing redundant target boxes by non-maximum suppression to obtain the trained model.

According to some embodiments of the invention, in step S32, the backbone network VGG of the fully convolutional network uses a dual-branch operation to perform feature extraction on the multi-pose pedestrian samples.

According to some embodiments of the invention, in step S33, the features extracted in step S32 are re-extracted using a feature re-extraction module.

According to some embodiments of the present invention, the feature re-extraction module includes two branches. The first branch performs channel compression through a 1 × 1 convolution with a stride of 1 and 128 output channels, learns offsets through a 3 × 3 convolution layer, and then applies a deformable convolution with a 3 × 3 kernel, which shifts the spatial sampling positions to follow geometric deformation while extracting features; a 1 × 1 convolution then restores the number of channels. The second branch uses a 1 × 1 convolution with a stride of 1 and 256 output channels to adjust the number of channels so that it matches that of the first branch. The two branches are merged by element-wise addition.

According to some embodiments of the invention, in step S34, the anchor-free feature selection branch adds an FSAF (feature selective anchor-free) module to perform anchor-free feature selection at the feature-map level.

According to some embodiments of the invention, the FSAF module introduces two additional convolutional layers after each extracted feature for predictive classification and regression in the anchor-free feature selection branch.

According to some embodiments of the invention, step S4 further comprises: based on the test results, adjusting the network according to the spatial distribution and sample characteristics of any pedestrian posture whose detection accuracy is lower than a preset accuracy value.

In a second aspect, embodiments of the present invention provide a computer storage medium comprising one or more computer instructions that, when executed, implement a method as in the above embodiments.

Drawings

FIG. 1 is a flow chart of a deep learning-based multi-pose pedestrian detection method according to an embodiment of the present invention;

FIG. 2 is a flow chart of training of a total training set in a network according to an embodiment of the present invention;

FIG. 3 is a network architecture diagram of network training according to an embodiment of the present invention;

FIG. 4 is a diagram of a feature re-extraction module in an embodiment of the invention;

FIG. 5 is a schematic diagram of an electronic device according to an embodiment of the invention.

Reference numerals:

an electronic device 300;

a memory 310; an operating system 311; an application 312;

a processor 320; a network interface 330; an input device 340; a hard disk 350; a display device 360.

Detailed Description

The following detailed description of embodiments of the present invention will be made with reference to the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

The following first describes a deep learning-based multi-pose pedestrian detection method according to an embodiment of the present invention in detail with reference to the accompanying drawings.

As shown in fig. 1, the deep learning-based multi-pose pedestrian detection method according to the embodiment of the present invention includes the following steps:

S1, defining multiple pedestrian postures and generating a data set of multi-posture pedestrian targets.

S2, classifying the data set by pedestrian posture, and dividing the data of each posture into a training set and a test set.

S3, combining the training sets of all pedestrian postures into a total training set and training on it to obtain a trained model.

S4, testing the test set of each pedestrian posture separately using the trained model.

S5, detecting pedestrians according to the test results.

In other words, the deep learning-based multi-pose pedestrian detection method according to the embodiment of the invention improves on the one-stage object detector SSD to account for the diversity of pedestrian postures. Before detecting pedestrians, the method first defines the different postures in which a pedestrian may appear and prepares a data set of multi-posture pedestrian targets, that is, a data set containing pedestrians in each of these postures. The data of each posture is then divided into a training set and a test set. Next, the training sets of all postures are combined into a total training set for training, the trained model is tested on the test set of each posture separately, and finally pedestrians are detected according to the test results.
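The data preparation in steps S2 and S3 can be sketched as follows. This is a minimal illustration only: the function name `split_by_pose`, the 20% test fraction and the `(image_id, pose)` sample format are assumptions made for the example and are not specified in the source.

```python
import random
from collections import defaultdict

# The six posture labels named in the text.
POSES = ["bending", "kneeling", "lying", "sitting", "standing", "occluded"]

def split_by_pose(samples, test_fraction=0.2, seed=0):
    """Split (image_id, pose) pairs per posture, then merge the training parts.

    Returns (total_train, test_sets): one merged training list (step S3)
    and one test list per posture (used separately in step S4).
    test_fraction is an assumed value, not taken from the source.
    """
    rng = random.Random(seed)
    by_pose = defaultdict(list)
    for image_id, pose in samples:
        by_pose[pose].append(image_id)          # S2: classify by posture

    total_train, test_sets = [], {}
    for pose, ids in by_pose.items():
        ids = sorted(ids)
        rng.shuffle(ids)
        n_test = max(1, int(len(ids) * test_fraction))
        test_sets[pose] = ids[:n_test]          # per-posture test set
        total_train.extend((i, pose) for i in ids[n_test:])  # S3: merge
    return total_train, test_sets
```

Keeping one test set per posture is what allows the accuracy of each posture to be inspected on its own after training.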

Therefore, according to the deep learning-based multi-pose pedestrian detection method, multiple pedestrian poses are predefined, a data set consisting of multi-pose pedestrian targets is classified, then training and testing are performed, and finally pedestrian detection can be performed according to a test result.

According to an embodiment of the invention, in step S1, the pedestrian postures include a bending posture, a kneeling posture, a lying posture, a sitting posture, a standing posture and an occluded posture.

Specifically, the bending posture is defined as at least one part of the pedestrian's body being bent; the kneeling posture as at least one knee touching another object; the lying posture as the pedestrian being horizontal; the sitting posture as the pedestrian's buttocks touching another object; the standing posture as the pedestrian standing with the body unbent; and the occluded posture as only part of the pedestrian's body being visible.

That is, the six most common pedestrian postures are selected and defined: bending, kneeling, lying, sitting, standing and occluded. The posture types and their descriptions are shown in Table 1.

TABLE 1 description of six pedestrian poses

Therefore, the most common six postures of the pedestrian are defined, the possible actions of the pedestrian target when being detected are basically covered, and the problems that the intra-class difference is large due to the posture diversity generated by the non-rigid deformation of the pedestrian target and the appearance of the target changes due to multi-angle shooting can be effectively solved.

In some embodiments of the present invention, in step S3, the total training set is fed into a network for training to obtain the trained model.

Further, the step of training the total training set in the network includes:

S31, acquiring multi-pose pedestrian samples from the total training set.

S32, performing feature extraction on the multi-pose pedestrian samples using a fully convolutional network.

S33, re-extracting the features extracted in step S32.

S34, predicting target boxes using the anchor-based feature selection branch and the anchor-free feature selection branch simultaneously.

S35, removing redundant target boxes by non-maximum suppression to obtain the trained model.

In step S32, the backbone network VGG of the fully convolutional network uses a dual-branch operation to perform feature extraction on the multi-pose pedestrian samples. In step S33, a feature re-extraction module re-extracts the features extracted in step S32.

In some embodiments of the present invention, the feature re-extraction module includes two branches. The first branch performs channel compression through a 1 × 1 convolution with a stride of 1 and 128 output channels, learns offsets through a 3 × 3 convolution layer, and then applies a deformable convolution with a 3 × 3 kernel, which shifts the spatial sampling positions to follow geometric deformation while extracting features; a 1 × 1 convolution then restores the number of channels. The second branch uses a 1 × 1 convolution with a stride of 1 and 256 output channels to adjust the number of channels so that it matches that of the first branch. The two branches are merged by element-wise addition.

Preferably, in step S34, the anchor-free feature selection branch adds an FSAF module for anchor-free feature selection at the feature map level.

Further, the FSAF module introduces two additional convolution layers after each extracted feature for predictive classification and regression in the anchor-free feature selection branch.

That is, in the deep learning-based multi-pose pedestrian detection method according to the embodiment of the present invention, the total training set is trained in a network, and the training flow is shown in fig. 2. A fully convolutional network first performs feature extraction. A feature re-extraction module then re-extracts the extracted features so that the network can fully learn the information of the target. Finally, an anchor-based feature selection branch and an anchor-free feature selection branch predict target boxes, and non-maximum suppression removes redundant target boxes.
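The final step of this flow, non-maximum suppression, can be illustrated with the standard greedy procedure. The sketch below assumes boxes given as `(x1, y1, x2, y2)` tuples and an IoU threshold of 0.5; neither the box format nor the threshold is stated in the source.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first.

    iou_threshold is an assumed value chosen for illustration.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too strongly.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Because both the anchor-based and the anchor-free branch propose boxes, the same pedestrian is often predicted more than once, which is why a suppression step of this kind is needed.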

The specific network structure design of the network training process is as follows:

The first down-sampling operation of the backbone network VGG is replaced by the dual-branch operation shown in fig. 3, which effectively improves feature expression capability without increasing computational complexity. After the SSD feature extraction layers, features are re-extracted by the feature re-extraction module shown in fig. 4. The module has two branches. The first branch performs channel compression through a 1 × 1 convolution with a stride of 1 and 128 output channels; a 3 × 3 convolution then learns offsets for a deformable convolution with a 3 × 3 kernel, which shifts the spatial sampling positions to follow geometric deformation while extracting features; a 1 × 1 convolution then restores the number of channels. The second branch uses a 1 × 1 convolution with a stride of 1 and 256 output channels to adjust the number of channels so that it matches that of the first branch. The two branches are then merged by element-wise addition.
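To make the channel bookkeeping of the feature re-extraction module concrete, the following sketch traces tensor shapes through both branches. It is an illustration under stated assumptions, not the patented implementation: it assumes "same" padding on the 3 × 3 layers, that the deformable convolution keeps 128 channels, that the final 1 × 1 convolution restores 256 channels (implied by the second branch), and that the offset layer outputs 2 × 3 × 3 = 18 channels, one (dy, dx) pair per kernel sampling point as in standard deformable convolution. The function names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def trace_re_extraction(height, width, kernel=3):
    """Return (main_branch_steps, shortcut_step) as (name, channels, h, w)."""
    pad = kernel // 2  # assumption: "same" padding keeps the spatial size
    h, w = conv_out(height, 1), conv_out(width, 1)
    main = [("1x1 compress", 128, h, w)]
    h, w = conv_out(h, kernel, padding=pad), conv_out(w, kernel, padding=pad)
    # One (dy, dx) pair per kernel sampling point: 2 * 3 * 3 = 18 channels.
    main.append(("3x3 offset conv", 2 * kernel * kernel, h, w))
    # The deformable conv reads the 128-channel features, steered by offsets.
    main.append(("3x3 deformable conv", 128, h, w))
    main.append(("1x1 restore", 256, h, w))
    shortcut = ("1x1 shortcut", 256, conv_out(height, 1), conv_out(width, 1))
    # Element-wise addition needs identical channel count and spatial size.
    assert main[-1][1:] == shortcut[1:]
    return main, shortcut
```

For a hypothetical 38 × 38 input feature map, `trace_re_extraction(38, 38)` ends both branches at 256 channels and 38 × 38 resolution, which is the condition for the element-wise addition to be valid.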

Considering the posture diversity of pedestrian targets and the uncertainty of target size and aspect ratio, the deep learning-based multi-pose pedestrian detection method according to the embodiment of the invention introduces an anchor-free feature selection module and optimizes the network according to the content of the target. It is used as follows: each feature re-extraction module is followed by two branches. One branch adds the FSAF module to perform anchor-free feature selection at the feature-map level, so that each instance can freely select the best feature level for optimizing the network according to the target content, rather than being assigned a level by bounding-box size as in anchor-based approaches. The module adds two convolutional layers after each extracted feature, used respectively for classification and regression prediction in the anchor-free branch.
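The idea of letting each instance choose its own feature level can be sketched as below. The selection criterion used here, the smallest sum of classification and regression loss across levels, follows the online feature selection described for FSAF in the literature; the source text does not spell out the criterion, so it should be read as an assumption. The function name and the loss values in the usage note are hypothetical.

```python
def select_level(losses_per_level):
    """Pick the feature level with the smallest combined loss for one instance.

    losses_per_level: list of (classification_loss, regression_loss) pairs,
    one pair per feature level, computed for the same ground-truth instance.
    """
    totals = [cls_loss + reg_loss for cls_loss, reg_loss in losses_per_level]
    return min(range(len(totals)), key=totals.__getitem__)
```

For example, `select_level([(0.9, 0.5), (0.3, 0.2), (0.6, 0.1)])` selects level 1, regardless of how large the instance's bounding box is.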

Specifically, classification in the anchor-free branch is implemented by a 3 × 3 convolution followed by a sigmoid function, and regression in the anchor-free branch is implemented by a 3 × 3 convolution followed by a ReLU layer. The convolutional layer used for classification has k output channels (k is the number of target classes) and is followed by the focal loss function; the convolutional layer used for regression has 4 output channels (representing the coordinate offsets) and is followed by the IoU loss function.
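The two losses named above can be written out for a single prediction as follows. This is a minimal sketch: the focal loss parameters `alpha=0.25` and `gamma=2.0` are commonly used defaults rather than values from the source, and the IoU loss is taken in its negative-logarithm form, which is one common definition and is assumed here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one class score; target is 0 or 1.

    alpha and gamma are assumed defaults, not values from the source.
    """
    p = sigmoid(logit)
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # The (1 - p_t) ** gamma factor down-weights easy, well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))

def iou_loss(pred, gt):
    """IoU loss as -ln(IoU) for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    value = inter / union if union > 0 else 0.0
    return -math.log(max(value, 1e-12))
```

A confident correct prediction contributes almost nothing to the focal loss, so training concentrates on hard samples such as unusual postures, while the IoU loss is zero only when the predicted box coincides with the ground truth.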

The other branch performs anchor-based feature selection and likewise comprises a classification sub-network and a regression sub-network. The convolutional layer in the classification sub-network has A × k output channels (k is the number of target classes and A is the number of preset anchor boxes), and the convolutional layer in the regression sub-network has 4 × k output channels. Combining anchor-based and anchor-free feature selection makes it easier to find challenging targets.

It should be noted that current pedestrian detection mainly targets upright pedestrians, and no network has been designed specifically for the posture diversity of pedestrian targets or for the target deformation and scale diversity caused by different shooting angles. Existing methods are therefore suitable only for a single scene in which targets share a single posture; in complex scenes containing targets in several different postures, false detections and missed detections occur easily and the detection results are unsatisfactory. In the deep learning-based multi-pose pedestrian detection method, the feature re-extraction module uses deformable convolution to adjust the spatial sampling positions, which better suits multi-pose pedestrian targets with diverse geometric variation. At the same time, anchor-free and anchor-based feature selection are combined to generate target boxes, so that targets of different scales can make use of the receptive field and spatial information of each layer. The detector can therefore improve the detection accuracy for targets of different scales and different deformations.

Therefore, in the deep learning-based multi-pose pedestrian detection method, re-extracting the extracted features with the feature re-extraction module yields more effective feature information. Deformable convolution improves the ability of the CNN to model geometric deformation by changing the spatial sampling positions, which improves detection of targets with different deformations. In addition, the re-extracted features undergo anchor-free and anchor-based feature selection simultaneously, so that the network learns how to assign targets to feature levels; targets of different sizes can then make full use of the receptive field and spatial information of each layer according to their content, which improves the detection accuracy for targets of different sizes.
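How a deformable convolution changes its spatial sampling positions can be shown on a single-channel feature map. The sketch below computes one output value of a 3 × 3 deformable convolution: each of the nine sampling points is shifted by its own (dy, dx) offset and read with bilinear interpolation. It is a simplified illustration; the function names are hypothetical, and in a real network the offsets would be produced by the 3 × 3 offset layer rather than passed in by hand.

```python
import math

def bilinear(feature, y, x):
    """Sample a 2D list at fractional (y, x); out-of-range cells read as 0."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value

def deformable_conv_at(feature, weights, offsets, cy, cx):
    """Output of a 3x3 deformable convolution centred at (cy, cx).

    weights: 3x3 list of kernel weights.
    offsets: 3x3 list of (dy, dx) shifts, one per sampling point.
    """
    out = 0.0
    for i, dy in enumerate((-1, 0, 1)):
        for j, dx in enumerate((-1, 0, 1)):
            oy, ox = offsets[i][j]
            # Regular grid position plus the learned displacement.
            out += weights[i][j] * bilinear(feature, cy + dy + oy, cx + dx + ox)
    return out
```

With all offsets set to zero the function reduces to an ordinary 3 × 3 convolution; non-zero offsets let the kernel follow the outline of a bent or lying body instead of a fixed square grid.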

In some embodiments of the present invention, step S4 further includes: based on the test results, for any pedestrian posture with a low detection accuracy in terms of mAP (mean Average Precision), for example one below a preset accuracy value, adjusting the network according to the spatial distribution and sample characteristics of that posture. The detection accuracy for targets of different scales and different deformations can thereby be further improved.
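The check described here, finding the postures whose accuracy falls below a preset value, can be sketched as a simple filter over per-posture test results. The function name and the numbers in the usage note are hypothetical and serve only to illustrate the comparison; they are not results from the source.

```python
def low_accuracy_poses(accuracy_by_pose, preset_value):
    """Return the postures whose accuracy is below preset_value, worst first."""
    flagged = [p for p, acc in accuracy_by_pose.items() if acc < preset_value]
    return sorted(flagged, key=lambda p: accuracy_by_pose[p])
```

For instance, with made-up accuracies `{"standing": 0.9, "lying": 0.6, "kneeling": 0.7, "sitting": 0.8}` and a preset value of 0.75, the function returns `["lying", "kneeling"]`, which are the postures whose spatial distribution and sample characteristics would then be examined when adjusting the network.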

In summary, the multi-pose pedestrian detection method based on deep learning according to the embodiment of the invention is a detection method provided for solving the problems that the intra-class difference is large due to the pose diversity generated by the non-rigid deformation of a pedestrian target and the appearance of the target changes due to multi-angle shooting.

In addition, the present invention also provides a computer storage medium, which includes one or more computer instructions, and when executed, the one or more computer instructions implement any one of the deep learning-based multi-pose pedestrian detection methods described above.

That is, the computer storage medium stores a computer program that, when executed by a processor, causes the processor to execute any one of the deep learning-based multi-pose pedestrian detection methods described above.

As shown in fig. 5, an embodiment of the present invention provides an electronic device 300, which includes a memory 310 and a processor 320, where the memory 310 is configured to store one or more computer instructions, and the processor 320 is configured to call and execute the one or more computer instructions, so as to implement any one of the methods described above.

That is, the electronic device 300 includes: a processor 320 and a memory 310, in which memory 310 computer program instructions are stored, wherein the computer program instructions, when executed by the processor, cause the processor 320 to perform any of the methods described above.

Further, as shown in fig. 5, the electronic device 300 further includes a network interface 330, an input device 340, a hard disk 350, and a display device 360.

The interfaces and devices described above may be interconnected by a bus architecture, which may include any number of interconnected buses and bridges. Specifically, the bus architecture couples together various circuits of one or more central processing units (CPUs), represented by the processor 320, and of one or more memories, represented by the memory 310. The bus architecture may also connect various other circuits, such as peripherals, voltage regulators and power management circuits. It will be appreciated that the bus architecture enables communication among these components. In addition to a data bus, the bus architecture includes a power bus, a control bus and a status signal bus, all of which are well known in the art and are therefore not described in detail herein.

The network interface 330 may be connected to a network (e.g., the internet, a local area network, etc.), and may obtain relevant data from the network and store the relevant data in the hard disk 350.

The input device 340 may receive various commands input by an operator and send the commands to the processor 320 for execution. The input device 340 may include a keyboard or a pointing device (e.g., a mouse, a trackball, a touch pad, a touch screen, or the like).

The display device 360 may display the result of the instructions executed by the processor 320.

The memory 310 is used for storing programs and data necessary for operating the operating system, and data such as intermediate results in the calculation process of the processor 320.

It will be appreciated that memory 310 in embodiments of the invention may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. The memory 310 of the apparatus and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.

In some embodiments, memory 310 stores the following elements, executable modules or data structures, or a subset thereof, or an expanded set thereof: an operating system 311 and application programs 312.

The operating system 311 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs 312 include various application programs, such as a Browser (Browser), and are used for implementing various application services. A program implementing methods of embodiments of the present invention may be included in application 312.

The method disclosed in the above embodiments of the present invention can be applied to, or implemented by, the processor 320. The processor 320 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 320 or by instructions in the form of software. The processor 320 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, and may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM or registers. The storage medium is located in the memory 310; the processor 320 reads the information in the memory 310 and completes the steps of the method in combination with its hardware.

It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof.

For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.

In particular, the processor 320 is also configured to read the computer program and execute any of the methods described above.

In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. The apparatus embodiments described above are merely illustrative: the division of the units is only one logical division, and other divisions are possible in practice; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be physically included alone, or two or more units may be integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) to execute some of the steps of the method according to the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A multi-pose pedestrian detection method based on deep learning is characterized by comprising the following steps of:

S1, defining multiple pedestrian postures and generating a data set of multi-posture pedestrian targets;

S2, classifying the data set according to the different pedestrian postures, and dividing the data set of each pedestrian posture into a training set and a test set;

S3, combining the training sets of all pedestrian postures into a total training set for training to obtain a training model;

S4, testing the test sets of the different pedestrian postures separately by using the training model;

S5, detecting pedestrians according to the test results;

in step S3, the total training set is placed in a network for training, so as to obtain the training model;

the step of training the total training set in the network comprises:

S31, acquiring multi-pose pedestrian samples in the total training set;

S32, performing feature extraction on the multi-pose pedestrian samples by using a fully convolutional network;

S33, re-extracting the features extracted in step S32;

S34, predicting target frames by using the anchor-based feature selection branch and the anchor-free feature selection branch simultaneously;

S35, removing redundant target frames by using non-maximum suppression to obtain the training model;

in step S33, feature re-extraction is performed on the features extracted in step S32 using a feature re-extraction module;

the feature re-extraction module comprises two branches; one branch performs channel compression through a 1 × 1 convolution with a stride of 1 and 128 output channels, learns offsets through a 3 × 3 convolution layer, then applies a deformable convolution with a 3 × 3 kernel that shifts the spatial sampling positions according to the geometric deformation while extracting features, and finally restores the number of channels through a 1 × 1 convolution;

the other branch uses a 1 × 1 convolution with a stride of 1 and 256 output channels to adjust the number of channels so that it matches that of the first branch;

and the two branches are fused by element-wise addition.
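The deformable-convolution step of the first branch can be illustrated with a minimal sketch. It is not the patented implementation: it handles a single channel, takes the per-tap offsets as given inputs (in the claim they are learned by the 3 × 3 convolution layer), treats positions outside the feature map as zero, and uses the hypothetical function names `bilinear_sample` and `deformable_conv3x3_at`.

```python
import math


def bilinear_sample(fmap, y, x):
    """Read a 2-D feature map at a fractional (y, x) position.

    Positions outside the map contribute 0 (an assumption of this sketch).
    """
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * fmap[yy][xx]
    return value


def deformable_conv3x3_at(fmap, kernel, offsets, cy, cx):
    """Response of a 3x3 deformable convolution at output location (cy, cx).

    offsets holds one (dy, dx) pair per kernel tap, in row-major order.
    Each tap samples the regular grid position shifted by its offset.
    """
    out = 0.0
    tap = 0
    for ky in (-1, 0, 1):
        for kx in (-1, 0, 1):
            dy, dx = offsets[tap]
            out += kernel[ky + 1][kx + 1] * bilinear_sample(
                fmap, cy + ky + dy, cx + kx + dx
            )
            tap += 1
    return out


fmap = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

# Zero offsets reduce to an ordinary 3x3 convolution.
regular = deformable_conv3x3_at(fmap, kernel, [(0.0, 0.0)] * 9, 1, 1)
# Shifting every tap half a pixel to the right changes what is sampled.
shifted = deformable_conv3x3_at(fmap, kernel, [(0.0, 0.5)] * 9, 1, 1)
```

With all offsets set to zero the result equals a regular convolution, which is why the offsets can be read as a learned correction to the fixed sampling grid.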

2. The method according to claim 1, wherein in step S1, the pedestrian postures comprise a bending posture, a kneeling posture, a lying posture, a sitting posture, a standing posture and an occluded posture,

the bending posture is defined as a state in which at least a part of the body of the pedestrian is bent;

the kneeling posture is defined as at least one knee of the pedestrian being in contact with another object;

the lying posture is defined as the pedestrian being in a horizontal position;

the sitting posture is defined as the buttocks of the pedestrian being in contact with another object;

the standing posture is defined as the pedestrian standing with the body not bent;

the occluded posture is defined as only a part of the pedestrian's body being visible.
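Steps S2 and S3 of claim 1, applied to the six postures of claim 2, can be sketched as follows. This is a hedged illustration only: the 80/20 split ratio, the random seed, the file-name scheme, and the function name `split_by_posture` are assumptions and do not come from the claims.

```python
import random
from collections import defaultdict

# Posture labels taken from claim 2.
POSTURES = ("bending", "kneeling", "lying", "sitting", "standing", "occluded")


def split_by_posture(samples, train_fraction=0.8, seed=0):
    """Group samples by posture, split each group, and merge the training parts.

    samples: list of (image_id, posture) pairs.
    Returns (total_train, tests_by_posture). The split ratio is an assumption.
    """
    groups = defaultdict(list)
    for image_id, posture in samples:
        if posture not in POSTURES:
            raise ValueError(f"unknown posture: {posture}")
        groups[posture].append(image_id)

    rng = random.Random(seed)
    total_train = []          # S3: one combined training set
    tests_by_posture = {}     # S2/S4: one test set per posture
    for posture in POSTURES:
        ids = list(groups[posture])
        rng.shuffle(ids)
        cut = int(len(ids) * train_fraction)
        total_train.extend((image_id, posture) for image_id in ids[:cut])
        tests_by_posture[posture] = ids[cut:]
    return total_train, tests_by_posture


# Hypothetical data set: ten images per posture.
samples = [(f"{p}_{i:03d}.jpg", p) for p in POSTURES for i in range(10)]
total_train, tests_by_posture = split_by_posture(samples)
```

Keeping the test sets separate per posture is what allows step S4 to report results for each posture individually.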

3. The method according to claim 1, wherein in step S32, the VGG backbone network of the fully convolutional network performs feature extraction on the multi-pose pedestrian samples through a two-branch operation.

4. The method of claim 1, wherein in step S34, the anchor-free feature selection branch incorporates an FSAF module to perform anchor-free feature selection.

5. The method of claim 4, wherein the FSAF module adds two additional convolutional layers after each extracted feature, which are used for classification and regression prediction in the anchor-free feature selection branch.
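The target frames predicted by the anchor-based and anchor-free branches in step S34 are pooled before the redundant ones are removed in step S35. A minimal sketch of that merge-and-suppress stage, using standard greedy non-maximum suppression, is given below. The box format `(x1, y1, x2, y2)`, the IoU threshold of 0.5, the example detections, and the function names are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    detections: list of (box, score). Boxes are visited in descending score
    order and kept only if they do not overlap an already kept box by more
    than iou_threshold.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Hypothetical outputs of the two branches for one image.
anchor_based = [((10, 10, 50, 90), 0.90), ((200, 40, 240, 120), 0.60)]
anchor_free = [((12, 12, 52, 92), 0.80), ((100, 20, 140, 100), 0.70)]

final = nms(anchor_based + anchor_free)
```

In this example the anchor-free box at `(12, 12, 52, 92)` overlaps the higher-scoring anchor-based box heavily and is suppressed, while the pedestrian found only by the anchor-free branch is retained.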

6. The method according to claim 1, wherein step S4 further comprises: combining the test results, and adjusting the network according to the spatial distribution and sample characteristics of the pedestrian postures whose detection accuracy is lower than a preset accuracy value.
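The selection of postures in claim 6 can be sketched as a per-posture comparison against the preset value. The sketch assumes accuracy is a simple ratio of correct detections to test samples, which the claim does not specify; the counts, the threshold of 0.8, and the function name `postures_below_threshold` are hypothetical.

```python
def postures_below_threshold(results, preset_accuracy):
    """Return the postures whose detection accuracy is below the preset value.

    results: dict mapping posture -> (correct_detections, total_samples).
    The ratio correct / total stands in for the accuracy metric here.
    """
    flagged = {}
    for posture, (correct, total) in results.items():
        accuracy = correct / total if total else 0.0
        if accuracy < preset_accuracy:
            flagged[posture] = accuracy
    return flagged


# Hypothetical per-posture test results from step S4.
results = {
    "standing": (95, 100),
    "lying": (70, 100),
    "occluded": (62, 100),
}
flagged = postures_below_threshold(results, preset_accuracy=0.8)
```

The flagged postures are the ones whose spatial distribution and sample characteristics would then guide the network adjustment.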

7. A computer storage medium comprising one or more computer instructions which, when executed, implement the method of any one of claims 1-6.