CN111783716A - Pedestrian detection method, system and device based on attitude information

Pedestrian detection method, system and device based on attitude information

Info

Publication number
CN111783716A
CN111783716A (application CN202010664330.6A)
Authority
CN
China
Prior art keywords
pedestrian
description
network
confidence score
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010664330.6A
Other languages
Chinese (zh)
Inventor
徐常胜
姚涵涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN202010664330.6A
Publication of CN111783716A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The invention belongs to the field of pedestrian detection, and particularly relates to a pedestrian detection method, system and device based on attitude (i.e., pose) information, aiming at solving the problem that the accuracy of existing pedestrian detection methods cannot meet requirements in multi-person environments. The method comprises the following steps: obtaining pedestrian candidate boxes and corresponding first confidence scores score_r based on a pre-trained region extraction network; acquiring a comprehensive description of each pedestrian candidate box based on a pre-trained pedestrian recognition network, performing binary classification based on the description, and taking the classification result as a second confidence score score_p, the comprehensive description comprising a visual description f_v and a pose description f_p; and acquiring a third confidence score based on score_r and score_p, then thresholding it against a set confidence threshold to determine pedestrians. The invention alleviates the occlusion and false-detection problems commonly encountered in pedestrian detection tasks and improves the accuracy of pedestrian detection.

Description

Pedestrian detection method, system and device based on attitude information
Technical Field
The invention belongs to the field of pedestrian detection, and particularly relates to a pedestrian detection method, system and device based on attitude information.
Background
As a special branch of object detection, pedestrian detection has received great attention in both academia and industry; its goal is to predict where pedestrians are located in a given image and to represent them by a series of bounding boxes. Beyond early studies based on hand-crafted features, pedestrian detection using convolutional neural networks has made tremendous progress over the past few years.
Recently, researchers have demonstrated that models based on convolutional neural networks help improve the performance of pedestrian detection. These models can be divided into two categories: pedestrian detection with anchor points and pedestrian detection without anchor points. Generally, a detection model with anchor points generates a large number of target candidate boxes and then judges, through a classifier, whether each candidate box contains a pedestrian. The disadvantage of this approach is that most candidate boxes are redundant, so much time is wasted in learning the feature representation. To avoid this problem, researchers have designed anchor-free detectors that predict pedestrians directly from pictures. While existing methods can locate pedestrians in a given picture, they are not robust to occluded-pedestrian detection.
Because real-world scenes such as streets are often crowded with pedestrians and various objects, occlusion is a key problem in pedestrian detection. To address this challenge, researchers have attempted to model pedestrians using visual descriptions. However, when the background is similar to a pedestrian, visual descriptions alone are not sufficient to distinguish occluded pedestrians from the background. Since a detection model with anchor points can generate candidate boxes for occluded pedestrians, the core problem in occlusion detection is how to generate a robust description to filter occluded pedestrians.
Disclosure of Invention
In order to solve the above problems in the prior art, that is, to solve the problem that the accuracy of the existing pedestrian detection method cannot meet the requirement in a multi-person environment, a first aspect of the present invention provides a pedestrian detection method based on attitude information, the method including the following steps:
Step S100, acquiring pedestrian candidate boxes and corresponding first confidence scores score_r based on a pre-trained region extraction network;
Step S200, acquiring a comprehensive description of each pedestrian candidate box based on a pre-trained pedestrian recognition network, performing binary classification based on the description, and taking the classification result as a second confidence score score_p; the comprehensive description comprises a visual description f_v and a pose description f_p;
Step S300, acquiring a third confidence score based on score_r and score_p, and determining pedestrians by thresholding against a set confidence threshold;
wherein,
the pedestrian recognition network comprises a visual feature module, a human pose module and a classification module; the visual feature module is constructed based on a feature extraction network and is used for acquiring the visual description; the human pose module is constructed based on a convolutional neural network and is used for acquiring the pose description f_p; the classification module is a binary classification network and is used for acquiring the second confidence score score_p based on the comprehensive description.
In some preferred embodiments, the region extraction network is constructed based on an object detection network, and its loss function L_rpn is
$$L_{rpn} = \sum_i L_{cls}(p_i, p_i^*) + \gamma \sum_i p_i^* L_{reg}(t_i, t_i^*)$$
where L_cls is a binary cross-entropy loss, L_reg is the regression loss, γ is a preset coordination parameter, p_i is the predicted probability of the i-th pedestrian candidate box, p_i* is the ground-truth classification label of the i-th pedestrian candidate box, t_i is the vector of coordinates of the i-th pedestrian candidate box, and t_i* is the vector of coordinates of the ground-truth pedestrian annotation box corresponding to the i-th candidate box.
In some preferred embodiments, the classification loss L_cls is:
$$L_{cls}(p_i, p_i^*) = -\log\left[p_i p_i^* + (1 - p_i)(1 - p_i^*)\right]$$
and the regression loss L_reg is:
$$L_{reg}(t_i, t_i^*) = \sum_{j \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\left(t_i^j - t_i^{*j}\right)$$
$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
in some preferred embodiments, the visual feature module is composed of a top 10 layer network of VGG-19 and a convolution block, and obtains the visual description f based on the pedestrian candidate boxvDescription of the vision f by a full link layervCarry out two classifications to obtain confidence score1
In some preferred embodiments, the human pose module comprises a feature extraction network, a first sub-network, a second sub-network and a fully connected layer;
the feature extraction network is constructed based on the convolutional layers of VGG-19 and is used for extracting a feature map F of the pedestrian candidate box;
the first sub-network and the second sub-network are each constructed based on a convolutional neural network, and predict the confidence map S and association field L of the corresponding pedestrian candidate box based on the feature map F;
the fully connected layer is used for obtaining the pose description f_p based on the confidence map S and the association field L, and obtaining a confidence score_2.
In some preferred embodiments, the classification module is configured to obtain a confidence score_3 based on the visual description f_v and the pose description f_p, and to obtain the second confidence score score_p by weighted summation of score_1, score_2 and score_3 with preset weighting coefficients.
In some preferred embodiments, the third confidence score is calculated by:
score = α·score_r + β·score_p
wherein α and β are preset weighting parameters.
In some preferred embodiments, one or more of the visual feature module, the human pose module and the classification module are each constrained by a corresponding cross-entropy loss function during training.
In a second aspect of the present invention, a pedestrian detection system based on attitude information is provided, the system comprising a first unit, a second unit, and a third unit:
the first unit is configured to acquire a pedestrian candidate frame and a corresponding first confidence score based on a pre-trained region extraction networkr
The second unit is configured to acquire a comprehensive description of the pedestrian candidate frame based on a pre-trained pedestrian recognition network, perform secondary classification based on the description, and use a classification result as a second confidence scorep(ii) a The integrated description comprises a visual description fvAnd attitude description fp
The third unit is configured to calculate score based on a preset weightrAnd scorepTaking the sum as a third confidence score, and then, executing the range on the set confidence threshold value to determine the pedestrian;
wherein the content of the first and second substances,
the pedestrian recognition network comprises a visual feature module, a human body posture module and a classification module; the visual feature module is constructed based on a feature extraction network and is used for acquiring the visual description; the human body posture module is constructed based on a convolutional neural network and is used for acquiring the posture description fp(ii) a The classification module is a two-classification network and is used for acquiring a second confidence score based on the comprehensive descriptionp
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above-described pedestrian detection method based on attitude information.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described pedestrian detection method based on attitude information.
The invention has the beneficial effects that:
the pedestrian detection method and the pedestrian detection device can well solve the problems of shielding and false detection commonly existing in the pedestrian detection task, and improve the accuracy of pedestrian detection. The invention can be well embedded into any existing detector (with or without anchor points), thereby greatly improving the detection efficiency and the generalization.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a flow chart of a pedestrian detection method based on attitude information according to an embodiment of the present invention;
FIG. 2 is a block diagram of a pedestrian detection network based on attitude information in one embodiment of the present invention;
fig. 3 is a detailed structural diagram of a pedestrian identification network according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention discloses a pedestrian detection method based on attitude information, which, as shown in FIG. 1, comprises the following steps:
Step S100, acquiring pedestrian candidate boxes and corresponding first confidence scores score_r based on a pre-trained region extraction network;
Step S200, acquiring a comprehensive description of each pedestrian candidate box based on a pre-trained pedestrian recognition network, performing binary classification based on the description, and taking the classification result as a second confidence score score_p; the comprehensive description comprises a visual description f_v and a pose description f_p;
Step S300, acquiring a third confidence score based on score_r and score_p, and determining pedestrians by thresholding against a set confidence threshold;
wherein,
the pedestrian recognition network comprises a visual feature module, a human pose module and a classification module; the visual feature module is constructed based on a feature extraction network and is used for acquiring the visual description; the human pose module is constructed based on a convolutional neural network and is used for acquiring the pose description f_p; the classification module is a binary classification network and is used for acquiring the second confidence score score_p based on the comprehensive description.
In order to more clearly explain the pedestrian detection method based on the attitude information, the following will describe each step in an embodiment of the method in detail with reference to the accompanying drawings.
The detection method in an embodiment of the present invention relies on a trained network, obtained by first constructing the corresponding detection network and then training it; the technical solution is therefore described below starting from the construction of the detection network to be trained.
The detection network on which the method of the invention is implemented comprises a region extraction network, a pedestrian recognition network and a detection output network, as shown in FIG. 2.
For convenience of description, the training samples are described as follows: for the picture I corresponding to a training sample, all n pedestrians present in picture I are determined and located with rectangular boxes T* = {t_1*, t_2*, …, t_n*}, where the ground-truth box coordinates are t_i* = [x_i*, y_i*, w_i*, h_i*], (x_i*, y_i*) being the coordinates of the center point of the rectangular box and w_i*, h_i* its width and height.
1. Region extraction network
Any existing target detector may be used as the region extraction network for global modeling, generating a series of pedestrian candidate boxes and corresponding confidence scores.
The network is optimized by a multi-task loss function L_rpn:
$$L_{rpn} = \sum_i L_{cls}(p_i, p_i^*) + \gamma \sum_i p_i^* L_{reg}(t_i, t_i^*)$$
where L_cls is a binary cross-entropy loss, L_reg is the regression loss, γ is a preset coordination parameter, p_i is the predicted probability of the i-th pedestrian candidate box, p_i* is the ground-truth classification label of the i-th pedestrian candidate box, t_i is the vector of coordinates of the i-th pedestrian candidate box, and t_i* is the vector of coordinates of the ground-truth pedestrian annotation box corresponding to the i-th candidate box.
In this embodiment, p_i* = 1 when the ratio of the intersection to the union (IoU) between the target box i and any ground-truth box is greater than 0.5, and p_i* = 0 otherwise.
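As a non-limiting illustration, this labelling rule can be written in a few lines; the following PyTorch fragment is a sketch (tensor layout and function names are assumptions, not part of the patent), using torchvision's box_iou:

    import torch
    from torchvision.ops import box_iou

    def assign_labels(candidate_boxes, gt_boxes, iou_thresh=0.5):
        # candidate_boxes: (N, 4), gt_boxes: (M, 4), both as (x1, y1, x2, y2)
        ious = box_iou(candidate_boxes, gt_boxes)   # (N, M) pairwise IoU
        best_iou, _ = ious.max(dim=1)               # best overlap per candidate
        return (best_iou > iou_thresh).float()      # p*_i = 1 if IoU > 0.5, else 0
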
The classification loss L_cls is:
$$L_{cls}(p_i, p_i^*) = -\log\left[p_i p_i^* + (1 - p_i)(1 - p_i^*)\right]$$
and the regression loss L_reg is:
$$L_{reg}(t_i, t_i^*) = \sum_{j \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\left(t_i^j - t_i^{*j}\right)$$
$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
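Taken together, the two terms can be combined as in the sketch below, which operates on the parameterised coordinates t_i, t_i* defined next. This is a minimal PyTorch rendering of the L_rpn formula under the assumption that each loss is averaged over the candidates; it is illustrative, not a definitive implementation:

    import torch
    import torch.nn.functional as F

    def rpn_loss(p, p_star, t, t_star, gamma=1.0):
        # For p* in {0, 1}, binary cross-entropy equals -log[p·p* + (1-p)(1-p*)]
        l_cls = F.binary_cross_entropy(p, p_star)
        # smooth-L1 regression, counted only for positive candidates (p* = 1)
        pos = p_star > 0
        l_reg = F.smooth_l1_loss(t[pos], t_star[pos]) if pos.any() else p.sum() * 0
        return l_cls + gamma * l_reg                # γ balances the two terms
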
wherein, ti=[tx,ty,tw,th]Is a vector representing the predicted candidate box coordinates,
Figure BDA0002579788070000079
is tiCorresponding real frame coordinates.
Figure BDA00025797880700000710
Figure BDA00025797880700000711
Figure BDA00025797880700000712
Figure BDA00025797880700000713
Wherein x, y, w, h respectively represent the center coordinates and width and height of the candidate frame, xa、ya、wa、haRepresenting the coordinates of the center point of the anchor box and the width and height, x, respectively*、y*、w*、h*Representing the coordinates of the center point of the real box and the width and height, respectively.
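This parameterisation can be written directly; the following sketch assumes boxes and anchors are given as (x_center, y_center, w, h) tensors, which is an assumption for illustration:

    import torch

    def encode_boxes(boxes, anchors):
        # t_x = (x - x_a)/w_a, t_y = (y - y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a)
        x, y, w, h = boxes.unbind(dim=1)
        xa, ya, wa, ha = anchors.unbind(dim=1)
        # the same formulas yield t* when ground-truth boxes are passed in
        return torch.stack([(x - xa) / wa, (y - ya) / ha,
                            torch.log(w / wa), torch.log(h / ha)], dim=1)
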
To eliminate redundant detection results generated for the same pedestrian, all candidate boxes may be fused using non-maximum suppression (NMS) with an IoU threshold of 0.5.
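As an illustration, this fusion step can use torchvision's ready-made non-maximum suppression; the boxes and scores below are made-up example values:

    import torch
    from torchvision.ops import nms

    # Illustrative candidate boxes (x1, y1, x2, y2) and confidence scores
    boxes = torch.tensor([[10., 10., 110., 210.],
                          [12., 14., 112., 208.],   # near-duplicate of the first
                          [300., 40., 360., 180.]])
    scores = torch.tensor([0.92, 0.85, 0.70])

    keep = nms(boxes, scores, iou_threshold=0.5)    # IoU threshold 0.5 as in the text
    boxes, scores = boxes[keep], scores[keep]       # redundant box suppressed
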
2. Pedestrian recognition network
After candidate boxes possibly containing pedestrians are generated using the region extraction network, the local candidate regions are modeled using the pedestrian recognition network, which optimizes the confidence scores of the candidate regions by obtaining visual feature descriptions and human pose descriptions, and removes false-detection boxes. The pedestrian recognition network is composed of three modules, namely a visual feature module, a human pose module and a classification module, as shown in FIG. 3.
(1) Visual feature module
For a pedestrian candidate box output by the region extraction network, its pixels are first resized to 256 × 256; the candidate box is then fed into the visual feature module to obtain a 128-dimensional visual description f_v, and the visual description is binary-classified by a fully connected layer to obtain a confidence score_1. During training this module is constrained by a cross-entropy loss L_v, which compares the predicted probability of the background and the predicted probability of the pedestrian against the ground-truth label, whose value is 0 or 1.
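A minimal sketch of such a visual feature module is given below, assuming that "first 10 layers" means the first 10 convolutional layers of torchvision's VGG-19 and that the convolution block ends in global pooling; the layer sizes are assumptions, not values fixed by the patent:

    import torch
    import torch.nn as nn
    from torchvision.models import vgg19

    class VisualFeatureModule(nn.Module):
        def __init__(self):
            super().__init__()
            # First 10 convolutional layers of VGG-19 (through conv4_2 + ReLU)
            self.backbone = vgg19(weights="IMAGENET1K_V1").features[:23]
            self.conv_block = nn.Sequential(          # extra convolution block
                nn.Conv2d(512, 128, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1),              # (B, 128, 1, 1)
            )
            self.classifier = nn.Linear(128, 2)       # background vs pedestrian

        def forward(self, x):                         # x: (B, 3, 256, 256)
            f = self.backbone(x)                      # (B, 512, 32, 32)
            f_v = self.conv_block(f).flatten(1)       # 128-dimensional f_v
            score1 = self.classifier(f_v).softmax(dim=1)[:, 1]  # pedestrian prob.
            return f_v, score1
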
(2) Human pose module
The human pose module comprises a feature extraction network, a first sub-network, a second sub-network and a fully connected layer. For each pedestrian candidate box resized to 256 × 256 pixels, a feature map F of the candidate box is first extracted by the feature extraction network, constructed from the convolutional layers of VGG-19; then the first sub-network and the second sub-network, each constructed based on a convolutional neural network, respectively predict the confidence map S and the association field L of the corresponding candidate box from the feature map F (the confidence map and the association field respectively represent the key points of the human pose information and the connection relations between the points); finally, the pose description f_p is obtained through the fully connected layer based on the confidence map S and the association field L, together with a confidence score_2.
The acquisition of the pose description f_p can be divided into the following stages:
In the first stage, the human pose module generates a confidence map and an association field
$$S^1 = \rho^1(F), \qquad L^1 = \phi^1(F)$$
where ρ¹ and φ¹ are convolutional neural networks each formed by three 3 × 3 convolutional layers and two 1 × 1 convolutional layers;
In each subsequent stage, the predictions of the two sub-networks from the previous stage are combined with the feature map F of the original image to generate new predictions, detailed as follows:
$$S^t = \rho^t(F, S^{t-1}, L^{t-1})$$
$$L^t = \phi^t(F, S^{t-1}, L^{t-1})$$
where ρ^t and φ^t (t being the stage index, t ≥ 2) are convolutional neural networks each formed by five 7 × 7 convolutional layers and two 1 × 1 convolutional layers;
In the last stage, the confidence map S^6 and the association field L^6 are combined to obtain the human pose description f_p.
The human pose module can be parameter-initialized with a trained OpenPose model, and its parameters are fixed and not updated when the whole pedestrian recognition network is trained. The pose information is then input into the fully connected layer to obtain the 128-dimensional pose description f_p, and a fully connected layer is used to binary-classify the pose description to obtain the confidence score_2. During training this module is constrained by a cross-entropy loss L_p, which compares the predicted probability of the background and the predicted probability of the pedestrian against the ground-truth label, whose value is 0 or 1.
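The stage recursion above matches the OpenPose architecture; the sketch below is an assumed PyTorch rendering in which the channel widths (128 intermediate channels, 19 confidence maps, 38 association-field channels, 6 stages) are illustrative defaults rather than values fixed by the patent:

    import torch
    import torch.nn as nn

    def conv_stack(in_ch, mid_ch, out_ch, k, n):
        # n k×k convolutional layers followed by two 1×1 convolutional layers
        layers, ch = [], in_ch
        for _ in range(n):
            layers += [nn.Conv2d(ch, mid_ch, k, padding=k // 2), nn.ReLU(inplace=True)]
            ch = mid_ch
        layers += [nn.Conv2d(ch, mid_ch, 1), nn.ReLU(inplace=True),
                   nn.Conv2d(mid_ch, out_ch, 1)]
        return nn.Sequential(*layers)

    class PoseModule(nn.Module):
        def __init__(self, feat_ch=128, s_ch=19, l_ch=38, stages=6):
            super().__init__()
            self.rho1 = conv_stack(feat_ch, 128, s_ch, k=3, n=3)   # S^1 = ρ¹(F)
            self.phi1 = conv_stack(feat_ch, 128, l_ch, k=3, n=3)   # L^1 = φ¹(F)
            in_ch = feat_ch + s_ch + l_ch                          # F, S, L stacked
            self.rho_t = nn.ModuleList(
                conv_stack(in_ch, 128, s_ch, k=7, n=5) for _ in range(stages - 1))
            self.phi_t = nn.ModuleList(
                conv_stack(in_ch, 128, l_ch, k=7, n=5) for _ in range(stages - 1))

        def forward(self, F):                                      # F: feature map
            S, L = self.rho1(F), self.phi1(F)
            for rho, phi in zip(self.rho_t, self.phi_t):           # stages t = 2 … 6
                x = torch.cat([F, S, L], dim=1)
                S, L = rho(x), phi(x)
            return S, L                                            # S^6, L^6 → f_p
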
(3) Classification module
After the visual description f_v and the pose description f_p are obtained, they are combined into a 256-dimensional description, which is then binary-classified through several fully connected layers; this classification is constrained during training by a cross-entropy loss L, which compares the predicted probability of the background and the predicted probability of the pedestrian against the ground-truth label, whose value is 0 or 1. In this module, a binary classification confidence score_3 is obtained from the visual description f_v and the pose description f_p through the fully connected layers.
A weighted summation of score_1, score_2 and score_3 with preset weighting coefficients yields the second confidence score score_p. For example, weighting coefficients e_1, e_2, e_3 may be set; the second confidence score score_p is then
score_p = score_1·e_1 + score_2·e_2 + score_3·e_3
where e_1 + e_2 + e_3 = 1.
In this embodiment, the detailed structure of the pedestrian recognition network is shown in FIG. 3, and the network is constrained by the loss function L_prn, expressed as follows:
$$L_{prn} = L + \lambda_2 L_v + \lambda_3 L_p$$
where L, L_v and L_p are the loss functions of the classification module, the visual feature module and the human pose module respectively, and the two hyperparameters are set to λ_2 = λ_3 = 0.5.
During training, the pedestrian recognition network is trained as a whole based on the loss function L_prn.
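This joint constraint can be written as a one-line combination of the three module losses; the sketch below assumes each module exposes binary logits and that all three share the same pedestrian/background labels:

    import torch.nn.functional as F

    def prn_loss(logits_cls, logits_v, logits_p, labels, lam2=0.5, lam3=0.5):
        # L_prn = L + λ2·L_v + λ3·L_p with λ2 = λ3 = 0.5 as stated above
        L   = F.cross_entropy(logits_cls, labels)   # classification module
        L_v = F.cross_entropy(logits_v, labels)     # visual feature module
        L_p = F.cross_entropy(logits_p, labels)     # human pose module
        return L + lam2 * L_v + lam3 * L_p
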
3. Detection output network
The confidence score_r output by the region extraction network and the confidence score_p output by the pedestrian recognition network are fused as the final confidence score of the generated candidate region:
$$score = \alpha \cdot score_r + \beta \cdot score_p$$
where score_r and score_p each correspond to the predicted probability of the pedestrian (as opposed to the predicted probability of the background) from the respective network, and α and β are weighting parameters. When the fused score is low, the candidate region is determined to be background.
The detection network is trained based on pre-constructed training samples to obtain the optimal parameters of each sub-network, yielding the optimized network.
Based on the optimized network, the pedestrian detection method based on the attitude information comprises the following steps:
step S100, acquiring a pedestrian candidate frame and a corresponding first confidence score based on a pre-trained region extraction networkr
Step S200, acquiring comprehensive description of the pedestrian candidate frame based on the pre-trained pedestrian recognition network, performing secondary classification based on the description, and taking a classification result as a second confidence scorep(ii) a The integrated description comprises a visual description fvAnd attitude description fp
Step S300, based on scorerAnd scorepAnd acquiring a third confidence score, and executing the range on the set confidence threshold value to determine the pedestrian.
A pedestrian detection system based on attitude information according to a second embodiment of the present invention includes a first unit, a second unit, and a third unit:
the first unit is configured to acquire a pedestrian candidate frame and a corresponding first confidence score based on a pre-trained region extraction networkr
The second unit is configured to acquire a comprehensive description of the pedestrian candidate frame based on a pre-trained pedestrian recognition network, perform secondary classification based on the description, and use a classification result as a second confidence scorep(ii) a The integrated description comprises a visual description fvAnd attitude description fp
The third unit is configured to calculate score based on a preset weightrAnd scorepTaking the sum as a third confidence score, and then, executing the range on the set confidence threshold value to determine the pedestrian;
wherein the content of the first and second substances,
the pedestrian recognition network comprises a visual feature module, a human body posture module and a classification module(ii) a The visual feature module is constructed based on a feature extraction network and is used for acquiring the visual description; the human body posture module is constructed based on a convolutional neural network and is used for acquiring the posture description fp(ii) a The classification module is a two-classification network and is used for acquiring a second confidence score based on the comprehensive descriptionp
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be noted that the pedestrian detection system based on attitude information provided in the foregoing embodiment is only illustrated by the division of the above functional modules. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the modules or steps in the embodiment of the present invention may be further decomposed or combined. For example, the modules in the foregoing embodiment may be combined into one module, or further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps and are not to be construed as unduly limiting the present invention.
A storage device of a third embodiment of the present invention stores therein a plurality of programs adapted to be loaded and executed by a processor to implement the above-described pedestrian detection method based on attitude information.
A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described pedestrian detection method based on attitude information.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program, when executed by a Central Processing Unit (CPU), performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (11)

1. A pedestrian detection method based on attitude information is characterized by comprising the following steps:
Step S100, acquiring pedestrian candidate boxes and corresponding first confidence scores score_r based on a pre-trained region extraction network;
Step S200, acquiring a comprehensive description of each pedestrian candidate box based on a pre-trained pedestrian recognition network, performing binary classification based on the description, and taking the classification result as a second confidence score score_p; the comprehensive description comprises a visual description f_v and a pose description f_p;
Step S300, calculating the weighted sum of score_r and score_p with preset weights as a third confidence score, and then determining pedestrians by thresholding against a set confidence threshold;
wherein,
the pedestrian recognition network comprises a visual feature module, a human pose module and a classification module; the visual feature module is constructed based on a feature extraction network and is used for acquiring the visual description; the human pose module is constructed based on a convolutional neural network and is used for acquiring the pose description f_p; the classification module is a binary classification network and is used for acquiring the second confidence score score_p based on the comprehensive description.
2. The pedestrian detection method based on attitude information according to claim 1, wherein the region extraction network is constructed based on an object detection network, and its loss function L_rpn is
$$L_{rpn} = \sum_i L_{cls}(p_i, p_i^*) + \gamma \sum_i p_i^* L_{reg}(t_i, t_i^*)$$
where L_cls is a binary cross-entropy loss, L_reg is the regression loss, γ is a preset coordination parameter, p_i is the predicted probability of the i-th pedestrian candidate box, p_i* is the ground-truth classification label of the i-th pedestrian candidate box, t_i is the vector of coordinates of the i-th pedestrian candidate box, and t_i* is the vector of coordinates of the ground-truth pedestrian annotation box corresponding to the i-th candidate box.
3. The pedestrian detection method based on attitude information according to claim 2, wherein the classification loss L_cls is:
$$L_{cls}(p_i, p_i^*) = -\log\left[p_i p_i^* + (1 - p_i)(1 - p_i^*)\right]$$
and the regression loss L_reg is:
$$L_{reg}(t_i, t_i^*) = \sum_{j \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\left(t_i^j - t_i^{*j}\right)$$
$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
4. The pedestrian detection method based on attitude information according to claim 1, wherein the visual feature module is composed of the first 10 layers of VGG-19 and a convolution block, and obtains the visual description f_v based on the pedestrian candidate box; the visual description f_v is binary-classified by a fully connected layer to obtain a confidence score_1.
5. The pedestrian detection method based on attitude information according to claim 4, wherein the human pose module comprises a feature extraction network, a first sub-network, a second sub-network and a fully connected layer;
the feature extraction network is constructed based on the convolutional layers of VGG-19 and is used for extracting a feature map F of the pedestrian candidate box;
the first sub-network and the second sub-network are each constructed based on a convolutional neural network, and predict the confidence map S and association field L of the corresponding pedestrian candidate box based on the feature map F;
the fully connected layer is used for obtaining the pose description f_p based on the confidence map S and the association field L, and obtaining a confidence score_2.
6. The pedestrian detection method based on attitude information according to claim 5, wherein the classification module is configured to obtain a confidence score_3 based on the visual description f_v and the pose description f_p, and to obtain the second confidence score score_p by weighted summation of score_1, score_2 and score_3 with preset weighting coefficients.
7. The pedestrian detection method based on the attitude information of claim 6, wherein the third confidence score is calculated by:
score = α·score_r + β·score_p
wherein α and β are preset weighting parameters.
8. The pedestrian detection method based on attitude information according to any one of claims 1 to 7, wherein one or more of the visual feature module, the human pose module and the classification module are each constrained by a corresponding cross-entropy loss function during training.
9. A pedestrian detection system based on attitude information is characterized by comprising a first unit, a second unit and a third unit:
the first unit is configured to acquire a pedestrian candidate frame and a corresponding first confidence score based on a pre-trained region extraction networkr
The second unit is configured to acquire a comprehensive description of the pedestrian candidate frame based on a pre-trained pedestrian recognition network, perform secondary classification based on the description, and use a classification result as a second confidence scorep(ii) a The integrated description comprises a visual description fvAnd attitude description fp
The third unit is configured to calculate score based on a preset weightrAnd scorepTaking the sum as a third confidence score, and then, executing the range on the set confidence threshold value to determine the pedestrian;
wherein the content of the first and second substances,
the pedestrian recognition network comprises a visual feature module, a human body posture module and a classification module; the visual feature module is constructed based on a feature extraction network and is used for acquiring the visual description; the human body posture module is constructed based on a convolutional neural network and is used for acquiring the posture description fp(ii) a The classification module is a two-classification network and is used for acquiring a second confidence score based on the comprehensive descriptionp
10. A storage device having stored therein a plurality of programs, characterized in that the programs are adapted to be loaded and executed by a processor to implement the pedestrian detection method based on attitude information according to any one of claims 1 to 8.
11. A processing device comprising a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; characterized in that the program is adapted to be loaded and executed by a processor to implement the pedestrian detection method based on attitude information according to any one of claims 1 to 8.
CN202010664330.6A 2020-07-10 2020-07-10 Pedestrian detection method, system and device based on attitude information Pending CN111783716A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010664330.6A CN111783716A (en) 2020-07-10 2020-07-10 Pedestrian detection method, system and device based on attitude information


Publications (1)

Publication Number Publication Date
CN111783716A true CN111783716A (en) 2020-10-16

Family

ID=72767368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010664330.6A Pending CN111783716A (en) 2020-07-10 2020-07-10 Pedestrian detection method, system and device based on attitude information

Country Status (1)

Country Link
CN (1) CN111783716A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560649A (en) * 2020-12-09 2021-03-26 广州云从鼎望科技有限公司 Behavior action detection method, system, equipment and medium
CN114821818A (en) * 2022-06-29 2022-07-29 广东信聚丰科技股份有限公司 Motion data analysis method and system based on intelligent sports
CN114863556A (en) * 2022-04-13 2022-08-05 上海大学 Multi-neural-network fusion continuous action recognition method based on skeleton posture


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279742A (en) * 2013-05-24 2013-09-04 中国科学院自动化研究所 Multi-resolution pedestrian detection method and device based on multi-task model
CN108537136A (en) * 2018-03-19 2018-09-14 复旦大学 The pedestrian's recognition methods again generated based on posture normalized image

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ROSS GIRSHICK: "Fast R-CNN", arXiv:1504.08083v2 (https://arxiv.org/abs/1504.08083v2) *
Y. JIAO ET AL.: "PEN: Pose-Embedding Network for Pedestrian Detection", IEEE Transactions on Circuits and Systems for Video Technology *
Z. CAO ET AL.: "Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields", 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *
杨露菁 et al.: "Intelligent Image Processing and Applications" (智能图像处理及应用), China Railway Publishing House, 31 March 2019 *


Similar Documents

Publication Publication Date Title
Arietta et al. City forensics: Using visual elements to predict non-visual city attributes
CN111178183B (en) Face detection method and related device
CN112734775B (en) Image labeling, image semantic segmentation and model training methods and devices
CN110287960A (en) The detection recognition method of curve text in natural scene image
CN111126258A (en) Image recognition method and related device
CN111783716A (en) Pedestrian detection method, system and device based on attitude information
US11308714B1 (en) Artificial intelligence system for identifying and assessing attributes of a property shown in aerial imagery
CN108805016B (en) Head and shoulder area detection method and device
CN110674685B (en) Human body analysis segmentation model and method based on edge information enhancement
CN113537070B (en) Detection method, detection device, electronic equipment and storage medium
CN109671055B (en) Pulmonary nodule detection method and device
CN111931764A (en) Target detection method, target detection framework and related equipment
CN113919497A (en) Attack and defense method based on feature manipulation for continuous learning ability system
CN115187786A (en) Rotation-based CenterNet2 target detection method
CN115359366A (en) Remote sensing image target detection method based on parameter optimization
CN113781519A (en) Target tracking method and target tracking device
CN113469099A (en) Training method, detection method, device, equipment and medium of target detection model
CN113673505A (en) Example segmentation model training method, device and system and storage medium
Sun et al. Automatic building age prediction from street view images
CN112926487B (en) Pedestrian re-identification method and device
CN115331162A (en) Cross-scale infrared pedestrian detection method, system, medium, equipment and terminal
CN114387496A (en) Target detection method and electronic equipment
CN113569600A (en) Method and device for identifying weight of object, electronic equipment and storage medium
CN117593890B (en) Detection method and device for road spilled objects, electronic equipment and storage medium
CN110659384A (en) Video structured analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201016