CN108009466B - Pedestrian detection method and device - Google Patents

Pedestrian detection method and device

Info

Publication number
CN108009466B
Authority
CN
China
Prior art keywords
pedestrian
frames
frame
pedestrians
image
Prior art date
Legal status
Active
Application number
CN201610971349.9A
Other languages
Chinese (zh)
Other versions
CN108009466A (en)
Inventor
俞刚
彭雨翔
Current Assignee
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Kuangshi Technology Co Ltd, Beijing Megvii Technology Co Ltd filed Critical Beijing Kuangshi Technology Co Ltd
Priority to CN201610971349.9A
Publication of CN108009466A
Application granted
Publication of CN108009466B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition

Abstract

An embodiment of the invention provides a pedestrian detection method and device. The pedestrian detection method includes: acquiring an image to be processed; detecting pedestrians in the image to be processed to obtain a preliminary pedestrian detection result, wherein the preliminary result comprises one or more pedestrian frames, each indicating a region of the image to be processed in which a pedestrian may be present; performing skeleton analysis on the pedestrian contained in each of the one or more pedestrian frames to obtain skeleton information associated with each of the one or more pedestrian frames; and screening the one or more pedestrian frames according to the skeleton information associated with each of them, to obtain at least one pedestrian frame in one-to-one correspondence with at least part of the pedestrians in the image to be processed. Because the pedestrian frames are screened using the skeleton information of the pedestrians they contain, redundant pedestrian frames of the same pedestrian can be filtered out while the pedestrian frames of different pedestrians are retained.

Description

Pedestrian detection method and device
Technical Field
The invention relates to the field of computers, in particular to a pedestrian detection method and device.
Background
In the field of surveillance, pedestrian detection plays a very important role. Current pedestrian detection algorithms usually extract a number of windows of different scales from an image to be processed using a sliding-window method (each window is a rectangular frame and may also be called a pedestrian frame) and judge whether a pedestrian exists in each window. Each window may be assigned a score representing the probability that a pedestrian is present in it. The sliding-window method tends to produce multiple high-scoring windows for the same pedestrian, so its output is usually post-processed with non-maximum suppression (NMS) to filter the redundant windows corresponding to one pedestrian. Those skilled in the art will appreciate that NMS is based primarily on the intersection-over-union (IoU) of two windows: a high-scoring window is used to suppress other windows that overlap with it significantly. In pedestrian-dense scenes (e.g., when multiple people stand close together), filtering windows with NMS may also suppress windows that contain pedestrians different from the one in the high-scoring window, which can greatly reduce the recall of the pedestrian detection algorithm.
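For reference, the conventional NMS procedure described above can be sketched as follows; the box coordinates, scores, and the 0.5 IoU threshold in the example are illustrative assumptions, not values from the patent:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Standard NMS: repeatedly keep the highest-scoring remaining box
    and discard every other box whose IoU with it exceeds the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

Note that a window overlapping the top-scoring window heavily is always discarded, even if it actually frames a different, nearby pedestrian — exactly the dense-scene failure mode the invention addresses.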
Disclosure of Invention
The present invention has been made in view of the above problems. The invention provides a pedestrian detection method and device.
According to an aspect of the present invention, there is provided a pedestrian detection method. The method comprises the following steps: acquiring an image to be processed; detecting pedestrians in the image to be processed to obtain a pedestrian detection preliminary result, wherein the pedestrian detection preliminary result comprises one or more pedestrian frames, and each pedestrian frame is used for indicating an area in which a pedestrian is likely to exist in the image to be processed; performing skeleton analysis on the pedestrians contained in each of the one or more pedestrian frames to obtain skeleton information respectively related to each of the one or more pedestrian frames; and screening the one or more pedestrian frames according to the skeleton information respectively related to each of the one or more pedestrian frames to obtain at least one pedestrian frame corresponding to at least part of pedestrians in the image to be processed one by one.
Illustratively, after the screening the one or more pedestrian frames according to the skeleton information respectively related to each of the one or more pedestrian frames, the pedestrian detection method further comprises: and for each of the at least one pedestrian frame, judging whether the pedestrian contained in the pedestrian frame is a real pedestrian according to the skeleton information related to the pedestrian frame, and if not, filtering the pedestrian frame.
For example, the skeleton information related to any one of the one or more pedestrian frames includes keypoint confidences in one-to-one correspondence with a specific number of keypoints of the pedestrian contained in that pedestrian frame, and determining, for each of the at least one pedestrian frame, whether the pedestrian contained in the pedestrian frame is a real pedestrian according to the skeleton information related to the pedestrian frame includes: for each of the at least one pedestrian frame, summing or averaging the keypoint confidences in one-to-one correspondence with the specific number of keypoints of the pedestrian contained in the pedestrian frame, to obtain a first total confidence; and for each of the at least one pedestrian frame, comparing the first total confidence with a corresponding confidence threshold, determining that the pedestrian contained in the pedestrian frame is a real pedestrian if the first total confidence is greater than the corresponding confidence threshold, and determining that it is not a real pedestrian otherwise.
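The first-total-confidence check above reduces to a few lines. A minimal sketch, using averaging (the text also permits summation); `is_real_pedestrian` is a hypothetical helper name, and the threshold value in the usage example is an assumption:

```python
def is_real_pedestrian(keypoint_confidences, confidence_threshold):
    """Average the per-keypoint confidences of one pedestrian frame to get
    the first total confidence, then compare it against the threshold."""
    first_total = sum(keypoint_confidences) / len(keypoint_confidences)
    return first_total > confidence_threshold
```

For example, a frame whose keypoints are detected with confidences `[0.9, 0.8, 0.85, 0.95]` passes a 0.5 threshold, while one with `[0.1, 0.05, 0.2, 0.1]` is filtered out as not containing a real pedestrian.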
Illustratively, the screening the one or more pedestrian frames according to the skeleton information respectively related to each of the one or more pedestrian frames to obtain at least one pedestrian frame in one-to-one correspondence with at least part of the pedestrians in the image to be processed comprises: in a case where the preliminary pedestrian detection result includes a plurality of pedestrian frames, determining, for any two of the plurality of pedestrian frames, whether the pedestrians contained in the two pedestrian frames are the same pedestrian according to the skeleton information respectively associated with each of the two pedestrian frames; and for two or more pedestrian frames containing the same pedestrian, selecting a unique pedestrian frame from the two or more pedestrian frames as one of the at least one pedestrian frame.
For example, the skeleton information related to any one of the one or more pedestrian frames includes a key point feature map in one-to-one correspondence with a specific number of key points of pedestrians included in the pedestrian frame, and the determining, for any two of the multiple pedestrian frames, whether the pedestrians included in the two pedestrian frames are the same pedestrian according to the skeleton information respectively related to each of the two pedestrian frames includes: for any two pedestrian frames in the plurality of pedestrian frames, calculating the similarity between the skeletons of the pedestrians contained in the two pedestrian frames by using the key point feature maps which are in one-to-one correspondence with the specific number of key points of the pedestrians contained in each of the two pedestrian frames; and comparing the calculated similarity with a similarity threshold value for any two pedestrian frames in the plurality of pedestrian frames, if the calculated similarity is greater than the similarity threshold value, determining that the pedestrians contained in the two pedestrian frames are the same pedestrian, otherwise determining that the pedestrians contained in the two pedestrian frames are not the same pedestrian.
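The summary does not pin down the similarity measure computed between two frames' keypoint feature maps; as one illustrative choice (an assumption, not the patent's prescribed metric), cosine similarity between corresponding flattened feature maps, averaged over the specific number of keypoints, could look like this:

```python
import math

def skeleton_similarity(maps_a, maps_b):
    """maps_a / maps_b: one flattened keypoint feature map per keypoint,
    in the same keypoint order for both pedestrian frames. Returns the
    cosine similarity averaged over all keypoints."""
    sims = []
    for fa, fb in zip(maps_a, maps_b):
        dot = sum(x * y for x, y in zip(fa, fb))
        norm_a = math.sqrt(sum(x * x for x in fa))
        norm_b = math.sqrt(sum(x * x for x in fb))
        sims.append(dot / (norm_a * norm_b))
    return sum(sims) / len(sims)

def same_pedestrian(maps_a, maps_b, similarity_threshold=0.8):
    """Two frames contain the same pedestrian when the skeleton
    similarity exceeds the threshold (threshold value is illustrative)."""
    return skeleton_similarity(maps_a, maps_b) > similarity_threshold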
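```

Identical feature maps yield a similarity of 1 (same pedestrian); feature maps with no overlapping response yield 0 (different pedestrians).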
Illustratively, the preliminary pedestrian detection result further includes pedestrian frame confidences in one-to-one correspondence with the one or more pedestrian frames, the skeleton information related to any one of the one or more pedestrian frames includes keypoint confidences in one-to-one correspondence with the specific number of keypoints of the pedestrian contained in the pedestrian frame, and, for two or more pedestrian frames containing the same pedestrian, selecting a unique pedestrian frame from the two or more pedestrian frames as one of the at least one pedestrian frame includes: selecting the unique pedestrian frame from the two or more pedestrian frames according to the pedestrian frame confidences in one-to-one correspondence with the two or more pedestrian frames and the keypoint confidences in one-to-one correspondence with the specific number of keypoints of the pedestrian contained in each of the two or more pedestrian frames.
Illustratively, for two or more pedestrian frames containing the same pedestrian, the selecting the unique pedestrian frame from the two or more pedestrian frames according to the pedestrian frame confidence levels corresponding to the two or more pedestrian frames one-to-one and the keypoint confidence levels corresponding to the specific number of keypoints of the pedestrian contained in each of the two or more pedestrian frames comprises: for two or more pedestrian frames containing the same pedestrian, for each of the two or more pedestrian frames, performing summation or averaging on the confidence coefficient of the pedestrian frame corresponding to the pedestrian frame and the confidence coefficients of the key points, corresponding to the specific number of the key points, of the pedestrian contained in the pedestrian frame to obtain a second total confidence coefficient; and for two or more pedestrian frames containing the same pedestrian, selecting the pedestrian frame with the highest second total confidence as the unique pedestrian frame.
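The second-total-confidence selection above can be sketched minimally as follows, with summation chosen over averaging; the data layout (a list of `(frame_confidence, keypoint_confidences)` tuples) and the helper name are illustrative assumptions:

```python
def select_unique_frame(frames):
    """frames: list of (frame_confidence, keypoint_confidences) tuples for
    two or more pedestrian frames judged to contain the same pedestrian.
    The second total confidence sums the pedestrian frame confidence with
    its keypoint confidences; the index of the frame with the highest
    second total confidence is returned as the unique frame to keep."""
    def second_total(frame):
        frame_conf, keypoint_confs = frame
        return frame_conf + sum(keypoint_confs)
    return max(range(len(frames)), key=lambda i: second_total(frames[i]))
```

Note that a frame with a slightly lower detection score can still win if its skeleton keypoints are detected with high confidence, which is the point of combining the two kinds of confidence.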
Illustratively, the detecting the pedestrian in the image to be processed to obtain a pedestrian detection preliminary result comprises: inputting the image to be processed into a first convolution neural network so as to extract the characteristics of the image to be processed; and inputting the characteristics of the image to be processed into a second convolutional neural network to obtain the preliminary result of pedestrian detection.
Illustratively, the skeleton analyzing the pedestrians contained in each of the one or more pedestrian frames to obtain skeleton information respectively related to each of the one or more pedestrian frames comprises: inputting the features of the image to be processed and the preliminary result of pedestrian detection into a full convolution network to obtain the skeleton information respectively associated with each of the one or more pedestrian frames.
Illustratively, the pedestrian detection method further includes: acquiring a training image, wherein a target pedestrian frame corresponding to each pedestrian in the training image and target positions of a specific number of key points of each pedestrian are marked; constructing a first loss function by taking target pedestrian frames corresponding to pedestrians in the training image in a one-to-one mode as target values of one or more pedestrian frames in a pedestrian detection preliminary result obtained by the second convolutional neural network aiming at the training image, and constructing a second loss function by taking target positions of the pedestrians in the training image and the key points of the specific number as target values of skeleton information obtained by the full convolutional network aiming at the training image; and training parameters in the first convolutional neural network, the second convolutional neural network, and the full convolutional network using the first loss function and the second loss function.
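The joint training described above builds one loss over the pedestrian frames and another over the keypoint target positions, then optimizes all three networks with both. In the sketch below, squared error stands in for the loss forms, which the summary does not specify, and equal weighting of the two losses is likewise an assumption:

```python
def combined_loss(pred_frames, target_frames, pred_keypoints, target_keypoints):
    """First loss: predicted pedestrian frames vs. annotated target frames.
    Second loss: predicted keypoint positions vs. annotated target
    positions. Squared error and unit weights are illustrative choices."""
    first_loss = sum((p - t) ** 2
                     for pf, tf in zip(pred_frames, target_frames)
                     for p, t in zip(pf, tf))
    second_loss = sum((p - t) ** 2
                      for pk, tk in zip(pred_keypoints, target_keypoints)
                      for p, t in zip(pk, tk))
    return first_loss + second_loss
```

Gradients of this combined objective would flow back through the full convolutional network, the second convolutional neural network, and the shared first convolutional neural network, so the feature extractor is trained for both tasks at once.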
According to another aspect of the present invention, a pedestrian detection apparatus is provided. The device includes: the image to be processed acquisition module is used for acquiring an image to be processed; the detection module is used for detecting pedestrians in the image to be processed to obtain a pedestrian detection preliminary result, wherein the pedestrian detection preliminary result comprises one or more pedestrian frames, and each pedestrian frame is used for indicating an area where pedestrians are likely to exist in the image to be processed; the skeleton analysis module is used for carrying out skeleton analysis on the pedestrians contained in each of the one or more pedestrian frames so as to obtain skeleton information respectively related to each of the one or more pedestrian frames; and the screening module is used for screening the one or more pedestrian frames according to the skeleton information respectively related to each of the one or more pedestrian frames so as to obtain at least one pedestrian frame in one-to-one correspondence with at least part of pedestrians in the image to be processed.
Exemplarily, the pedestrian detection device further includes: and the real pedestrian judgment module is used for judging whether the pedestrian contained in the pedestrian frame is a real pedestrian or not according to the skeleton information related to the pedestrian frame for each of the at least one pedestrian frame, and filtering the pedestrian frame if the pedestrian contained in the pedestrian frame is not the real pedestrian.
Illustratively, the skeleton information related to any one of the one or more pedestrian frames includes a keypoint confidence degree in one-to-one correspondence with a specific number of keypoints of the pedestrians included in the pedestrian frame, and the real pedestrian determination module includes: a first total confidence obtaining submodule, configured to sum or average, for each of the at least one pedestrian frame, the confidence levels of the keypoints that correspond to the pedestrians included in the pedestrian frame in the specific number one to one, so as to obtain a first total confidence level; and the confidence degree comparison submodule is used for comparing the first total confidence degree with a corresponding confidence degree threshold value for each of the at least one pedestrian frame, if the first total confidence degree is greater than the corresponding confidence degree threshold value, the pedestrian contained in the pedestrian frame is determined to be a real pedestrian, otherwise, the pedestrian contained in the pedestrian frame is determined not to be a real pedestrian.
Illustratively, the screening module includes: the same pedestrian determination submodule is used for determining whether pedestrians contained in any two pedestrian frames in the pedestrian frames are the same pedestrian or not according to skeleton information respectively relevant to each of the two pedestrian frames under the condition that the preliminary pedestrian detection result comprises the multiple pedestrian frames; and a pedestrian frame selection sub-module for selecting, for two or more pedestrian frames containing the same pedestrian, a unique pedestrian frame from among the two or more pedestrian frames as one of the at least one pedestrian frame, in a case where the preliminary result of pedestrian detection includes a plurality of pedestrian frames.
Illustratively, the skeleton information related to any one of the one or more pedestrian frames includes a key point feature map in one-to-one correspondence with a specific number of key points of pedestrians included in the pedestrian frame, and the same pedestrian determination submodule includes: a similarity calculation unit configured to calculate, for any two of the plurality of pedestrian frames, a similarity between skeletons of pedestrians included in the two pedestrian frames using a keypoint feature map that is one-to-one corresponding to the specific number of keypoints of the pedestrians included in each of the two pedestrian frames; and the similarity comparison unit is used for comparing the calculated similarity with a similarity threshold value for any two pedestrian frames in the plurality of pedestrian frames, if the calculated similarity is greater than the similarity threshold value, the pedestrians contained in the two pedestrian frames are determined to be the same pedestrian, otherwise, the pedestrians contained in the two pedestrian frames are determined not to be the same pedestrian.
Illustratively, the preliminary pedestrian detection result further includes pedestrian frame confidences in one-to-one correspondence with the one or more pedestrian frames, the skeleton information related to any one of the one or more pedestrian frames includes keypoint confidences in one-to-one correspondence with a specific number of keypoints of the pedestrian contained in the pedestrian frame, and the pedestrian frame selection submodule includes: a pedestrian frame selection unit configured to, for two or more pedestrian frames containing the same pedestrian, select the unique pedestrian frame from the two or more pedestrian frames according to the pedestrian frame confidences in one-to-one correspondence with the two or more pedestrian frames and the keypoint confidences in one-to-one correspondence with the specific number of keypoints of the pedestrian contained in each of the two or more pedestrian frames.
Illustratively, the pedestrian frame selection unit includes: a second total confidence obtaining subunit, configured to, for each of two or more pedestrian frames that include the same pedestrian, sum or average a pedestrian frame confidence corresponding to the pedestrian frame and a keypoint confidence corresponding to the pedestrian included in the pedestrian frame and corresponding to the specific number of keypoints one by one, to obtain a second total confidence; and a pedestrian frame selection subunit, configured to select, as the unique pedestrian frame, the pedestrian frame with the highest second total confidence, for two or more pedestrian frames that include the same pedestrian.
Illustratively, the detection module includes: the first input submodule is used for inputting the image to be processed into a first convolution neural network so as to extract the characteristics of the image to be processed; and the second input submodule is used for inputting the characteristics of the image to be processed into a second convolutional neural network so as to obtain the pedestrian detection preliminary result.
Illustratively, the skeletal analysis module comprises: and the third input submodule is used for inputting the characteristics of the image to be processed and the preliminary pedestrian detection result into a full convolution network so as to obtain the skeleton information respectively related to each of the one or more pedestrian frames.
Exemplarily, the pedestrian detection device further includes: the training image acquisition module is used for acquiring a training image, wherein a target pedestrian frame corresponding to each pedestrian in the training image and target positions of a specific number of key points of each pedestrian are marked; a loss function constructing module, configured to construct a first loss function by using a target pedestrian frame corresponding to a pedestrian in the training image in a one-to-one manner as a target value of one or more pedestrian frames in a pedestrian detection preliminary result obtained by the second convolutional neural network for the training image, and construct a second loss function by using a target position of the pedestrian in the training image and the key points of the specific number as a target value of skeleton information obtained by the full convolutional network for the training image; and a training module for training parameters in the first convolutional neural network, the second convolutional neural network and the full convolutional network by using the first loss function and the second loss function.
According to the pedestrian detection method and device provided by the embodiment of the invention, the pedestrian frames are screened by using the skeleton information of the pedestrians contained in the pedestrian frames, so that the aims of filtering redundant pedestrian frames of the same pedestrian and retaining pedestrian frames of different pedestrians can be achieved.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in more detail embodiments of the present invention with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 shows a schematic block diagram of an example electronic device for implementing a pedestrian detection method and apparatus in accordance with embodiments of the invention;
FIG. 2 shows a schematic flow diagram of a pedestrian detection method according to one embodiment of the invention;
FIG. 3 shows a schematic diagram of a data processing flow of a pedestrian detection method according to one embodiment of the invention;
FIG. 4 shows a schematic flow diagram of a pedestrian detection method according to another embodiment of the invention;
FIG. 5 shows a schematic block diagram of a pedestrian detection arrangement, according to one embodiment of the present invention; and
FIG. 6 shows a schematic block diagram of a pedestrian detection system in accordance with one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, exemplary embodiments according to the present invention will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the invention, not all of them, and that the invention is not limited to the example embodiments described herein. All other embodiments obtainable by a person skilled in the art from the embodiments of the invention described herein without inventive effort shall fall within the scope of protection of the invention.
In order to solve the above-mentioned problems, embodiments of the present invention provide a pedestrian detection method and apparatus, which utilize skeleton information of pedestrians contained in a pedestrian frame (i.e., a window) to screen the pedestrian frame, so as to filter redundant pedestrian frames of the same pedestrian and retain pedestrian frames of different pedestrians. The pedestrian detection method provided by the embodiment of the invention can solve the pedestrian detection problem in a crowded (dense crowd) scene to a great extent, and therefore, the pedestrian detection method can be well applied to the monitoring field.
First, an example electronic device 100 for implementing a pedestrian detection method and apparatus according to an embodiment of the present invention is described with reference to fig. 1.
As shown in FIG. 1, electronic device 100 includes one or more processors 102, one or more memory devices 104, an input device 106, an output device 108, and an image capture device 110, which are interconnected via a bus system 112 and/or other form of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 100 shown in fig. 1 are exemplary only, and not limiting, and the electronic device may have other components and structures as desired.
The processor 102 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 100 to perform desired functions.
The storage 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 102 to implement client-side functionality and/or other desired functionality in the embodiments of the invention described below. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 108 may output various information (e.g., images and/or sounds) to an external (e.g., user), and may include one or more of a display, a speaker, etc.
The image capture device 110 may capture images (including video frames) and store the captured images in the storage device 104 for use by other components. The image capture device 110 may be a surveillance camera. It should be understood that the image capture device 110 is merely an example, and the electronic device 100 may not include the image capture device 110. In this case, an image for pedestrian detection may be captured using another image capturing device and the captured image may be transmitted to the electronic apparatus 100.
Illustratively, an example electronic device for implementing the pedestrian detection method and apparatus in accordance with embodiments of the present invention may be implemented on a device such as a personal computer or a remote server.
Next, a pedestrian detection method according to an embodiment of the invention will be described with reference to fig. 2. FIG. 2 shows a schematic flow diagram of a pedestrian detection method 200 according to one embodiment of the invention. As shown in fig. 2, the pedestrian detection method 200 includes the following steps.
In step S210, an image to be processed is acquired.
The image to be processed may be any suitable image that requires pedestrian detection, such as an image captured for a monitored area. The image to be processed may be an original image acquired by an image acquisition device such as a camera, or may be an image obtained after preprocessing the original image.
The image to be processed may be sent to the electronic device 100 by a client device (such as a security device including a monitoring camera) to be processed by the processor 102 of the electronic device 100, or may be collected by an image collecting device 110 (e.g., a camera) included in the electronic device 100 and transmitted to the processor 102 for processing.
In step S220, a pedestrian in the image to be processed is detected to obtain a pedestrian detection preliminary result, wherein the pedestrian detection preliminary result includes one or more pedestrian frames, and each pedestrian frame is used for indicating an area in the image to be processed where the pedestrian may exist.
Step S220 may be implemented using any conventional pedestrian detection algorithm, such as one combining HOG (histogram of oriented gradients) features with an SVM (support vector machine) classifier. Running the pedestrian detection algorithm on the image to be processed yields a preliminary pedestrian detection result, which may include several pedestrian frames. A pedestrian frame is a rectangular frame indicating a region of the image to be processed in which a pedestrian may be present. In addition, the preliminary pedestrian detection result may further include a pedestrian frame confidence corresponding to each pedestrian frame, indicating the probability that a pedestrian exists in that frame.
Note that the pedestrian frames obtained in step S220 have not been processed with NMS. That is, among the pedestrian frames obtained in step S220, the same pedestrian may correspond to a plurality of different pedestrian frames. In addition, the image to be processed may contain two pedestrians standing close to each other, in which case the pedestrian frames of the two pedestrians may overlap substantially.
In one example, after detecting pedestrians in the image to be processed using an existing (or future) pedestrian detection algorithm, all of the resulting pedestrian frames may be taken directly as the pedestrian frames of the preliminary pedestrian detection result for subsequent skeleton analysis. In another example, only a subset of the resulting pedestrian frames may be selected as the pedestrian frames of the preliminary result. For example, pedestrian frames whose pedestrian frame confidence is greater than a preset threshold may be kept for subsequent skeleton analysis, while frames whose confidence is not greater than the preset threshold are discarded.
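The confidence-based pre-filtering in the second example above reduces to a one-line filter; the threshold value and the helper name are illustrative assumptions:

```python
def prefilter_frames(frames, frame_confidences, preset_threshold=0.3):
    """Keep only pedestrian frames whose pedestrian frame confidence is
    greater than the preset threshold; the rest are discarded before
    skeleton analysis."""
    return [frame for frame, conf in zip(frames, frame_confidences)
            if conf > preset_threshold]
```

This step only trims obviously weak detections; the skeleton-based screening of step S240 still handles redundant frames among those that survive.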
In step S230, skeleton analysis is performed on the pedestrians contained in each of the one or more pedestrian frames to obtain skeleton information respectively associated with each of the one or more pedestrian frames.
The skeleton of the pedestrian can be represented by some key points (or joint points) on the pedestrian, and the key points can be, for example, the head, the neck, the left shoulder, the right shoulder, the left hand, the right hand, the left foot, the right foot, and the like. The location of the keypoints representing the skeleton of the pedestrian on the pedestrian body and the number of keypoints (i.e., the specific number described herein) can be set as desired, and the invention is not limited thereto. In one example, the skeletal information described herein may include the locations of a particular number of keypoints of a pedestrian. In another example, the skeletal information described herein may include a keypoint feature map that corresponds one-to-one to a particular number of keypoints for pedestrians. The location of each keypoint may be determined from the keypoint signature corresponding thereto. The key point feature map will be described in detail below, and will not be described herein.
The skeleton analysis algorithm is a pedestrian posture estimation algorithm, and the positions of certain key points of the pedestrian can be determined by using the algorithm, so that the skeleton information of the pedestrian is obtained.
In step S240, one or more pedestrian frames are filtered according to the skeleton information respectively associated with each of the one or more pedestrian frames, so as to obtain at least one pedestrian frame corresponding to at least a part of pedestrians in the image to be processed.
It is understood that if only one pedestrian frame is obtained in step S220, that pedestrian frame may be directly retained in step S240, i.e., only one pedestrian in the image to be processed is detected. If more than one pedestrian frame is obtained in step S220, the obtained pedestrian frames may be screened. In the latter case, it may first be determined whether different pedestrian frames contain the same pedestrian, and two cases may then be handled according to the determination result. First, where more than one pedestrian frame is obtained in step S220 and the pedestrians contained in the obtained pedestrian frames are all different from one another, all the pedestrian frames obtained in step S220 may be retained. Second, where more than one pedestrian frame is obtained in step S220 and at least two of the obtained pedestrian frames contain the same pedestrian, the plurality of pedestrian frames containing the same pedestrian may be filtered so that only one pedestrian frame is reserved for each pedestrian, finally yielding at least one pedestrian frame in one-to-one correspondence with at least part of the pedestrians in the image to be processed.
It can be understood that if the pedestrians contained in two pedestrian frames are the same pedestrian, the skeleton information related to the two pedestrian frames should be consistent, that is, the positions of the key points of the pedestrians contained in the two pedestrian frames are substantially the same. Therefore, the pedestrian frames can be screened according to the skeleton information related to them: one pedestrian frame is selected from the plurality of pedestrian frames containing the same pedestrian, and the remaining pedestrian frames containing that pedestrian are discarded. Through this operation, only one pedestrian frame is reserved for each pedestrian, so that at least one pedestrian frame in one-to-one correspondence with at least one pedestrian in the image to be processed is obtained. In contrast, if the pedestrians contained in two pedestrian frames are not the same pedestrian, the skeleton information related to the two pedestrian frames differs, and even if the two pedestrian frames have a large overlapping area, neither pedestrian frame will filter out the other. Thus, screening pedestrian frames using skeleton information rather than NMS avoids erroneously filtering out pedestrian frames containing different pedestrians.
For example, at least one pedestrian frame obtained in step S240 may be regarded as a final result of pedestrian detection. Of course, some subsequent processing may be performed on at least one pedestrian frame obtained in step S240, for example, as described below, it may be determined whether a pedestrian included in the pedestrian frame is a real pedestrian and the pedestrian frame may be filtered according to the determination result. Subsequently, the pedestrian frame obtained through the subsequent processing is taken as a final result of the pedestrian detection.
According to the pedestrian detection method provided by the embodiment of the invention, because the pedestrian frames are screened using the skeleton information of the pedestrians contained in them, redundant pedestrian frames of the same pedestrian can be filtered out while pedestrian frames of different pedestrians are retained. The method thereby avoids the problem, introduced by NMS, that the pedestrian frame of one pedestrian filters out the pedestrian frames of other pedestrians, so the accuracy of the pedestrian detection result can be improved. This is of significant value for pedestrian monitoring, particularly in pedestrian-dense scenes.
Illustratively, the pedestrian detection method according to the embodiment of the invention may be implemented in a device, apparatus or system having a memory and a processor.
The pedestrian detection method can be deployed at an image acquisition end, for example, the pedestrian detection method can be deployed at the image acquisition end of a community access control system or the image acquisition end of a security monitoring system in public places such as stations, shopping malls, banks and the like. Alternatively, the pedestrian detection method according to the embodiment of the present invention may also be distributively deployed at the server side (or cloud side) and the client side. For example, an image may be collected at a client, and the client transmits the collected image to a server (or a cloud), so that the server (or the cloud) performs pedestrian detection.
According to an embodiment of the present invention, step S220 may include: inputting an image to be processed into a first convolution neural network so as to extract the characteristics of the image to be processed; and inputting the characteristics of the image to be processed into a second convolutional neural network to obtain a pedestrian detection preliminary result.
The first convolutional neural network (CNN) and the second convolutional neural network may be pre-trained with a large number of training images.
Referring to fig. 3, a schematic diagram illustrating a data processing flow of a pedestrian detection method according to an embodiment of the present invention is shown. As shown in fig. 3, after the image to be processed is acquired, it is input into the first convolutional neural network for feature extraction. The image to be processed may be a static image or any video frame in a video. At the output of the first convolutional neural network, a plurality of feature maps may be obtained. The feature maps output by the first convolutional neural network are the features of the image to be processed. Illustratively, the first convolutional neural network may be implemented using a VGG model or a residual network (ResNet) model obtained by pre-training on the ImageNet data set. Using the first convolutional neural network, valuable information in the image to be processed can be extracted, and window prediction can then be performed based on this information, as described below.
All feature maps output by the first convolutional neural network can be input into the second convolutional neural network for processing. The second convolutional neural network is a classifier that can implement the sliding window method described above. That is, the processing of the second convolutional neural network may be understood as extracting windows of various scales from the image to be processed and determining the probability of a pedestrian existing in each window (i.e., the pedestrian frame confidence). For example, windows with a probability greater than a preset threshold may be selected; the selected windows are the one or more pedestrian frames in the preliminary result of pedestrian detection described herein. As described above, the preliminary result may include one or more pedestrian frames and pedestrian frame confidences in one-to-one correspondence with the one or more pedestrian frames.
A convolutional neural network is capable of learning autonomously from data, and the pedestrians in the image to be processed can be detected accurately and efficiently using the first convolutional neural network and the second convolutional neural network.
According to the embodiment of the present invention, step S230 may include: features of the image to be processed and the preliminary pedestrian detection result are input into a full convolution network to obtain skeletal information respectively associated with each of the one or more pedestrian frames.
The fully convolutional network (FCN) described herein may be similar to a fully convolutional network used for semantic segmentation. With continued reference to fig. 3, the features of the image to be processed output by the first convolutional neural network and the preliminary pedestrian detection result output by the second convolutional neural network may be input to the full convolution network for skeleton analysis. The pedestrian frames in the preliminary result may be input into the full convolution network one by one. After the features of the image to be processed and a certain pedestrian frame are input into the full convolution network, skeleton information related to that pedestrian frame is obtained at its output. For example, assuming that the skeleton of a pedestrian is represented by 15 key points (including head, neck, left hand, left foot, etc.), for each pedestrian frame, 15 keypoint feature maps (including a head feature map, a neck feature map, a left-hand feature map, a left-foot feature map, etc.) can be obtained at the output of the full convolution network. Each keypoint feature map represents the location of the corresponding key point. Illustratively, each keypoint feature map has the same size as the image to be processed, and the pixel value of each pixel on a keypoint feature map represents the probability that the corresponding key point is located at the same position in the image to be processed.
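As a sketch of how a keypoint position might be read off a keypoint feature map of the kind described above, consider the following toy NumPy heatmap; the array size and peak location are invented for illustration, and a real feature map would come from the full convolution network.

```python
import numpy as np

# Toy keypoint feature map (heatmap) with the same size as the image; each
# pixel holds the probability that the key point lies at that location.
heatmap = np.zeros((8, 8), dtype=float)
heatmap[3, 5] = 0.9  # peak: most likely keypoint position

def keypoint_from_feature_map(fmap):
    """Return (row, col, confidence) at the most probable keypoint location."""
    idx = np.unravel_index(np.argmax(fmap), fmap.shape)
    return idx[0], idx[1], float(fmap[idx])

row, col, conf = keypoint_from_feature_map(heatmap)  # → (3, 5, 0.9)
```

The peak value itself can serve as a per-keypoint confidence, which is how the keypoint confidences discussed later in the text could be obtained.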
Similar to the first convolutional neural network and the second convolutional neural network, the full convolutional network may be trained in advance with a large number of training images. The training modes of the first convolutional neural network, the second convolutional neural network and the full convolutional network will be described below, and are not described herein again.
According to the embodiment of the present invention, step S240 may include: in a case where the preliminary pedestrian detection result includes multiple pedestrian frames, determining, for any two of the multiple pedestrian frames, whether the pedestrians contained in the two pedestrian frames are the same pedestrian according to the skeleton information respectively related to each of the two pedestrian frames; and for two or more pedestrian frames containing the same pedestrian, selecting a unique pedestrian frame from the two or more pedestrian frames as one of the at least one pedestrian frame.
In step S240, it may be determined whether two pedestrians respectively included in the two pedestrian frames are the same pedestrian according to the skeleton information related to the two pedestrian frames. For example, the similarity between the skeletons of two pedestrians contained in two pedestrian frames can be determined by using the keypoint feature maps respectively related to the two pedestrian frames, two pedestrians with sufficient similarity are regarded as the same pedestrian, and two pedestrians with insufficient similarity are regarded as different pedestrians.
If two or more pedestrian frames containing the same pedestrian exist among the pedestrian frames obtained in step S220, a unique pedestrian frame is selected from the pedestrian frames as the pedestrian frame corresponding to the pedestrian, and the remaining pedestrian frames containing the pedestrian can be discarded.
In this way, a plurality of pedestrian frames including the same pedestrian may not exist in the pedestrian frames obtained by screening.
According to an embodiment of the present invention, the skeleton information associated with any one of the one or more pedestrian frames includes a key point feature map in one-to-one correspondence with a specific number of key points of pedestrians included in the pedestrian frame, and for any two of the plurality of pedestrian frames, determining whether the pedestrians included in the two pedestrian frames are the same pedestrian according to the skeleton information associated with each of the two pedestrian frames, respectively, includes: for any two pedestrian frames in the multiple pedestrian frames, calculating the similarity between skeletons of pedestrians contained in the two pedestrian frames by using a key point feature map which is in one-to-one correspondence with the specific number of key points of the pedestrians contained in each of the two pedestrian frames; and comparing the calculated similarity with a similarity threshold value for any two pedestrian frames in the plurality of pedestrian frames, if the calculated similarity is greater than the similarity threshold value, determining that the pedestrians contained in the two pedestrian frames are the same pedestrian, otherwise determining that the pedestrians contained in the two pedestrian frames are not the same pedestrian.
The number of key points, i.e. the specific number, used to represent the skeleton of the pedestrian may be any suitable number, e.g. 5, 10, 15, etc. For example, assuming that the skeleton of a pedestrian is represented by 15 key points, each pedestrian frame corresponds to 15 key point feature maps, which are respectively used to represent the positions of the corresponding key points.
In one example, for each of the specific number of key points, the distance between the positions of that key point for the two pedestrians can be calculated using the keypoint feature maps corresponding to that key point for the pedestrians respectively contained in the two pedestrian frames. For example, the distance between the head of pedestrian a and the head of pedestrian b may be calculated using the head feature map of pedestrian a contained in pedestrian frame A and the head feature map of pedestrian b contained in pedestrian frame B, the distance between the left hand of pedestrian a and the left hand of pedestrian b may be calculated using the corresponding left-hand feature maps, and so on, finally obtaining the specific number (e.g., 15) of distances. These distances reflect the difference between the skeleton of pedestrian a and the skeleton of pedestrian b, and the similarity between the two can therefore be calculated based on these distances. The calculated similarity is then compared with a similarity threshold: if the similarity is greater than the threshold, pedestrian a and pedestrian b are determined to be the same pedestrian; otherwise, they are determined not to be the same pedestrian. The similarity threshold may be any suitable value, which may be set as desired.
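The distance-and-threshold comparison above might look like the following sketch. The keypoint coordinates, the mapping from mean distance to a similarity score, and the threshold are all illustrative assumptions, since the patent does not fix a particular similarity formula.

```python
import math

# Hypothetical (x, y) keypoint positions for the pedestrians in two frames,
# already extracted from their keypoint feature maps; 5 key points for brevity.
skeleton_a = [(10, 5), (10, 15), (4, 20), (16, 20), (10, 40)]
skeleton_b = [(10, 5), (10, 15), (4, 21), (16, 20), (10, 40)]  # near-identical
skeleton_c = [(60, 5), (60, 15), (54, 20), (66, 20), (60, 40)]  # far away

def skeleton_similarity(s1, s2, scale=10.0):
    """Map the mean per-keypoint distance into a (0, 1] similarity score."""
    dists = [math.dist(p, q) for p, q in zip(s1, s2)]
    return 1.0 / (1.0 + sum(dists) / (len(dists) * scale))

SIM_THRESHOLD = 0.7  # illustrative value

same_ab = skeleton_similarity(skeleton_a, skeleton_b) > SIM_THRESHOLD  # True
same_ac = skeleton_similarity(skeleton_a, skeleton_c) > SIM_THRESHOLD  # False
```

Frames whose skeletons compare as "same" would then be grouped, and one frame per group retained as described in step S240.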
In another example, differences between other skeleton features of the pedestrians may be calculated from the keypoint feature maps. For example, a head-neck connecting line from the head to the neck of a pedestrian can be calculated from the head feature map and the neck feature map of the pedestrian contained in each pedestrian frame, a neck-back connecting line from the neck to the center of the back can be calculated from the neck feature map and the waist feature map, and so on. These connecting lines between parts of the pedestrian can also be regarded as skeleton features. Subsequently, the differences between the skeleton features of the two pedestrians respectively contained in the two pedestrian frames can be calculated to determine the similarity between their skeletons. For example, the distance between the head-neck connecting line of pedestrian a contained in pedestrian frame A and the head-neck connecting line of pedestrian b contained in pedestrian frame B may be calculated, the distance between the corresponding neck-back connecting lines may be calculated, and so on, finally obtaining the distances of a plurality of skeleton features. Similar to the above example, the similarity between the skeleton of pedestrian a and the skeleton of pedestrian b may be calculated based on these distances, thereby determining whether they are the same pedestrian.
Whether the pedestrians contained in the two pedestrian frames are the same pedestrian can be simply and accurately judged through the similarity between the skeletons.
According to an embodiment of the present invention, the preliminary result of pedestrian detection may further include pedestrian frame confidences in one-to-one correspondence with the one or more pedestrian frames, and the skeleton information related to any one of the one or more pedestrian frames may include keypoint confidences in one-to-one correspondence with the specific number of key points of the pedestrian contained in that pedestrian frame. In this case, for two or more pedestrian frames containing the same pedestrian, selecting a unique pedestrian frame from the two or more pedestrian frames as one of the at least one pedestrian frame may include: selecting the unique pedestrian frame according to the pedestrian frame confidences in one-to-one correspondence with the two or more pedestrian frames and the keypoint confidences in one-to-one correspondence with the specific number of key points of the pedestrian contained in each of the two or more pedestrian frames.
Illustratively, in addition to outputting the keypoint feature map, the full convolution network may also output a keypoint confidence corresponding to each keypoint for representing the probability that the keypoint is a true keypoint.
Following the example above, if the skeleton of the pedestrian is represented by 15 key points, then each pedestrian frame has 1 pedestrian frame confidence and 15 keypoint confidences. Some arithmetic operations may be performed on the 1 pedestrian frame confidence and the 15 keypoint confidences to take these confidences into account in combination. For example, for each pedestrian frame, the 1 pedestrian frame confidence and the 15 keypoint confidences may simply be added to obtain a total confidence (i.e., the second total confidence described herein). As another example, they may be arithmetically averaged to obtain the second total confidence. As yet another example, they may be weighted-averaged to obtain the second total confidence, where the weight of each confidence may be set as needed. Of course, the above calculation manners of the second total confidence are only examples and not limitations; the invention may also adopt other suitable manners to calculate the second total confidence for measuring whether a pedestrian frame can be selected as the unique pedestrian frame corresponding to the pedestrian contained therein.
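The three combination schemes just listed (sum, arithmetic mean, weighted mean) can be sketched in one helper; the confidence values below are invented for illustration.

```python
def total_confidence(frame_conf, keypoint_confs, weights=None):
    """Combine one pedestrian frame confidence with its keypoint confidences.

    With weights=None an arithmetic mean is used; passing weights gives the
    weighted-average variant instead (a plain sum would drop the division).
    """
    values = [frame_conf] + list(keypoint_confs)
    if weights is None:
        return sum(values) / len(values)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# 1 frame confidence and 15 keypoint confidences, as in the 15-keypoint example.
frame_conf = 0.9
kp_confs = [0.8] * 15
second_total = total_confidence(frame_conf, kp_confs)  # (0.9 + 15*0.8) / 16
```

Whatever scheme is chosen, the same scheme must be applied to every candidate frame so that the resulting second total confidences are comparable.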
As described above, the pedestrian frame confidence may represent a probability that a pedestrian exists in the pedestrian frame, and the keypoint confidence may represent a probability that the corresponding keypoint is a true keypoint, and thus, when selecting a unique pedestrian frame corresponding to each pedestrian, the pedestrian frame confidence and the keypoint confidence may be considered in combination, so that the reliability of the selected unique pedestrian frame is high.
According to an embodiment of the present invention, for two or more pedestrian frames containing the same pedestrian, selecting a unique pedestrian frame from the two or more pedestrian frames according to the pedestrian frame confidences and the keypoint confidences includes: for each of the two or more pedestrian frames, summing or averaging the pedestrian frame confidence corresponding to that pedestrian frame and the keypoint confidences in one-to-one correspondence with the specific number of key points of the pedestrian contained in that pedestrian frame, so as to obtain a second total confidence; and selecting, from the two or more pedestrian frames, the pedestrian frame with the highest second total confidence as the unique pedestrian frame.
The calculation manner of the second total confidence has already been described above, and is not described herein again. The selection of the unique pedestrian frame corresponding to each pedestrian is described below. For example, it is assumed that 5 pedestrian frames in the preliminary result of pedestrian detection contain the same pedestrian, which is denoted by pedestrian X, i.e., 5 pedestrian frames correspond to pedestrian X. Further, assuming that the second total confidence degrees of the 5 pedestrian frames are 0.8, 0.65, 0.9, 0.75, and 0.7, respectively, the pedestrian frame having the second total confidence degree of 0.9 is selected as the unique pedestrian frame corresponding to the pedestrian X, and the remaining 4 pedestrian frames are discarded.
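Using the numbers from the pedestrian X example above, the selection of the unique pedestrian frame reduces to an argmax over the second total confidences (sketch only):

```python
# Second total confidences of the 5 candidate frames for pedestrian X,
# taken from the example in the text.
candidates = [0.8, 0.65, 0.9, 0.75, 0.7]

def select_unique_frame(total_confidences):
    """Return the index of the frame with the highest second total confidence."""
    return max(range(len(total_confidences)), key=total_confidences.__getitem__)

best = select_unique_frame(candidates)  # index 2, i.e. the 0.9 frame
```

The remaining four frames would then be discarded, exactly as described in the example.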
The only pedestrian frame corresponding to each pedestrian is selected through the confidence coefficient, and the most accurate and reasonable pedestrian frame corresponding to each pedestrian can be obtained.
FIG. 4 shows a schematic flow diagram of a pedestrian detection method 400 according to another embodiment of the invention. In fig. 4, steps S410 to S440 correspond to steps S210 to S240 of the pedestrian detection method 200 shown in fig. 2, respectively. The embodiments of steps S410 to S440 shown in fig. 4 can be understood by referring to the above description about fig. 2, and are not repeated. According to the present embodiment, the pedestrian detection method 400 further includes step S450.
In step S450, for each of at least one pedestrian frame, it is determined whether the pedestrian included in the pedestrian frame is a real pedestrian according to the skeleton information related to the pedestrian frame, and if not, the pedestrian frame is filtered.
Some false positives in pedestrian detection can be filtered using the results of the skeleton analysis. For example, if a certain pedestrian frame is falsely reported to contain a pedestrian, the skeleton of the "pedestrian" contained in it will be unreasonable, indicating that it is not a real pedestrian. Therefore, the skeleton information can be used to judge whether the skeleton of the pedestrian contained in a pedestrian frame is reasonable, and if not, the pedestrian frame is filtered out. Filtering pedestrian frames with skeleton information in this manner can improve the accuracy of the pedestrian detection result.
According to the embodiment of the present invention, the skeleton information related to any one of the one or more pedestrian frames includes keypoint confidences in one-to-one correspondence with the specific number of key points of the pedestrian contained in that pedestrian frame, and step S450 may include: for each of the at least one pedestrian frame, summing or averaging the keypoint confidences in one-to-one correspondence with the specific number of key points of the pedestrian contained in that pedestrian frame, so as to obtain a first total confidence; and for each of the at least one pedestrian frame, comparing the first total confidence with the corresponding confidence threshold; if the first total confidence is greater than the corresponding confidence threshold, determining that the pedestrian contained in the pedestrian frame is a real pedestrian, and otherwise determining that it is not a real pedestrian.
The first overall confidence is calculated in a similar manner as the second overall confidence, except that the keypoint confidence is primarily considered in calculating the first overall confidence. For example, assuming that the skeleton of a pedestrian is represented by 15 keypoints, some arithmetic operations, such as simple addition, arithmetic average, weighted average, or the like, may be performed on the confidence degrees of the 15 keypoints related to a certain pedestrian frame to obtain a first total confidence degree of the pedestrian frame. Those skilled in the art can refer to the above calculation method of the second total confidence level to understand the calculation method of the first total confidence level, which is not described herein again.
The confidence threshold may be any suitable value, which may be determined by experimental testing or theoretical calculation. It will be appreciated that the confidence threshold may differ depending on the manner in which the first total confidence is calculated. Thus, the "corresponding confidence threshold" described herein is the confidence threshold corresponding to the calculation manner of the first total confidence, and when the first total confidence is compared with a confidence threshold, the corresponding threshold is selected according to that calculation manner. A pedestrian frame whose first total confidence is greater than the confidence threshold is regarded as containing a real pedestrian; otherwise, the pedestrian it contains is regarded as not being a real pedestrian.
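The false-positive check of step S450 can be sketched as follows; the averaging scheme, the threshold, and the confidence values are all illustrative assumptions.

```python
def is_real_pedestrian(keypoint_confs, threshold):
    """First total confidence (here: mean keypoint confidence) vs threshold."""
    first_total = sum(keypoint_confs) / len(keypoint_confs)
    return first_total > threshold

CONF_THRESHOLD = 0.5  # illustrative; must match the averaging scheme used

plausible = [0.9, 0.85, 0.8] * 5    # 15 confident key points
implausible = [0.2, 0.1, 0.15] * 5  # unreasonable skeleton, low confidences

keep = is_real_pedestrian(plausible, CONF_THRESHOLD)    # True
drop = is_real_pedestrian(implausible, CONF_THRESHOLD)  # False
```

A frame for which the check fails would be filtered out, leaving only frames that contain real pedestrians.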
It is understood that the pedestrian frames obtained in step S440 are in one-to-one correspondence with pedestrians, and therefore, after filtering in step S450, the pedestrian frames that include pedestrians that are not real pedestrians are excluded, and the remainder is at least one pedestrian frame in one-to-one correspondence with at least one real pedestrian. Of course, it is also possible that all pedestrian frames are discarded after the filtering in step S450.
Illustratively, before the step S240 (or S440), the method 200 (or 400) may further include: and for each of one or more pedestrian frames, judging whether the pedestrian contained in the pedestrian frame is a real pedestrian according to the skeleton information related to the pedestrian frame, and if not, filtering the pedestrian frame.
That is, the step of determining whether the pedestrian contained in a pedestrian frame is a real pedestrian and filtering the pedestrian frame according to the determination result may also be performed before the pedestrian frames are screened, in a manner similar to step S450, which is not described again. Compared with filtering misjudged pedestrian frames before screening, filtering them after screening involves a smaller amount of data, which avoids meaningless computation and improves pedestrian detection efficiency.
As described above, the first convolutional neural network, the second convolutional neural network, and the full convolutional network may be obtained by training in advance using a large number of training images, and exemplary training steps are described below.
According to an embodiment of the invention, the method 200 (or 400) may further comprise: acquiring a training image, wherein a target pedestrian frame corresponding to each pedestrian in the training image and target positions of a specific number of key points of each pedestrian are marked; constructing a first loss function by taking target pedestrian frames which are in one-to-one correspondence with pedestrians in the training image as target values of one or more pedestrian frames in a pedestrian detection preliminary result obtained by a second convolutional neural network aiming at the training image, and constructing a second loss function by taking target positions of a specific number of key points of the pedestrians in the training image as target values of skeleton information obtained by a full convolutional network aiming at the training image; and training parameters in the first convolutional neural network, the second convolutional neural network and the full convolutional network by using the first loss function and the second loss function.
Illustratively, in the process of training the first convolutional neural network, the second convolutional neural network and the full convolution network, pre-training may be performed on the ImageNet data set first, and fine-tuning may then be performed on a pedestrian-specific data set. This can increase the convergence speed of the networks, and some low-level network information learned from general images is also effective for pedestrian images.
A loss function (i.e., the first loss function) may be added at the output of the second convolutional neural network to help it learn valuable information. In addition, a loss function (i.e., the second loss function) can be added at the output of the full convolution network to train the skeleton analysis model. Illustratively, the second loss function may be a cross-entropy loss function. Compared with the conventional Euclidean distance loss function, the penalty term of the cross-entropy loss function is more reasonably designed; for example, the penalty can decrease as the confidence of a pedestrian frame increases, so the network can be trained better.
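The patent names cross-entropy as the second loss for the skeleton branch; a minimal pixel-wise binary cross-entropy over a keypoint feature map might look like the NumPy sketch below. The maps and shapes are invented, and a real implementation would operate on network output tensors during backpropagation.

```python
import numpy as np

def heatmap_cross_entropy(pred, target, eps=1e-7):
    """Pixel-wise binary cross-entropy between a predicted keypoint feature
    map and its target map (1 at the labeled keypoint, 0 elsewhere)."""
    pred = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(pred)
                          + (1 - target) * np.log(1 - pred)))

target = np.zeros((4, 4)); target[1, 2] = 1.0   # labeled keypoint position
good = np.full((4, 4), 0.05); good[1, 2] = 0.95  # sharp, correct prediction
bad = np.full((4, 4), 0.5)                       # uncommitted prediction

loss_good = heatmap_cross_entropy(good, target)
loss_bad = heatmap_cross_entropy(bad, target)  # should exceed loss_good
```

The sharper, correct prediction yields the lower loss, which is the gradient signal that trains the skeleton analysis model.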
In the process of training the first convolutional neural network, the second convolutional neural network and the full convolutional network, a conventional back propagation algorithm may be used for training, and those skilled in the art can understand the implementation manner of the back propagation algorithm, which is not described herein in detail.
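Joint training with the two losses can be illustrated schematically. In the toy sketch below each "network" is reduced to a single scalar weight and the backward pass is written out by hand; real training would of course use a deep learning framework's autograd, and every name and value here is purely illustrative.

```python
# Purely schematic: the shared feature network, the detection head and the
# skeleton head are each reduced to one scalar weight, and the two losses
# are summed before the (manual) backward pass.
w_shared, w_det, w_skel = 1.0, 1.0, 1.0   # "network parameters"
x, t_det, t_skel = 0.5, 0.2, 0.8          # input and the two target values
lr = 0.1
for _ in range(300):
    feat = w_shared * x                   # shared features (first network)
    det, skel = w_det * feat, w_skel * feat
    loss = (det - t_det) ** 2 + (skel - t_skel) ** 2   # loss1 + loss2
    # manual backpropagation of the summed loss
    g_det, g_skel = 2 * (det - t_det), 2 * (skel - t_skel)
    g_shared = (g_det * w_det + g_skel * w_skel) * x
    w_det -= lr * g_det * feat
    w_skel -= lr * g_skel * feat
    w_shared -= lr * g_shared
```

The point of the sketch is only that the shared parameters receive gradient contributions from both loss functions, which is how the first convolutional neural network is trained jointly with the two heads.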
According to another aspect of the present invention, a pedestrian detection apparatus is provided. Fig. 5 shows a schematic block diagram of a pedestrian detection apparatus 500 according to one embodiment of the invention.
As shown in fig. 5, the pedestrian detection apparatus 500 according to the embodiment of the present invention includes a to-be-processed image acquisition module 510, a detection module 520, a skeleton analysis module 530, and a screening module 540. The various modules may perform the various steps/functions of the pedestrian detection method described above in connection with fig. 2-4, respectively. Only the main functions of the respective components of the pedestrian detection apparatus 500 will be described below, and details that have been described above will be omitted.
The to-be-processed image obtaining module 510 is configured to obtain an image to be processed. The pending image acquisition module 510 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 104.
The detection module 520 is configured to detect a pedestrian in the image to be processed to obtain a preliminary result of pedestrian detection, where the preliminary result of pedestrian detection includes one or more pedestrian frames, and each pedestrian frame is used to indicate an area in the image to be processed where a pedestrian may exist. The detection module 520 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 104.
The skeleton analysis module 530 is configured to perform skeleton analysis on the pedestrians included in each of the one or more pedestrian frames to obtain skeleton information respectively related to each of the one or more pedestrian frames. Skeletal analysis module 530 may be implemented by processor 102 in the electronic device shown in fig. 1 executing program instructions stored in storage 104.
The screening module 540 is configured to screen the one or more pedestrian frames according to the skeleton information respectively associated with each of the one or more pedestrian frames, so as to obtain at least one pedestrian frame corresponding to at least some pedestrians in the image to be processed one to one. The screening module 540 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 104.
According to an embodiment of the present invention, the pedestrian detection apparatus 500 further includes: and a real pedestrian judging module (not shown) for judging, for each of the at least one pedestrian frame, whether a pedestrian contained in the pedestrian frame is a real pedestrian according to the skeleton information related to the pedestrian frame, and if not, filtering the pedestrian frame.
According to an embodiment of the present invention, the skeleton information related to any one of the one or more pedestrian frames includes keypoint confidences in one-to-one correspondence with a specific number of keypoints of the pedestrian contained in the pedestrian frame, and the real pedestrian determination module includes: a first total confidence obtaining submodule for, for each of the at least one pedestrian frame, summing or averaging the keypoint confidences corresponding one-to-one to the specific number of keypoints of the pedestrian contained in the pedestrian frame to obtain a first total confidence; and a confidence comparison submodule for, for each of the at least one pedestrian frame, comparing the first total confidence with a corresponding confidence threshold, determining that the pedestrian contained in the pedestrian frame is a real pedestrian if the first total confidence is greater than the corresponding confidence threshold, and otherwise determining that the pedestrian contained in the pedestrian frame is not a real pedestrian.
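The first-total-confidence check can be sketched as follows. The function name, the aggregation choice and the threshold value are illustrative only; the patent leaves both the summation/averaging choice and the threshold open.

```python
import numpy as np

def is_real_pedestrian(keypoint_confidences, threshold, reduce="mean"):
    """Aggregate the per-keypoint confidences by summation or averaging
    into a first total confidence and compare it with a threshold."""
    conf = np.asarray(keypoint_confidences, dtype=float)
    total = conf.mean() if reduce == "mean" else conf.sum()
    return bool(total > threshold)

# e.g. 14 body keypoints: a frame whose keypoints are mostly confident
# is kept; a frame with uniformly weak keypoints is filtered out.
print(is_real_pedestrian([0.9, 0.8, 0.85] + [0.7] * 11, threshold=0.5))  # True
print(is_real_pedestrian([0.1] * 14, threshold=0.5))                     # False
```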
According to an embodiment of the present invention, the screening module 540 includes: the same pedestrian determination submodule is used for determining whether pedestrians contained in any two pedestrian frames in the pedestrian frames are the same pedestrian or not according to skeleton information respectively relevant to each of the two pedestrian frames under the condition that the preliminary pedestrian detection result comprises the multiple pedestrian frames; and a pedestrian frame selection sub-module for selecting, for two or more pedestrian frames containing the same pedestrian, a unique pedestrian frame from among the two or more pedestrian frames as one of the at least one pedestrian frame, in a case where the preliminary result of pedestrian detection includes a plurality of pedestrian frames.
According to an embodiment of the present invention, the skeleton information related to any one of the one or more pedestrian frames includes a key point feature map in one-to-one correspondence with a specific number of key points of pedestrians included in the pedestrian frame, and the same pedestrian determination submodule includes: a similarity calculation unit configured to calculate, for any two of the plurality of pedestrian frames, a similarity between skeletons of pedestrians included in the two pedestrian frames using a keypoint feature map that is one-to-one corresponding to the specific number of keypoints of the pedestrians included in each of the two pedestrian frames; and the similarity comparison unit is used for comparing the calculated similarity with a similarity threshold value for any two pedestrian frames in the plurality of pedestrian frames, if the calculated similarity is greater than the similarity threshold value, the pedestrians contained in the two pedestrian frames are determined to be the same pedestrian, otherwise, the pedestrians contained in the two pedestrian frames are determined not to be the same pedestrian.
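One way the similarity calculation unit's computation could look is sketched below. Cosine similarity averaged over the keypoint feature maps is an assumption on our part; the patent does not prescribe a specific similarity formula, and the threshold value is illustrative.

```python
import numpy as np

def skeleton_similarity(maps_a, maps_b):
    """Cosine similarity of corresponding keypoint feature maps,
    averaged over the keypoints (one possible similarity measure)."""
    sims = []
    for a, b in zip(maps_a, maps_b):
        a, b = a.ravel(), b.ravel()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        sims.append(float(a @ b) / denom if denom > 0 else 0.0)
    return float(np.mean(sims))

def contain_same_pedestrian(maps_a, maps_b, sim_threshold=0.9):
    """Two frames contain the same pedestrian if the skeleton
    similarity exceeds the threshold (threshold is illustrative)."""
    return skeleton_similarity(maps_a, maps_b) > sim_threshold

rng = np.random.default_rng(0)
maps = [rng.random((8, 8)) for _ in range(14)]  # e.g. 14 keypoint feature maps
print(contain_same_pedestrian(maps, maps))      # identical skeletons → True
```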
According to an embodiment of the present invention, the preliminary pedestrian detection result further includes pedestrian frame confidences in one-to-one correspondence with the one or more pedestrian frames, the skeleton information related to any one of the one or more pedestrian frames includes keypoint confidences in one-to-one correspondence with a specific number of keypoints of the pedestrian contained in the pedestrian frame, and the pedestrian frame selection submodule includes: a pedestrian frame selection unit for, for two or more pedestrian frames containing the same pedestrian, selecting the unique pedestrian frame from the two or more pedestrian frames according to the pedestrian frame confidences corresponding one-to-one to the two or more pedestrian frames and the keypoint confidences corresponding one-to-one to the specific number of keypoints of the pedestrian contained in each of the two or more pedestrian frames.
According to an embodiment of the present invention, the pedestrian frame selecting unit includes: a second total confidence obtaining subunit for, for each of two or more pedestrian frames containing the same pedestrian, summing or averaging the pedestrian frame confidence corresponding to the pedestrian frame and the keypoint confidences corresponding one-to-one to the specific number of keypoints of the pedestrian contained in the pedestrian frame, to obtain a second total confidence; and a pedestrian frame selection subunit for selecting, from the two or more pedestrian frames containing the same pedestrian, the pedestrian frame with the highest second total confidence as the unique pedestrian frame.
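The second-total-confidence selection can be sketched as follows. The tuple layout, function names and sample values are illustrative; the summation variant is shown, but averaging works the same way via the `reduce` parameter.

```python
def select_unique_frame(frames, reduce=sum):
    """frames: list of (frame_id, frame_confidence, keypoint_confidences)
    tuples for frames judged to contain the same pedestrian. The second
    total confidence combines the frame confidence with the keypoint
    confidences; the frame with the highest value is kept."""
    def second_total(frame):
        _, frame_conf, kp_confs = frame
        return reduce([frame_conf] + list(kp_confs))
    return max(frames, key=second_total)[0]

duplicates = [
    ("frame_a", 0.95, [0.9, 0.9, 0.8]),   # total 3.55 → kept
    ("frame_b", 0.80, [0.6, 0.5, 0.4]),   # total 2.30 → discarded
]
print(select_unique_frame(duplicates))  # frame_a
```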
According to an embodiment of the present invention, the detecting module 520 includes: the first input submodule is used for inputting the image to be processed into a first convolution neural network so as to extract the characteristics of the image to be processed; and the second input submodule is used for inputting the characteristics of the image to be processed into a second convolutional neural network so as to obtain the pedestrian detection preliminary result.
According to an embodiment of the present invention, the skeleton analysis module 530 includes: and the third input submodule is used for inputting the characteristics of the image to be processed and the preliminary pedestrian detection result into a full convolution network so as to obtain the skeleton information respectively related to each of the one or more pedestrian frames.
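The dataflow through the three networks can be illustrated with toy stand-ins. None of the functions below is a real model; they only show how the shared features feed both the second convolutional neural network and the full convolutional network, and all shapes and values are made up.

```python
import numpy as np

def first_cnn(image):
    """Stand-in for the first convolutional neural network: features."""
    return image.mean(axis=2)

def second_cnn(features):
    """Stand-in for the second network: candidate pedestrian frames."""
    h, w = features.shape
    return [{"box": (0, 0, w // 2, h), "conf": 0.9},
            {"box": (w // 4, 0, 3 * w // 4, h), "conf": 0.7}]

def skeleton_fcn(features, frames, num_keypoints=14):
    """Stand-in for the full convolutional network: per-frame skeleton
    information (here, dummy keypoint confidences)."""
    rng = np.random.default_rng(0)
    return [rng.random(num_keypoints) for _ in frames]

image = np.zeros((64, 32, 3))                 # H x W x RGB
features = first_cnn(image)                   # shared features
frames = second_cnn(features)                 # preliminary detection result
skeleton_info = skeleton_fcn(features, frames)  # one entry per frame
print(len(frames), len(skeleton_info))        # 2 2
```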
According to an embodiment of the present invention, the pedestrian detection apparatus 500 further includes: a training image acquisition module for acquiring a training image, wherein a target pedestrian frame corresponding to each pedestrian in the training image and target positions of a specific number of key points of each pedestrian are marked; a loss function constructing module for constructing a first loss function by using target pedestrian frames corresponding one-to-one to the pedestrians in the training image as target values of the one or more pedestrian frames in the preliminary pedestrian detection result obtained by the second convolutional neural network for the training image, and constructing a second loss function by using the target positions of the specific number of key points of the pedestrians in the training image as target values of the skeleton information obtained by the full convolutional network for the training image; and a training module for training parameters in the first convolutional neural network, the second convolutional neural network and the full convolutional network by using the first loss function and the second loss function.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
FIG. 6 shows a schematic block diagram of a pedestrian detection system 600 according to one embodiment of the invention. The pedestrian detection system 600 includes an image capture device 610, a storage device 620, and a processor 630.
The image capturing device 610 is used for capturing an image to be processed. The image capture device 610 is optional and the pedestrian detection system 600 may not include the image capture device 610. In this case, an image for pedestrian detection may be acquired using another image acquisition device and the acquired image may be transmitted to the pedestrian detection system 600.
The storage device 620 stores program codes for implementing respective steps in the pedestrian detection method according to the embodiment of the invention.
The processor 630 is configured to run the program codes stored in the storage device 620 to execute the corresponding steps of the pedestrian detection method according to the embodiment of the invention, and to implement the corresponding modules of the pedestrian detection apparatus according to the embodiment of the invention.
In one embodiment, the program code, when executed by the processor 630, causes the pedestrian detection system 600 to perform the steps of: acquiring an image to be processed; detecting pedestrians in the image to be processed to obtain a pedestrian detection preliminary result, wherein the pedestrian detection preliminary result comprises one or more pedestrian frames, and each pedestrian frame is used for indicating an area in which a pedestrian is likely to exist in the image to be processed; performing skeleton analysis on the pedestrians contained in each of the one or more pedestrian frames to obtain skeleton information respectively related to each of the one or more pedestrian frames; and screening the one or more pedestrian frames according to the skeleton information respectively related to each of the one or more pedestrian frames to obtain at least one pedestrian frame corresponding to at least part of pedestrians in the image to be processed one by one.
In one embodiment, after the program code, when executed by the processor 630, causes the pedestrian detection system 600 to perform the step of screening the one or more pedestrian frames according to the skeleton information respectively associated with each of the one or more pedestrian frames, the program code further causes the pedestrian detection system 600 to perform: for each of the at least one pedestrian frame, determining whether the pedestrian contained in the pedestrian frame is a real pedestrian according to the skeleton information related to the pedestrian frame, and if not, filtering out the pedestrian frame.
In one embodiment, the skeleton information related to any one of the one or more pedestrian frames includes keypoint confidences in one-to-one correspondence with a specific number of keypoints of the pedestrian contained in the pedestrian frame, and the step, executed by the pedestrian detection system 600 when the program code is executed by the processor 630, of determining, for each of the at least one pedestrian frame, whether the pedestrian contained in the pedestrian frame is a real pedestrian according to the skeleton information related to the pedestrian frame includes: for each of the at least one pedestrian frame, summing or averaging the keypoint confidences corresponding one-to-one to the specific number of keypoints of the pedestrian contained in the pedestrian frame to obtain a first total confidence; and for each of the at least one pedestrian frame, comparing the first total confidence with a corresponding confidence threshold, determining that the pedestrian contained in the pedestrian frame is a real pedestrian if the first total confidence is greater than the corresponding confidence threshold, and otherwise determining that the pedestrian contained in the pedestrian frame is not a real pedestrian.
In one embodiment, the program code, when executed by the processor 630, causes the pedestrian detection system 600 to perform the step of filtering the one or more pedestrian frames according to the skeleton information respectively associated with each of the one or more pedestrian frames to obtain at least one pedestrian frame corresponding to at least a part of pedestrians in the image to be processed, including: determining, for any two pedestrian frames of the plurality of pedestrian frames, whether pedestrians contained in the two pedestrian frames are the same pedestrian according to skeleton information respectively associated with each of the two pedestrian frames, if the preliminary result of pedestrian detection includes the plurality of pedestrian frames; and for two or more pedestrian frames containing the same pedestrian, selecting a unique pedestrian frame from the two or more pedestrian frames as one of the at least one pedestrian frame.
In one embodiment, the skeleton information associated with any one of the one or more pedestrian frames includes a key point feature map corresponding to a specific number of key points of pedestrians included in the pedestrian frame, and the step of determining, by the processor 630, whether the pedestrians included in the two pedestrian frames are the same pedestrian according to the skeleton information associated with each of the two pedestrian frames, for any two of the pedestrian frames, executed by the pedestrian detection system 600, includes: for any two pedestrian frames in the plurality of pedestrian frames, calculating the similarity between the skeletons of the pedestrians contained in the two pedestrian frames by using the key point feature maps which are in one-to-one correspondence with the specific number of key points of the pedestrians contained in each of the two pedestrian frames; and comparing the calculated similarity with a similarity threshold value for any two pedestrian frames in the plurality of pedestrian frames, if the calculated similarity is greater than the similarity threshold value, determining that the pedestrians contained in the two pedestrian frames are the same pedestrian, otherwise determining that the pedestrians contained in the two pedestrian frames are not the same pedestrian.
In one embodiment, the preliminary pedestrian detection result further includes pedestrian frame confidences corresponding one-to-one to the one or more pedestrian frames, the skeleton information associated with any one of the one or more pedestrian frames includes keypoint confidences corresponding one-to-one to a specific number of keypoints of the pedestrian contained in the pedestrian frame, and the step, executed by the pedestrian detection system 600 when the program code is executed by the processor 630, of selecting a unique pedestrian frame from the two or more pedestrian frames as one of the at least one pedestrian frame includes: for two or more pedestrian frames containing the same pedestrian, selecting the unique pedestrian frame from the two or more pedestrian frames according to the pedestrian frame confidences corresponding one-to-one to the two or more pedestrian frames and the keypoint confidences corresponding one-to-one to the specific number of keypoints of the pedestrian contained in each of the two or more pedestrian frames.
In one embodiment, the step, executed by the pedestrian detection system 600 when the program code is executed by the processor 630, of selecting the unique pedestrian frame from the two or more pedestrian frames containing the same pedestrian according to the pedestrian frame confidences corresponding one-to-one to the two or more pedestrian frames and the keypoint confidences corresponding one-to-one to the specific number of keypoints of the pedestrian contained in each of the two or more pedestrian frames includes: for each of the two or more pedestrian frames containing the same pedestrian, summing or averaging the pedestrian frame confidence corresponding to the pedestrian frame and the keypoint confidences corresponding one-to-one to the specific number of keypoints of the pedestrian contained in the pedestrian frame to obtain a second total confidence; and for the two or more pedestrian frames containing the same pedestrian, selecting the pedestrian frame with the highest second total confidence as the unique pedestrian frame.
In one embodiment, the program code, when executed by the processor 630, causes the pedestrian detection system 600 to perform the step of detecting a pedestrian in the image to be processed to obtain a pedestrian detection preliminary result comprising: inputting the image to be processed into a first convolution neural network so as to extract the characteristics of the image to be processed; and inputting the characteristics of the image to be processed into a second convolutional neural network to obtain the preliminary result of pedestrian detection.
In one embodiment, the step of performing skeleton analysis on the pedestrians contained in each of the one or more pedestrian frames to obtain skeleton information respectively related to each of the one or more pedestrian frames, which is executed by the pedestrian detection system 600 by the processor 630, includes: inputting the features of the image to be processed and the preliminary result of pedestrian detection into a full convolution network to obtain the skeleton information respectively associated with each of the one or more pedestrian frames.
In one embodiment, the program code, when executed by the processor 630, further causes the pedestrian detection system 600 to perform: acquiring a training image, wherein a target pedestrian frame corresponding to each pedestrian in the training image and target positions of a specific number of key points of each pedestrian are marked; constructing a first loss function by taking target pedestrian frames corresponding one-to-one to the pedestrians in the training image as target values of the one or more pedestrian frames in the preliminary pedestrian detection result obtained by the second convolutional neural network for the training image, and constructing a second loss function by taking the target positions of the specific number of key points of the pedestrians in the training image as target values of the skeleton information obtained by the full convolutional network for the training image; and training parameters in the first convolutional neural network, the second convolutional neural network and the full convolutional network using the first loss function and the second loss function.
Furthermore, according to an embodiment of the present invention, there is also provided a storage medium on which program instructions are stored, which when executed by a computer or a processor are used for executing the respective steps of the pedestrian detection method according to an embodiment of the present invention and for implementing the respective modules in the pedestrian detection apparatus according to an embodiment of the present invention. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), a USB memory, or any combination of the above storage media.
In one embodiment, the computer program instructions, when executed by a computer or processor, may cause the computer or processor to implement the various functional modules of the pedestrian detection apparatus according to the embodiment of the invention, and/or may perform the pedestrian detection method according to the embodiment of the invention.
In one embodiment, the computer program instructions, when executed by a computer, cause the computer to perform the steps of: acquiring an image to be processed; detecting pedestrians in the image to be processed to obtain a pedestrian detection preliminary result, wherein the pedestrian detection preliminary result comprises one or more pedestrian frames, and each pedestrian frame is used for indicating an area in which a pedestrian is likely to exist in the image to be processed; performing skeleton analysis on the pedestrians contained in each of the one or more pedestrian frames to obtain skeleton information respectively related to each of the one or more pedestrian frames; and screening the one or more pedestrian frames according to the skeleton information respectively related to each of the one or more pedestrian frames to obtain at least one pedestrian frame corresponding to at least part of pedestrians in the image to be processed one by one.
In one embodiment, after the computer program instructions, when executed by a computer, cause the computer to perform the step of screening the one or more pedestrian frames according to the skeleton information respectively associated with each of the one or more pedestrian frames, the computer program instructions further cause the computer to perform: for each of the at least one pedestrian frame, determining whether the pedestrian contained in the pedestrian frame is a real pedestrian according to the skeleton information related to the pedestrian frame, and if not, filtering out the pedestrian frame.
In one embodiment, the skeleton information associated with any one of the one or more pedestrian frames includes keypoint confidences in one-to-one correspondence with a specific number of keypoints of the pedestrian contained in the pedestrian frame, and the step, executed by the computer when the computer program instructions are executed, of determining, for each of the at least one pedestrian frame, whether the pedestrian contained in the pedestrian frame is a real pedestrian according to the skeleton information associated with the pedestrian frame includes: for each of the at least one pedestrian frame, summing or averaging the keypoint confidences corresponding one-to-one to the specific number of keypoints of the pedestrian contained in the pedestrian frame to obtain a first total confidence; and for each of the at least one pedestrian frame, comparing the first total confidence with a corresponding confidence threshold, determining that the pedestrian contained in the pedestrian frame is a real pedestrian if the first total confidence is greater than the corresponding confidence threshold, and otherwise determining that the pedestrian contained in the pedestrian frame is not a real pedestrian.
In one embodiment, the computer program instructions, when executed by a computer, cause the computer to perform the step of filtering the one or more pedestrian frames according to the skeleton information respectively associated with each of the one or more pedestrian frames to obtain at least one pedestrian frame corresponding to at least part of pedestrians in the image to be processed one by one, including: determining, for any two pedestrian frames of the plurality of pedestrian frames, whether pedestrians contained in the two pedestrian frames are the same pedestrian according to skeleton information respectively associated with each of the two pedestrian frames, if the preliminary result of pedestrian detection includes the plurality of pedestrian frames; and for two or more pedestrian frames containing the same pedestrian, selecting a unique pedestrian frame from the two or more pedestrian frames as one of the at least one pedestrian frame.
In one embodiment, the skeleton information associated with any one of the one or more pedestrian frames includes a key point feature map in one-to-one correspondence with a specific number of key points of pedestrians contained in the pedestrian frame, and the step of determining, for any two of the plurality of pedestrian frames, whether the pedestrians contained in the two pedestrian frames are the same pedestrian according to the skeleton information associated with each of the two pedestrian frames, which is executed by the computer, includes: for any two pedestrian frames in the plurality of pedestrian frames, calculating the similarity between the skeletons of the pedestrians contained in the two pedestrian frames by using the key point feature maps which are in one-to-one correspondence with the specific number of key points of the pedestrians contained in each of the two pedestrian frames; and comparing the calculated similarity with a similarity threshold value for any two pedestrian frames in the plurality of pedestrian frames, if the calculated similarity is greater than the similarity threshold value, determining that the pedestrians contained in the two pedestrian frames are the same pedestrian, otherwise determining that the pedestrians contained in the two pedestrian frames are not the same pedestrian.
In one embodiment, the preliminary pedestrian detection result further includes pedestrian frame confidences corresponding one-to-one to the one or more pedestrian frames, the skeleton information associated with any one of the one or more pedestrian frames includes keypoint confidences corresponding one-to-one to a specific number of keypoints of the pedestrian contained in the pedestrian frame, and the step, executed by the computer when the computer program instructions are executed, of selecting a unique pedestrian frame from the two or more pedestrian frames containing the same pedestrian as one of the at least one pedestrian frame includes: for the two or more pedestrian frames containing the same pedestrian, selecting the unique pedestrian frame from the two or more pedestrian frames according to the pedestrian frame confidences corresponding one-to-one to the two or more pedestrian frames and the keypoint confidences corresponding one-to-one to the specific number of keypoints of the pedestrian contained in each of the two or more pedestrian frames.
In one embodiment, the step, executed by the computer when the computer program instructions are executed, of selecting the unique pedestrian frame from the two or more pedestrian frames containing the same pedestrian according to the pedestrian frame confidences corresponding one-to-one to the two or more pedestrian frames and the keypoint confidences corresponding one-to-one to the specific number of keypoints of the pedestrian contained in each of the two or more pedestrian frames includes: for each of the two or more pedestrian frames containing the same pedestrian, summing or averaging the pedestrian frame confidence corresponding to the pedestrian frame and the keypoint confidences corresponding one-to-one to the specific number of keypoints of the pedestrian contained in the pedestrian frame to obtain a second total confidence; and for the two or more pedestrian frames containing the same pedestrian, selecting the pedestrian frame with the highest second total confidence as the unique pedestrian frame.
In one embodiment, the computer program instructions, when executed by a computer, cause the computer to perform the step of detecting a pedestrian in the image to be processed to obtain a pedestrian detection preliminary result comprising: inputting the image to be processed into a first convolution neural network so as to extract the characteristics of the image to be processed; and inputting the characteristics of the image to be processed into a second convolutional neural network to obtain the preliminary result of pedestrian detection.
In one embodiment, the computer program instructions, when executed by a computer, cause the computer to perform the step of performing a skeleton analysis on the pedestrians contained in each of the one or more pedestrian frames to obtain skeleton information respectively associated with each of the one or more pedestrian frames, including: inputting the features of the image to be processed and the preliminary result of pedestrian detection into a full convolution network to obtain the skeleton information respectively associated with each of the one or more pedestrian frames.
In one embodiment, the computer program instructions, when executed by a computer, further cause the computer to perform: acquiring a training image, wherein a target pedestrian frame corresponding to each pedestrian in the training image and target positions of a specific number of key points of each pedestrian are marked; constructing a first loss function by taking target pedestrian frames corresponding one-to-one to the pedestrians in the training image as target values of the one or more pedestrian frames in the preliminary pedestrian detection result obtained by the second convolutional neural network for the training image, and constructing a second loss function by taking the target positions of the specific number of key points of the pedestrians in the training image as target values of the skeleton information obtained by the full convolutional network for the training image; and training parameters in the first convolutional neural network, the second convolutional neural network and the full convolutional network using the first loss function and the second loss function.
The modules in the pedestrian detection system according to the embodiment of the invention may be implemented by the processor of the electronic device for pedestrian detection according to the embodiment of the invention running computer program instructions stored in the memory, or may be implemented when computer instructions stored in the computer-readable storage medium of the computer program product according to the embodiment of the invention are run by a computer.
According to the pedestrian detection method and device provided by the embodiments of the invention, the pedestrian frames are screened using the skeleton information of the pedestrians contained in the pedestrian frames, so that redundant pedestrian frames of the same pedestrian can be filtered out while pedestrian frames of different pedestrians are retained. The pedestrian detection method and device can thereby avoid the problem, introduced by non-maximum suppression (NMS), that the pedestrian frame of one pedestrian is used to filter out the pedestrian frames of other pedestrians, so that the accuracy of the pedestrian detection result can be improved. This is of great value for pedestrian monitoring, particularly in pedestrian-dense scenes.
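The NMS failure mode mentioned above can be made concrete with a small intersection-over-union (IoU) computation; the box coordinates and the 0.5 suppression threshold below are hypothetical values chosen for illustration:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


# Two heavily overlapping frames that nonetheless contain two different
# pedestrians walking side by side: IoU-based NMS with a 0.5 threshold
# would discard one of them, whereas skeleton-based screening can keep
# both because their skeletons differ.
box_a, box_b = (10, 10, 60, 110), (20, 10, 70, 110)
print(iou(box_a, box_b))  # about 0.667, above a typical NMS threshold
```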
Although the illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the foregoing illustrative embodiments are merely exemplary and are not intended to limit the scope of the invention thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present invention. All such changes and modifications are intended to be included within the scope of the present invention as set forth in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another device, or some features may be omitted, or not executed.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the method of the present invention should not be construed to reflect the intent: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where such features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functionality of some of the modules in a pedestrian detection apparatus according to embodiments of the present invention. The present invention may also be embodied as device programs (e.g., computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering; these words may be interpreted as names.
The above description is only of specific embodiments of the present invention, and the protection scope of the present invention is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed by the present invention, and such changes or substitutions shall be covered within the protection scope of the present invention. The protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (20)

1. A pedestrian detection method, comprising:
acquiring an image to be processed;
detecting pedestrians in the image to be processed by a sliding window method to obtain a pedestrian detection preliminary result, wherein the pedestrian detection preliminary result comprises one or more pedestrian frames, and each pedestrian frame is used for indicating an area in which a pedestrian is likely to exist in the image to be processed;
performing skeleton analysis on the pedestrians contained in each of the one or more pedestrian frames to obtain skeleton information respectively related to each of the one or more pedestrian frames; and
screening the one or more pedestrian frames according to the skeleton information respectively related to each of the one or more pedestrian frames to obtain at least one pedestrian frame corresponding one-to-one to at least part of pedestrians in the image to be processed.
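The four steps of claim 1 can be summarized by the following non-limiting sketch, where `detector`, `skeleton_analyzer`, and `screen` are hypothetical callables standing in for the detection, skeleton-analysis, and screening stages (their names and signatures are assumptions, not part of the claim):

```python
def detect_pedestrians(image, detector, skeleton_analyzer, screen):
    """Claim 1 as a pipeline: detect candidate pedestrian frames, analyze
    the skeleton of the pedestrian in each frame, then screen the frames
    using that skeleton information."""
    frames = detector(image)                                  # preliminary result
    skeletons = [skeleton_analyzer(image, frame) for frame in frames]
    return screen(frames, skeletons)                          # one frame per pedestrian
```

For example, with stub callables the pipeline simply threads the candidate frames through the screening stage.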
2. The pedestrian detection method of claim 1, wherein, after the screening the one or more pedestrian frames according to the skeletal information respectively associated with each of the one or more pedestrian frames, the pedestrian detection method further comprises:
for each of the at least one pedestrian frame, judging whether the pedestrian contained in the pedestrian frame is a real pedestrian according to the skeleton information related to the pedestrian frame, and if not, filtering out the pedestrian frame.
3. The pedestrian detection method according to claim 2, wherein the skeleton information relating to any one of the one or more pedestrian frames includes a keypoint confidence degree in one-to-one correspondence with a specific number of keypoints of pedestrians included in the pedestrian frame,
for each of the at least one pedestrian frame, judging whether the pedestrian contained in the pedestrian frame is a real pedestrian according to the skeleton information related to the pedestrian frame includes:
for each of the at least one pedestrian frame,
summing or averaging the keypoint confidence degrees in one-to-one correspondence with the specific number of keypoints of the pedestrian contained in the pedestrian frame, to obtain a first total confidence degree; and
comparing the first total confidence with a corresponding confidence threshold, and if the first total confidence is greater than the corresponding confidence threshold, determining that the pedestrian contained in the pedestrian frame is a real pedestrian, otherwise determining that the pedestrian contained in the pedestrian frame is not a real pedestrian.
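A minimal sketch of the criterion of claim 3 (the function name, the choice between summation and averaging, and the threshold value are illustrative assumptions; the claim allows either combining form):

```python
def is_real_pedestrian(keypoint_confidences, confidence_threshold, average=True):
    """Combine the key-point confidences of one pedestrian frame into a
    first total confidence, then compare it with the corresponding
    confidence threshold."""
    total = sum(keypoint_confidences)
    if average:
        total /= len(keypoint_confidences)
    return total > confidence_threshold
```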
4. The pedestrian detection method according to claim 1, wherein the screening the one or more pedestrian frames according to the skeleton information respectively associated with each of the one or more pedestrian frames to obtain at least one pedestrian frame in one-to-one correspondence with at least some pedestrians in the image to be processed comprises:
in the case where the preliminary result of pedestrian detection includes a plurality of pedestrian frames,
for any two pedestrian frames in the plurality of pedestrian frames, determining whether pedestrians contained in the two pedestrian frames are the same pedestrian according to skeleton information respectively related to each of the two pedestrian frames; and
for two or more pedestrian frames containing the same pedestrian, selecting a unique pedestrian frame from the two or more pedestrian frames as one of the at least one pedestrian frame.
5. The pedestrian detection method according to claim 4, wherein the skeleton information relating to any one of the one or more pedestrian frames includes a key point feature map in one-to-one correspondence with a specific number of key points of pedestrians included in the pedestrian frame,
the determining, for any two of the plurality of pedestrian frames, whether the pedestrians contained in the two pedestrian frames are the same pedestrian according to the skeleton information respectively associated with each of the two pedestrian frames includes:
for any two pedestrian frames of the plurality of pedestrian frames,
calculating the similarity between skeletons of the pedestrians contained in the two pedestrian frames by using the key point feature maps which are in one-to-one correspondence with the specific number of key points of the pedestrians contained in each of the two pedestrian frames; and
comparing the calculated similarity with a similarity threshold, and if the calculated similarity is greater than the similarity threshold, determining that the pedestrians contained in the two pedestrian frames are the same pedestrian, otherwise determining that the pedestrians contained in the two pedestrian frames are not the same pedestrian.
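A hedged sketch of the comparison of claim 5: the claim compares key-point feature maps, whereas this simplified illustration reduces each skeleton to a list of (x, y) key-point positions and uses a distance-based similarity, which is an assumed stand-in metric rather than the claimed feature-map computation:

```python
import math


def skeleton_similarity(keypoints_a, keypoints_b):
    """Similarity between two skeletons, each given as a list of (x, y)
    key-point positions; decays with the mean key-point distance."""
    dists = [math.dist(a, b) for a, b in zip(keypoints_a, keypoints_b)]
    return 1.0 / (1.0 + sum(dists) / len(dists))


def same_pedestrian(keypoints_a, keypoints_b, similarity_threshold=0.5):
    """Claim 5's decision rule: same pedestrian if and only if the
    similarity exceeds the similarity threshold."""
    return skeleton_similarity(keypoints_a, keypoints_b) > similarity_threshold
```

Two frames covering the same pedestrian yield nearly coincident skeletons (similarity near 1), while frames of different pedestrians yield well-separated skeletons.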
6. The pedestrian detection method according to claim 4 or 5, wherein the pedestrian detection preliminary result further includes pedestrian frame confidence degrees that correspond one-to-one to the one or more pedestrian frames, the skeleton information relating to any one of the one or more pedestrian frames includes keypoint confidence degrees that correspond one-to-one to a specific number of keypoints of pedestrians included in the pedestrian frame,
the selecting, for two or more pedestrian frames containing the same pedestrian, a unique pedestrian frame from the two or more pedestrian frames as one of the at least one pedestrian frame includes:
for two or more pedestrian frames containing the same pedestrian, selecting the unique pedestrian frame from the two or more pedestrian frames according to a pedestrian frame confidence corresponding to the two or more pedestrian frames one to one and a keypoint confidence corresponding to the specific number of keypoints of the pedestrian contained in each of the two or more pedestrian frames.
7. The pedestrian detection method according to claim 6, wherein the selecting, for two or more pedestrian frames containing the same pedestrian, the unique pedestrian frame from the two or more pedestrian frames according to a pedestrian frame confidence corresponding to the two or more pedestrian frames one to one and a keypoint confidence corresponding to the certain number of keypoints for the pedestrian contained in each of the two or more pedestrian frames, comprises:
for two or more pedestrian frames containing the same pedestrian,
for each of the two or more pedestrian frames, summing or averaging the pedestrian frame confidence corresponding to the pedestrian frame and the keypoint confidences in one-to-one correspondence with the specific number of keypoints of the pedestrian contained in the pedestrian frame, to obtain a second total confidence; and
selecting the pedestrian frame with the highest second total confidence as the unique pedestrian frame.
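The selection of claim 7 can be sketched as follows (the tuple representation of a candidate frame and the summation variant of the second total confidence are illustrative assumptions; the claim also permits averaging):

```python
def select_unique_frame(candidate_frames):
    """For frames judged to contain the same pedestrian, each entry is
    (frame_confidence, keypoint_confidences); the frame with the highest
    second total confidence is kept as the unique pedestrian frame."""
    def second_total_confidence(entry):
        frame_conf, keypoint_confs = entry
        return frame_conf + sum(keypoint_confs)  # summation variant
    return max(candidate_frames, key=second_total_confidence)
```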
8. The pedestrian detection method according to claim 1, wherein the detecting a pedestrian in the image to be processed to obtain a pedestrian detection preliminary result comprises:
inputting the image to be processed into a first convolution neural network so as to extract the characteristics of the image to be processed; and
inputting the features of the image to be processed into a second convolutional neural network to obtain the preliminary result of pedestrian detection, wherein the second convolutional neural network is used for realizing the sliding window method.
9. The pedestrian detection method according to claim 8, wherein the performing skeleton analysis on the pedestrians included in each of the one or more pedestrian frames to obtain skeleton information respectively associated with each of the one or more pedestrian frames comprises:
inputting the features of the image to be processed and the preliminary result of pedestrian detection into a full convolution network to obtain the skeleton information respectively associated with each of the one or more pedestrian frames.
10. The pedestrian detection method according to claim 9, wherein the pedestrian detection method further comprises:
acquiring a training image, wherein a target pedestrian frame corresponding to each pedestrian in the training image and target positions of a specific number of key points of each pedestrian are marked;
constructing a first loss function by taking target pedestrian frames corresponding to pedestrians in the training image in a one-to-one mode as target values of one or more pedestrian frames in a pedestrian detection preliminary result obtained by the second convolutional neural network aiming at the training image, and constructing a second loss function by taking target positions of the pedestrians in the training image and the key points of the specific number as target values of skeleton information obtained by the full convolutional network aiming at the training image; and
training parameters in the first convolutional neural network, the second convolutional neural network, and the full convolutional network using the first loss function and the second loss function.
11. A pedestrian detection apparatus comprising:
the image to be processed acquisition module is used for acquiring an image to be processed;
the detection module is used for detecting pedestrians in the image to be processed through a sliding window method to obtain a pedestrian detection preliminary result, wherein the pedestrian detection preliminary result comprises one or more pedestrian frames, and each pedestrian frame is used for indicating an area where pedestrians are possibly present in the image to be processed;
the skeleton analysis module is used for carrying out skeleton analysis on the pedestrians contained in each of the one or more pedestrian frames so as to obtain skeleton information respectively related to each of the one or more pedestrian frames; and
the screening module is used for screening the one or more pedestrian frames according to the skeleton information respectively related to each of the one or more pedestrian frames so as to obtain at least one pedestrian frame in one-to-one correspondence with at least part of pedestrians in the image to be processed.
12. The pedestrian detection device according to claim 11, wherein the pedestrian detection device further comprises:
the real pedestrian judgment module is used for judging, for each of the at least one pedestrian frame, whether the pedestrian contained in the pedestrian frame is a real pedestrian according to the skeleton information related to the pedestrian frame, and filtering out the pedestrian frame if the pedestrian contained in the pedestrian frame is not a real pedestrian.
13. The pedestrian detection apparatus of claim 12, wherein the skeletal information associated with any of the one or more pedestrian frames includes keypoint confidences that are one-to-one corresponding to a particular number of keypoints for the pedestrian contained by that pedestrian frame,
the real pedestrian judgment module includes:
a first total confidence obtaining submodule, configured to sum or average, for each of the at least one pedestrian frame, the confidence levels of the keypoints that correspond to the pedestrians included in the pedestrian frame in the specific number one to one, so as to obtain a first total confidence level; and
a confidence comparison submodule, configured to compare, for each of the at least one pedestrian frame, the first total confidence with a corresponding confidence threshold, and if the first total confidence is greater than the corresponding confidence threshold, determine that the pedestrian contained in the pedestrian frame is a real pedestrian, otherwise determine that the pedestrian contained in the pedestrian frame is not a real pedestrian.
14. The pedestrian detection apparatus of claim 11, wherein the screening module comprises:
the same pedestrian determination submodule is used for determining whether pedestrians contained in any two pedestrian frames in the pedestrian frames are the same pedestrian or not according to skeleton information respectively relevant to each of the two pedestrian frames under the condition that the preliminary pedestrian detection result comprises the multiple pedestrian frames; and
a pedestrian frame selection sub-module configured to, in a case where the preliminary result of pedestrian detection includes a plurality of pedestrian frames, select, as one of the at least one pedestrian frame, a unique pedestrian frame from among two or more pedestrian frames for two or more pedestrian frames containing the same pedestrian.
15. The pedestrian detection device according to claim 14, wherein the skeleton information relating to any one of the one or more pedestrian frames includes a keypoint feature map in one-to-one correspondence with a specific number of keypoints of pedestrians included in the pedestrian frame,
the same pedestrian determination submodule includes:
a similarity calculation unit configured to calculate, for any two of the plurality of pedestrian frames, a similarity between skeletons of pedestrians included in the two pedestrian frames using a keypoint feature map that is one-to-one corresponding to the specific number of keypoints of the pedestrians included in each of the two pedestrian frames; and
a similarity comparison unit, configured to compare, for any two pedestrian frames of the plurality of pedestrian frames, the calculated similarity with a similarity threshold, and if the calculated similarity is greater than the similarity threshold, determine that the pedestrians contained in the two pedestrian frames are the same pedestrian, otherwise determine that the pedestrians contained in the two pedestrian frames are not the same pedestrian.
16. The pedestrian detection device according to claim 14 or 15, wherein the preliminary pedestrian detection result further includes pedestrian frame confidence degrees that correspond one-to-one to the one or more pedestrian frames, the skeleton information relating to any one of the one or more pedestrian frames includes keypoint confidence degrees that correspond one-to-one to a specific number of keypoints of pedestrians included in that pedestrian frame,
the pedestrian frame selection submodule comprises:
a pedestrian frame selection unit configured to, for two or more pedestrian frames including the same pedestrian, select the unique pedestrian frame from the two or more pedestrian frames according to a pedestrian frame confidence corresponding to the two or more pedestrian frames one to one and a keypoint confidence corresponding to the specific number of keypoints of the pedestrian included in each of the two or more pedestrian frames one to one.
17. The pedestrian detection apparatus according to claim 16, wherein the pedestrian frame selection unit includes:
a second total confidence obtaining subunit, configured to, for each of two or more pedestrian frames that include the same pedestrian, sum or average a pedestrian frame confidence corresponding to the pedestrian frame and a keypoint confidence corresponding to the pedestrian included in the pedestrian frame and corresponding to the specific number of keypoints one by one, to obtain a second total confidence; and
a pedestrian frame selection subunit, configured to select, for two or more pedestrian frames containing the same pedestrian, the pedestrian frame with the highest second total confidence as the unique pedestrian frame.
18. The pedestrian detection apparatus of claim 11, wherein the detection module comprises:
the first input submodule is used for inputting the image to be processed into a first convolution neural network so as to extract the characteristics of the image to be processed; and
a second input submodule, configured to input the features of the image to be processed into a second convolutional neural network to obtain the pedestrian detection preliminary result, wherein the second convolutional neural network is used for realizing the sliding window method.
19. The pedestrian detection apparatus of claim 18, wherein the skeletal analysis module comprises:
a third input submodule, configured to input the features of the image to be processed and the pedestrian detection preliminary result into a full convolution network to obtain the skeleton information respectively related to each of the one or more pedestrian frames.
20. The pedestrian detection device according to claim 19, wherein the pedestrian detection device further comprises:
the training image acquisition module is used for acquiring a training image, wherein a target pedestrian frame corresponding to each pedestrian in the training image and target positions of a specific number of key points of each pedestrian are marked;
a loss function constructing module, configured to construct a first loss function by using a target pedestrian frame corresponding to a pedestrian in the training image in a one-to-one manner as a target value of one or more pedestrian frames in a pedestrian detection preliminary result obtained by the second convolutional neural network for the training image, and construct a second loss function by using a target position of the pedestrian in the training image and the key points of the specific number as a target value of skeleton information obtained by the full convolutional network for the training image; and
a training module, configured to train parameters in the first convolutional neural network, the second convolutional neural network, and the full convolutional network using the first loss function and the second loss function.
CN201610971349.9A 2016-10-28 2016-10-28 Pedestrian detection method and device Active CN108009466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610971349.9A CN108009466B (en) 2016-10-28 2016-10-28 Pedestrian detection method and device


Publications (2)

Publication Number Publication Date
CN108009466A CN108009466A (en) 2018-05-08
CN108009466B true CN108009466B (en) 2022-03-15

Family

ID=62047524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610971349.9A Active CN108009466B (en) 2016-10-28 2016-10-28 Pedestrian detection method and device

Country Status (1)

Country Link
CN (1) CN108009466B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960081B (en) * 2018-06-15 2021-07-30 熵基科技股份有限公司 Palm image recognition method and device and computer readable storage medium
CN109034124A (en) * 2018-08-30 2018-12-18 成都考拉悠然科技有限公司 A kind of intelligent control method and system
CN109657545B (en) * 2018-11-10 2022-12-20 天津大学 Pedestrian detection method based on multi-task learning
CN110046600B (en) * 2019-04-24 2021-02-26 北京京东尚科信息技术有限公司 Method and apparatus for human detection
CN110349184B (en) * 2019-06-06 2022-08-09 南京工程学院 Multi-pedestrian tracking method based on iterative filtering and observation discrimination
CN111160103B (en) * 2019-11-29 2024-04-23 中科曙光(南京)计算技术有限公司 Unmanned middle pedestrian detection method and device
CN111127520B (en) * 2019-12-26 2022-06-14 华中科技大学 Vehicle tracking method and system based on video analysis
CN111914704B (en) * 2020-07-20 2024-03-19 北京格灵深瞳信息技术有限公司 Tricycle manned identification method and device, electronic equipment and storage medium
CN115294515B (en) * 2022-07-05 2023-06-13 南京邮电大学 Comprehensive anti-theft management method and system based on artificial intelligence

Citations (8)

Publication number Priority date Publication date Assignee Title
CN102184541A (en) * 2011-05-04 2011-09-14 西安电子科技大学 Multi-objective optimized human body motion tracking method
CN102306290A (en) * 2011-10-14 2012-01-04 刘伟华 Face tracking recognition technique based on video
CN104239865A (en) * 2014-09-16 2014-12-24 宁波熵联信息技术有限公司 Pedestrian detecting and tracking method based on multi-stage detection
CN105138983A (en) * 2015-08-21 2015-12-09 燕山大学 Pedestrian detection method based on weighted part model and selective search segmentation
CN105518744A (en) * 2015-06-29 2016-04-20 北京旷视科技有限公司 Pedestrian re-identification method and equipment
CN105574506A (en) * 2015-12-16 2016-05-11 深圳市商汤科技有限公司 Intelligent face tracking system and method based on depth learning and large-scale clustering
CN105787439A (en) * 2016-02-04 2016-07-20 广州新节奏智能科技有限公司 Depth image human body joint positioning method based on convolution nerve network
CN105913003A (en) * 2016-04-07 2016-08-31 国家电网公司 Multi-characteristic multi-model pedestrian detection method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US9152856B2 (en) * 2013-12-19 2015-10-06 Institute For Information Industry Pedestrian detection system and method


Non-Patent Citations (4)

Title
Joint Deep Learning for Pedestrian Detection; Wanli Ouyang et al.; 2013 IEEE International Conference on Computer Vision; 20131231; pp. 2056-2063 *
Pedestrian Detection in Low-resolution Imagery by Learning Multi-scale Intrinsic Motion Structures (MIMS); Jiejie Zhu et al.; 2014 IEEE Conference on Computer Vision and Pattern Recognition; 20141231; pp. 3510-3517 *
CNN-based event detection in surveillance video (基于CNN的监控视频事件检测); Wang Menglai et al.; Acta Automatica Sinica (自动化学报); 20160630; Vol. 42, No. 6; pp. 892-903 *
Research on pedestrian detection and re-identification based on the fusion of depth and visual information (基于深度与视觉信息融合的行人检测与再识别研究); Zhu Bohui (祝博荟); China Doctoral Dissertations Full-text Database, Information Science and Technology (中国博士学位论文全文数据库 信息科技辑); 20140515; Vol. 2014, No. 05; pp. I138-65 *


Similar Documents

Publication Publication Date Title
CN108009466B (en) Pedestrian detection method and device
CN108256404B (en) Pedestrian detection method and device
CN109255352B (en) Target detection method, device and system
CN108629791B (en) Pedestrian tracking method and device and cross-camera pedestrian tracking method and device
CN107808111B (en) Method and apparatus for pedestrian detection and attitude estimation
CN107358149B (en) Human body posture detection method and device
CN108875732B (en) Model training and instance segmentation method, device and system and storage medium
CN106845352B (en) Pedestrian detection method and device
CN106952303B (en) Vehicle distance detection method, device and system
CN108875481B (en) Method, device, system and storage medium for pedestrian detection
CN109918987B (en) Video subtitle keyword identification method and device
CN109299646B (en) Crowd abnormal event detection method, device, system and storage medium
CN107844794B (en) Image recognition method and device
CN106650662B (en) Target object shielding detection method and device
US8744125B2 (en) Clustering-based object classification
CN109815843B (en) Image processing method and related product
US20170213080A1 (en) Methods and systems for automatically and accurately detecting human bodies in videos and/or images
US9183431B2 (en) Apparatus and method for providing activity recognition based application service
CN108875537B (en) Object detection method, device and system and storage medium
CN109492577B (en) Gesture recognition method and device and electronic equipment
CN109727275B (en) Object detection method, device, system and computer readable storage medium
CN109426785B (en) Human body target identity recognition method and device
CN108875750B (en) Object detection method, device and system and storage medium
CN109241888B (en) Neural network training and object recognition method, device and system and storage medium
CN110263680B (en) Image processing method, device and system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant