CN112232231B - Pedestrian attribute identification method, system, computer equipment and storage medium

Pedestrian attribute identification method, system, computer equipment and storage medium

Info

Publication number: CN112232231B
Authority: CN (China)
Prior art keywords: pedestrian, module, residual, image, attribute
Legal status: Active (granted)
Application number: CN202011124766.2A
Other languages: Chinese (zh)
Other versions: CN112232231A
Inventors: 郁强, 张香伟, 毛云青
Current Assignee: CCI China Co Ltd
Original Assignee: CCI China Co Ltd
Application filed by CCI China Co Ltd
Priority to: CN202011124766.2A
Publication of application: CN112232231A
Publication of grant: CN112232231B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 - Recognition of crowd images, e.g. recognition of crowd congestion
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods

Abstract

The application relates to a pedestrian attribute identification method, system, computer device and storage medium. The method comprises: extracting an image frame from a real-time video, inputting the image frame into a trained target detection model, and obtaining a pedestrian image output by the trained target detection model; inputting the pedestrian image into a trained pedestrian recognition model, and obtaining a classification result output by the trained pedestrian recognition model according to preset pedestrian attributes, where the preset pedestrian attributes comprise a gender attribute and an age attribute; and obtaining the number and proportion of pedestrians of different gender and age attributes according to the classification result. According to the invention, pedestrian images can be extracted from real-time video and classified and counted to obtain pedestrian flow information for the different attributes. By monitoring this flow information, the popularity and popular visitor groups of sightseeing spots are displayed indirectly, so that traffic flow and the scope of business services can be planned reasonably.

Description

Pedestrian attribute identification method, system, computer equipment and storage medium
Technical Field
The present application relates to the field of object detection, and in particular, to a pedestrian attribute identification method, system, computer device, and storage medium.
Background
The image target detection algorithm is an important research direction in deep learning. Before deep learning, traditional target detection mainly used hand-crafted features, generating candidate boxes through selective search and then performing classification and regression. Such algorithms include the Viola-Jones face detection algorithm, the HOG (Histograms of Oriented Gradients) detector with a Support Vector Machine (SVM), and the DPM (Deformable Parts Model) algorithm that extends HOG, among others.
Static-image object detection based on deep learning developed mainly from the R-CNN detector, in which object candidate boxes generated by an unsupervised algorithm are classified with a convolutional neural network. The model is scale-invariant, but the computational cost of R-CNN training and inference grows linearly with the number of candidate boxes. To alleviate this computational bottleneck, Faster R-CNN proposed setting anchor boxes so that the network learns the subject in a more targeted way, and adopted an RPN (region proposal network) to extract candidate boxes, reaching 27.2% mAP on the COCO dataset. Then, in single-stage target detection, methods represented by the YOLO and SSD algorithms use a feature pyramid network structure to predict small targets with shallow features and large targets with deep features; Joseph Redmon's YOLOv3 reaches 33% mAP, and Zhang et al.'s RefineDet reaches a higher 41.8%. In the field of video target detection, the Deep Feature Flow of Dai et al. estimates optical flow on non-key video frames with a FlowNet network, and the feature maps of non-key frames are obtained by bilinear warping of the features extracted on key frames. Wang et al. introduced a temporal convolutional neural network to re-score each tubelet, re-evaluating the confidence of each candidate box with temporal information. The THP method of Zhu et al. proposed sparse recursive feature aggregation and temporally adaptive key-frame selection, reaching 78.6% mAP on the ImageNet VID video detection dataset. Among two-stage detection algorithms there are also the improved feature networks HyperNet, MS-CNN, PVANet and Light-Head R-CNN; MR-CNN, FPN and CRAFT with more accurate region proposal networks; R-FCN, CoupleNet, Mask R-CNN and Cascade R-CNN with more refined ROI classification; the sample post-processing methods OHEM, Soft-NMS and A-Fast-RCNN; and MegDet, a neural network trained with larger batches.
The essence of an anchor is a candidate box, with the main ideas mostly originating from DenseBox in 2015 and UnitBox in 2016; anchor-free methods then saw a veritable explosion in 2019. They can be classified into the keypoint-based CornerNet, CenterNet and ExtremeNet and the dense-prediction FSAF, FCOS and FoveaBox, all of which perform well in the target detection direction.
Entering 2020, neural architecture search has become a hotspot of recent deep learning algorithms. Reinforcement-learning-based neural architecture search generates the model description of a neural network with a recurrent neural network, and gradient-based neural architecture search has also been proposed. For transferable architecture learning in the field of scalable image recognition, a module is first built by searching for a structure on a small dataset and then transferred to a large dataset. Hierarchical representations for efficient structure search provide a variant of scalable evolutionary search, describing the structure of a neural network hierarchically. The PNASNet method adopts an optimization strategy based on a sequence model to learn the structure of a convolutional neural network. Auto-Keras uses Bayesian optimization to guide network morphism and improve NAS efficiency. NASBOT proposes a neural structure search framework based on Gaussian processes. DARTS formulates the task in a differentiable way, addressing the scalability problem of structure search.
Many researchers have made progress in the field of object detection, but many problems remain in practical design and use, mainly in the following two aspects:
(1) The effect of video target detection in practical applications is limited, and how to improve its accuracy in practice remains an open problem. Specifically, current target detection has limited feature extraction capability. For the identification of pedestrian attributes in scenic spots under a monitoring scene, as the semantic information of the network deepens, the targets in the video become semantically richer but their resolution becomes increasingly blurred, so detection accuracy is low. Because of this, pedestrians in scenic spots cannot be extracted efficiently, which affects the statistics of pedestrian attributes in the scenic spots.
(2) The attribute recognition effect of target identification still needs improvement; in particular, small targets and occluded targets under surveillance video remain a great challenge. Specifically, current target detection algorithms arrange multi-layer detectors by constructing a feature pyramid. How to further improve the detection effect at the feature fusion stage so as to generate more discriminative features, and how to judge the popular visitor groups of scenic spots, are problems that still need to be solved.
At present, no effective solution has been proposed for the problems that target detection technology cannot effectively identify pedestrians with different attributes and cannot effectively monitor the flow of pedestrians of each attribute.
Disclosure of Invention
The embodiments of the present application provide a pedestrian attribute identification method, system, computer equipment and storage medium, which at least solve the problems in the related art that pedestrians with different attributes in scenic spots cannot be effectively identified and that the flow of pedestrians of each attribute cannot be effectively monitored.
In a first aspect, an embodiment of the present application provides a method for identifying pedestrian attributes, where the method includes: acquiring a real-time video, extracting an image frame from the real-time video, inputting the image frame into a trained target detection model, and obtaining a pedestrian image output by the trained target detection model, where the trained target detection model is a neural network model for pedestrian target detection obtained by training with a pedestrian image sample set; inputting the pedestrian image into a trained pedestrian recognition model, and obtaining a classification result output by the trained pedestrian recognition model according to preset pedestrian attributes, where the preset pedestrian attributes comprise a gender attribute and an age attribute; and obtaining the number and proportion of pedestrians of different gender and age attributes according to the classification result.
In some of these embodiments, the trained target detection model includes a feature extraction network and a prediction network. Acquiring a real-time video, extracting an image frame from the real-time video, inputting the image frame into the trained target detection model, and obtaining the pedestrian image output by the trained target detection model comprises the following steps: acquiring a real-time video; obtaining images to be detected of the same place over a continuous period of time according to the real-time video; inputting the image to be detected into the feature extraction network, and obtaining a shallow feature map, a middle feature map and a deep feature map of the image to be detected through a plurality of residual modules in the feature extraction network, where each residual module comprises at least one residual block, channel-wise attention is screened out in the residual blocks by learning and exploiting the correlation between feature map channels, and the output of a residual block is concatenated with the feature map of its bypass connection branch as the input feature map of the next residual block; and inputting the shallow, middle and deep feature maps into the prediction network for fusion to obtain one or more pedestrian images in the image to be detected.
In some embodiments, screening out channel-wise attention in the residual block by learning and exploiting the correlation between feature map channels, and concatenating the output of the residual block with the feature map of the bypass connection branch as the input feature map of the next residual block, includes: performing a 1×1 convolution on the image to be detected, performing mixed depth-separable convolution for feature extraction, and outputting a feature map; inputting the feature map to a channel attention module and a feature map attention module respectively; in the channel attention module, pooling, reshaping, raising the dimension of and compressing the features of the feature map, multiplying the output with the input of the channel attention module, and performing a dimension-reducing convolution; in the feature map attention module, grouping the feature maps, extracting features through mixed depth-separable convolutions, concatenating the outputs of each group and performing a dimension-reducing convolution; and performing element-wise addition on the results of the channel attention module and the feature map attention module, then concatenating the output of the residual block with the feature map of the bypass connection branch as the input feature map of the next residual block.
In some of these embodiments, the prediction network is a cross bidirectional feature pyramid module.
In some of these embodiments, the image to be detected is input into the trained feature extraction network, where the image is a three-channel image. The three-channel image is convolved by a 3×3 convolution and then input into a residual network, where the residual network comprises, from the input end to the output end, a first, second, third, fourth, fifth, sixth and seventh residual module, with the corresponding numbers of residual blocks being 1, 2, 3, 4 and 1. A shallow feature map is obtained at the fourth residual module, a middle feature map at the fifth residual module, and a deep feature map at the sixth residual module. Three fusion units are arranged at the outputs of the third and seventh residual modules to fuse adjacent two-layer or three-layer features; seven fusion units of equal per-layer resolution are arranged on the fourth, fifth and sixth residual modules, the penultimate fusion unit of the fourth, fifth and sixth residual modules fuses the feature maps together, and the fusion method of a fusion unit is upsampling or downsampling. A head prediction module is connected after the fusion units of the fourth, fifth and sixth residual modules respectively, and the position of the target to be detected, the size of its bounding box and the confidence are obtained through the head prediction module. A condensed sketch of this wiring follows.
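As an illustrative, non-limiting sketch of the wiring just described (PyTorch is assumed; the internals of each residual module follow FIG. 2, and stem_channels is a free parameter not fixed by this embodiment), the backbone taps the fourth, fifth and sixth residual modules for the shallow, middle and deep feature maps:

```python
import torch.nn as nn

class Backbone(nn.Module):
    """Sketch: 3x3 stem followed by seven residual modules with three taps."""
    def __init__(self, stem_channels, residual_modules):
        super().__init__()
        assert len(residual_modules) == 7
        self.stem = nn.Conv2d(3, stem_channels, 3, padding=1)  # 3x3 stem convolution
        self.stages = nn.ModuleList(residual_modules)          # seven residual modules

    def forward(self, x):
        x = self.stem(x)
        taps = {}
        for i, stage in enumerate(self.stages, start=1):
            x = stage(x)
            if i in (4, 5, 6):   # shallow / middle / deep feature maps
                taps[i] = x
        return taps[4], taps[5], taps[6]
```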
In some of these embodiments, the trained pedestrian recognition model is the trained feature extraction network in the trained target detection model.
In some embodiments, obtaining the number and proportion of pedestrians of different gender and age attributes according to the classification result includes: storing the classification results in a pedestrian attribute text, and counting the word frequencies of the classification results in the pedestrian attribute text to obtain the number and proportion of pedestrians of different gender and age attributes, where the gender attribute comprises male and female, and the age attribute comprises teenager, young adult and elderly. A small counting sketch follows.
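As a minimal counting sketch, under the assumption that each classification result is written into the pedestrian attribute text as a whitespace-separated token (the file name below is illustrative):

```python
from collections import Counter

def attribute_statistics(text_path='pedestrian_attributes.txt'):
    """Count word frequencies in the attribute text and derive proportions."""
    with open(text_path, encoding='utf-8') as f:
        counts = Counter(f.read().split())        # pedestrians per attribute
    total = sum(counts.values()) or 1
    ratios = {attr: n / total for attr, n in counts.items()}
    return counts, ratios
```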
In a second aspect, an embodiment of the present application provides a pedestrian attribute identification system, including: an acquisition module for acquiring a real-time video, extracting an image frame from the real-time video, inputting the image frame into a trained target detection model, and obtaining the pedestrian image output by the trained target detection model; a recognition module for inputting the pedestrian image into a trained pedestrian recognition model and obtaining the classification result output by the trained pedestrian recognition model according to preset pedestrian attributes, where the preset pedestrian attributes comprise a gender attribute and an age attribute; and a counting module for obtaining the number and proportion of pedestrians of different gender and age attributes according to the classification result.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the method for identifying a pedestrian attribute according to the first aspect when the processor executes the computer program.
In a fourth aspect, embodiments of the present application provide a storage medium having stored thereon a computer program which, when executed by a processor, implements a method for identifying a pedestrian attribute as described in the first aspect above.
Compared with the related art, the pedestrian attribute identification method, system, computer equipment and storage medium provided by the embodiments of the application solve the problems that pedestrians with different attributes in scenic spots cannot be effectively identified and that the flow of pedestrians of each attribute cannot be effectively monitored. Target detection methods in the related art have low detection accuracy and a poor detection effect on small and occluded targets, so pedestrians in scenic spots cannot be extracted efficiently, which affects the statistics of pedestrian flows of different attributes in the scenic spots. Aiming at the problem of low detection accuracy, this scheme proposes a residual block with two properties. 1. The residual block adopts mixed depth-separable convolution, i.e. different channels are assigned different convolution kernels to obtain receptive-field feature maps of different sizes, so that the backbone network extracts more robust features that account for targets of different sizes in the video, which facilitates the localization and classification of targets. 2. Different receptive fields are obtained within the residual block through different convolution kernels, and foreground (target) feature extraction is strengthened by combining a feature map attention mechanism and a channel attention mechanism, thereby weakening background information. This scheme also designs a cross bidirectional feature pyramid module which, by fully optimizing the way feature semantic information and resolution are combined, is more robust to target detection accuracy in video. Aiming at the poor detection effect on pedestrian attributes, the scheme proposes this network architecture to generate more discriminative features. In addition, pedestrians are detected by the target detection model, their attributes are judged by the recognition model, and the popular visitor groups of the scenic spot are judged from those attributes, solving the problem that the popular groups of a scenic spot could not be obtained in the prior art. Specifically, the method designs a new residual structure by combining a channel attention mechanism and a feature map attention mechanism in the feature extraction network, learning and exploiting the correlation between channels to screen out channel-wise attention. A convolution kernel attention mechanism is also introduced into the feature extraction network: receptive fields (convolution kernels) of different sizes produce different effects on targets of different scales (far, near, large, small), and combining the two properties yields a more robust feature extraction network. Depth-separable convolution kernels of different sizes (3×3, 5×5, 7×7 and 9×9) are used in the convolution kernel attention mechanism, so that receptive fields of different sizes are obtained without increasing the floating-point operation count.
After the primary feature extraction is completed, in order to give the extracted features high semantic information, a cross bidirectional feature pyramid module is designed in the prediction network: the penultimate feature fusion unit aggregates local context information of three scales, the deep features contain more semantic information and a large enough receptive field, and the shallow features contain more detail information, so this fusion is closer to the goal of fusing global and local features and generates more discriminative features. According to the invention, pedestrian images can be extracted from real-time video and their attributes classified to obtain pedestrian flow information for different attributes; by monitoring this flow information, traffic flow and the scope of business services can be planned reasonably. In addition, pedestrian flows of different attributes indirectly display the popularity and popular visitor groups of sightseeing spots, which makes it possible to allocate management and maintenance personnel effectively and take precautions against emergencies in areas with larger flows.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flow chart of a method of identifying pedestrian attributes according to an embodiment of the present application;
FIG. 2 is a network architecture diagram of one residual block in a feature extraction network according to an embodiment of the present application;
FIG. 3 is a diagram of a cross-bi-directional feature pyramid module architecture in a prediction network according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application;
FIG. 5 is a network architecture diagram of a pedestrian recognition model in accordance with an embodiment of the present application;
FIG. 6 is a block diagram of a pedestrian attribute identification system according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of one or more embodiments of the present description as detailed in the accompanying claims.
It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, individual steps described in this specification, in other embodiments, may be described as being split into multiple steps; while various steps described in this specification may be combined into a single step in other embodiments.
The present embodiment provides a pedestrian attribute identification method. FIG. 1 is a flowchart of a pedestrian attribute identification method according to an embodiment of the present application; as shown in FIG. 1, the flow includes obtaining an image, attribute identification, and counting the flow of pedestrians of each attribute. Specifically, the method includes:
Step 101, acquiring a real-time video, extracting an image frame from the real-time video, and inputting the image frame into a trained target detection model to obtain a pedestrian image output by the trained target detection model.
In this embodiment, images may be collected from surveillance video. Specifically, in the monitored video, L video segments containing the target to be detected are found; Vi denotes the i-th segment, which contains Ni video frames, and Mi frames are selected from the Ni frames as training and test images, so that the L segments together provide the training and test images. A frame-sampling sketch follows.
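A hedged sketch of this frame selection, assuming OpenCV and uniform sampling (the embodiment does not fix the sampling rule, so the strategy below is illustrative):

```python
import cv2

def sample_frames(video_path, num_samples):
    """Uniformly sample num_samples (M_i) frames from one video segment V_i."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))   # N_i frames in this segment
    step = max(total // max(num_samples, 1), 1)
    frames = []
    for idx in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
        if len(frames) == num_samples:
            break
    cap.release()
    return frames
```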
In some of these embodiments, in order to capture pedestrians clearly, an indoor camera is mounted at a height of typically 2-2.5 meters with a pitch angle of no more than 15 degrees; the outdoor installation height is generally 3-3.5 meters. For example, the real-time video detection camera may have 2 megapixels and a 12 mm focal length, the installation height may be about 2.3 meters, and the camera is responsible for monitoring pedestrians about 15 meters away. To obtain more accurate pedestrian traffic data, cameras are installed on both sides of the road to monitor pedestrians in the two directions of the road respectively.
In this embodiment, by installing a camera with a reasonable focal length and height, the problems of distant pedestrians appearing too small and nearby pedestrians occluding each other are avoided. In actual engineering, the installation height and angle of the camera directly influence the clarity of the pedestrian photos it captures and hence the detection accuracy of the network. The installation data above can greatly improve that accuracy; they are the better angle and height values obtained through engineering tests in this embodiment.
In some embodiments, M video images are selected from the N video images of a video segment, and data augmentation is applied to these training and test images.
In this embodiment, the data may be augmented by geometric transformation: the P target images of each class are expanded through translation, rotation (45, 90, 180 and 270 degrees), image shrinkage (to 1/3 and 1/2), Mosaic data augmentation and shear transformation. One part of the augmented images is used as training data and the other part as test data, and the training and test data do not intersect. A sketch of these transforms follows.
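A minimal augmentation sketch covering the rotations, shrinkage and shear named above (OpenCV assumed; translation and Mosaic augmentation are omitted for brevity, and the shear factor is an illustrative value):

```python
import cv2
import numpy as np

def augment(image):
    """Return geometrically transformed copies of one target image."""
    h, w = image.shape[:2]
    out = []
    for angle in (45, 90, 180, 270):                       # rotations
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        out.append(cv2.warpAffine(image, m, (w, h)))
    for scale in (1 / 3, 1 / 2):                           # image shrinkage
        out.append(cv2.resize(image, None, fx=scale, fy=scale))
    shear = np.float32([[1, 0.2, 0], [0, 1, 0]])           # shear transform
    out.append(cv2.warpAffine(image, shear, (w, h)))
    return out
```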
In some of these embodiments, the data is manually annotated prior to training. Specifically, after python and lxml environments are configured on a Windows, Linux or MAC operating system, image label boxes of the target to be detected are drawn with the LabelImg annotation tool. The annotated image information is stored as XML files conforming to the PASCAL VOC format, and the XML annotation format can be converted into the label data format matching whichever training framework is used. A parsing sketch follows.
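For illustration, a PASCAL VOC XML file produced by LabelImg can be read into (class, box) tuples like this before conversion to a framework-specific label format (a hedged sketch using only the standard library):

```python
import xml.etree.ElementTree as ET

def parse_voc_xml(xml_path):
    """Read LabelImg/PASCAL VOC annotations into (class_name, box) tuples."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter('object'):
        name = obj.find('name').text
        bb = obj.find('bndbox')
        box = tuple(int(float(bb.find(k).text))
                    for k in ('xmin', 'ymin', 'xmax', 'ymax'))
        boxes.append((name, box))
    return boxes
```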
In this embodiment, the trained target detection model is obtained by training on the annotated data. Specifically, images are collected from surveillance video, data augmentation is performed on the selected images, one part of the augmented images is used as training data and the other part as test data (with no intersection between them), and the data is annotated to obtain the image label boxes of the target to be detected.
In some of these embodiments, the trained target detection model includes a feature extraction network and a prediction network. Acquiring a real-time video, extracting an image frame from it, inputting the image frame into the trained target detection model, and obtaining the pedestrian image output by the model comprises: acquiring a real-time video; obtaining images to be detected of the same place over a continuous period of time from the real-time video; inputting the image to be detected into the feature extraction network and obtaining shallow, middle and deep feature maps of the image through a plurality of residual modules, where each residual module comprises at least one residual block, channel-wise attention is screened out in the residual blocks by learning and exploiting the correlation between feature map channels, and the output of a residual block is concatenated with the feature map of its bypass connection branch as the input feature map of the next residual block; and inputting the shallow, middle and deep feature maps into the prediction network for fusion to obtain one or more pedestrian images in the image to be detected.
In this embodiment, the image to be detected is input into the feature extraction network, and the specific values of the network depth D and width W are determined experimentally from the resolution of the video images fed to the neural network. The overall design of the feature extraction network follows these observations: scaling any of network depth, width and input resolution can improve model accuracy, but the return on accuracy diminishes as the depth deepens (richer and more complex features are captured), the width increases (finer-grained features are captured, making training easier) and the input resolution improves (finer-grained patterns are captured). The feature extraction network we design balances these three factors with attention to detail. The resolution of the network input image is X×X; according to the computational cost of convolution, doubling the network depth doubles the floating-point operations while doubling the network width quadruples them, so the network depth D is chosen after the input resolution is fixed, and finally the width W of the feature extraction network is chosen given the input resolution and the network depth. The scaling relation is sketched below.
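This trade-off can be summarized by a compound-scaling relation, stated here only as an assumption consistent with the counts above, with $D$ the depth, $W$ the width and $R$ the input resolution:

$$\mathrm{FLOPs} \propto D \cdot W^{2} \cdot R^{2}$$

so doubling $D$ doubles the floating-point cost while doubling $W$ (or $R$) multiplies it by four, which is why the resolution is fixed first, then the depth, then the width.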
In some embodiments, screening out channel-wise attention in the residual block by learning and exploiting the correlation between feature map channels, and concatenating the output of the residual block with the feature map of the bypass connection branch as the input feature map of the next residual block, includes: performing a 1×1 convolution on the image to be detected, performing mixed depth-separable convolution to extract features, and outputting a feature map; inputting the feature map to a channel attention module and a feature map attention module respectively; in the channel attention module, pooling, reshaping, raising the dimension of and compressing the features of the feature map, multiplying the output with the module input and performing a dimension-reducing convolution; in the feature map attention module, grouping the feature maps, extracting features through mixed depth-separable convolutions, concatenating the outputs of each group and performing a dimension-reducing convolution; and performing element-wise addition on the results of the two modules, then concatenating the output of the residual block with the feature map of the bypass connection branch as the input feature map of the next residual block.
In this embodiment, referring to FIG. 2, the feature extraction network is composed of a number of residual blocks. The 1×1 convolution outputs C channels; the C channels are evenly divided into 4 groups of C/4 feature channels each, and each group of C/4 channels corresponds to one depth-separable convolution: 3×3 corresponds to C/4 channels, 5×5 to C/4 channels, 7×7 to C/4 channels, and 9×9 to C/4 channels. The mixed depth-separable convolution increases the kernel size as 2i+1 (1 ≤ i ≤ 4), starting from 3×3 as the first kernel, and the largest depth-separable convolution used in the present invention is 9×9. A 1×1 convolution, a batch normalization operation and an H-Swish activation are then applied to the output of the mixed depth-separable convolution. The channel attention mechanism and the feature map attention mechanism are applied to the C channel features respectively, channel-wise attention is screened out by learning and exploiting the correlation between feature map channels, and the output of the residual block is concatenated with the feature map of the bypass connection branch as the input feature map of the next residual block. A sketch of the mixed convolution follows.
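A hedged PyTorch sketch of this mixed depth-separable convolution, assuming C is divisible by 4 (kernel sizes follow 2i+1 for i = 1..4):

```python
import torch
import torch.nn as nn

class MixedDepthwiseConv(nn.Module):
    """Split C channels into 4 groups; each group gets its own depthwise kernel."""
    def __init__(self, channels):
        super().__init__()
        g = channels // 4
        self.branches = nn.ModuleList([
            nn.Conv2d(g, g, k, padding=k // 2, groups=g)    # depthwise convolution
            for k in (3, 5, 7, 9)                           # 2i+1, i = 1..4
        ])
        self.fuse = nn.Sequential(                          # 1x1 conv + BN + H-Swish
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.Hardswish(),
        )

    def forward(self, x):
        groups = torch.chunk(x, 4, dim=1)                   # four groups of C/4
        y = torch.cat([b(g) for b, g in zip(self.branches, groups)], dim=1)
        return self.fuse(y)
```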
In some embodiments, the image to be detected is input into the feature extraction network, and shallow, middle and deep feature maps of the image are obtained through a plurality of residual modules in the feature extraction network, comprising: inputting the image into the feature extraction network, where the image is scaled to a three-channel map of equal width and height; convolving the three-channel map with a 3×3 convolution and inputting it into the residual network, where the residual network comprises a first to a seventh residual module whose corresponding numbers of residual blocks are 1, 2, 3, 4 and 1; and obtaining a shallow feature map at the fourth residual module as features for predicting small targets, a middle feature map at the fifth residual module as features for predicting medium targets, and a deep feature map at the sixth residual module as features for predicting large targets.
In some of these embodiments, pooling, reshaping, raising the dimension of and compressing the features of the feature map in the channel attention module, multiplying the output with the module input and performing the dimension-reducing convolution includes: performing a global average pooling operation on the feature map in the channel attention module; reshaping the feature map and convolving the reshaped feature map with a 1×1 convolution to raise the dimension; convolving the dimension-raised feature map with a 1×1 convolution to compress the number of feature channels; and expanding the number of feature channels to obtain the output, where the output is a one-dimensional feature vector that is multiplied with the feature map and convolved with a 1×1 convolution for feature fusion. A sketch follows.
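A sketch of this channel attention branch (the expansion and compression ratios are illustrative assumptions, and the reshaping is folded into the 1×1 convolutions):

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global pooling -> 1x1 dimension raise -> compress -> expand -> reweight."""
    def __init__(self, channels, expand=2, reduce=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                        # global average pooling
        self.up = nn.Conv2d(channels, channels * expand, 1)        # dimension increase
        self.down = nn.Conv2d(channels * expand, channels // reduce, 1)  # compression
        self.restore = nn.Conv2d(channels // reduce, channels, 1)  # back to C channels
        self.act = nn.Sigmoid()
        self.fuse = nn.Conv2d(channels, channels, 1)               # dimension-reducing conv

    def forward(self, x):
        w = self.pool(x)                  # B x C x 1 x 1 one-dimensional descriptor
        w = self.act(self.restore(self.down(self.up(w))))
        return self.fuse(x * w)           # multiply with the module input, then 1x1 conv
```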
In some embodiments, extracting features through mixed depth-separable convolutions after the feature map attention module groups the feature maps, then concatenating the outputs of each group and performing the dimension-reducing convolution, comprises: dividing the feature maps into four groups and extracting features through mixed depth-separable convolution, where the mixed depth-separable convolution starts with 3×3 as the first kernel and increases the kernel size as 2i+1 (1 ≤ i ≤ 4); performing a 1×1 convolution on the output of the mixed depth-separable convolution to obtain four separate groups of convolutions; performing element-wise addition and global average pooling, separating out four groups of fully connected layers and obtaining the four corresponding groups of Softmax values; multiplying the four groups of Softmax values element-wise with the corresponding features; adding the four resulting groups of features element-wise; and fusing the result of the element-wise addition with a 1×1 convolution. A sketch follows.
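A hedged sketch of this kernel-selection branch in the spirit of selective-kernel attention (for brevity each kernel is applied over all C channels rather than over the four channel groups; the reduction ratio is an assumption):

```python
import torch
import torch.nn as nn

class FeatureMapAttention(nn.Module):
    """Four depthwise branches (3/5/7/9), softmax over branches, 1x1 fusion."""
    def __init__(self, channels, reduce=4):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels),
                nn.Conv2d(channels, channels, 1),           # 1x1 after depthwise
            )
            for k in (3, 5, 7, 9)
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)                 # global average pooling
        self.squeeze = nn.Conv2d(channels, channels // reduce, 1)
        self.heads = nn.ModuleList(                         # four FC-style heads
            [nn.Conv2d(channels // reduce, channels, 1) for _ in range(4)]
        )
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        feats = [b(x) for b in self.branches]               # four receptive fields
        s = self.pool(sum(feats))                           # element-wise add + GAP
        z = self.squeeze(s)
        logits = torch.stack([h(z) for h in self.heads])    # one score set per branch
        weights = torch.softmax(logits, dim=0)              # softmax across branches
        out = sum(w * f for w, f in zip(weights, feats))    # weighted element-wise sum
        return self.fuse(out)                               # 1x1 feature fusion
```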
In some of these embodiments, the prediction network is a cross bidirectional feature pyramid module.
In this embodiment, referring to FIG. 3, three fusion units are arranged at the outputs of the third and seventh residual modules to fuse adjacent two-layer or three-layer features; seven fusion units of equal per-layer resolution are arranged on the fourth, fifth and sixth residual modules, the penultimate fusion unit of the fourth, fifth and sixth residual modules fuses the feature maps together, and the fusion method of a fusion unit is upsampling or downsampling. A head prediction module is connected after the fusion units of the fourth, fifth and sixth residual modules respectively, and the positions of pedestrians in the image to be detected, the sizes of their bounding boxes and the confidences are obtained through the head prediction module. It should be noted that in this embodiment the prediction network fuses features of multiple adjacent scales by adding a cross bidirectional aggregation scale module to the EfficientDet feature pyramid network. Referring to FIG. 3, local context information of three scales is aggregated in the penultimate feature fusion unit; the deep features contain more semantic information and a large enough receptive field, the shallow features contain more detail information, and this fusion is closer to the goal of fusing global and local features so as to generate more discriminative features. A fusion-unit sketch follows.
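A minimal sketch of a single fusion unit (the full cross bidirectional wiring of FIG. 3 is not reproduced here; the fast-normalized weighting follows the EfficientDet/BiFPN style the text references and is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionUnit(nn.Module):
    """Resize adjacent-scale features to one resolution and merge them."""
    def __init__(self, channels, num_inputs):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))   # one weight per input scale
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feats, out_size):
        # Upsampling or downsampling every input to the target resolution.
        feats = [F.interpolate(f, size=out_size, mode='nearest') for f in feats]
        w = F.relu(self.w)
        w = w / (w.sum() + 1e-4)                        # fast normalized fusion
        fused = sum(wi * fi for wi, fi in zip(w, feats))
        return self.conv(fused)
```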
In step 101, referring to FIGS. 2-3, part of the residual blocks adopt a combination of a feature-map/channel attention mechanism and a convolution kernel attention mechanism. The former comprises a channel attention module and a feature map attention module, which learn and exploit the correlation between channels to screen out channel-wise attention; the latter produces different effects on targets of different scales (distance and size) by using receptive fields (convolution kernels) of different sizes. Depth-separable convolution kernels of different sizes are used in the convolution kernel attention mechanism, which not only reduces the floating-point operation count but also yields receptive fields of different sizes, strengthening the feature extraction capability of the network so that pedestrians can be detected in video images. After the primary feature extraction, in order to give the extracted features high semantic information, a cross bidirectional feature pyramid module is designed in the prediction network: the penultimate feature fusion unit aggregates local context information of three scales, the deep features contain more semantic information and a large enough receptive field, the shallow features contain more detail information, and this fusion is closer to the goal of fusing global and local features so as to generate more discriminative features. By strengthening the feature extraction network and optimizing the pyramid module, the method can detect targets, especially small targets such as pedestrians in a distant view, without them being submerged in the context background as the network deepens, which improves the accuracy of the pedestrian attribute recognition results in scenic spots.
Step 102, inputting the pedestrian image into a trained pedestrian recognition model, and obtaining the classification result output by the trained pedestrian recognition model according to preset pedestrian attributes, where the preset pedestrian attributes comprise a gender attribute and an age attribute.
In some embodiments, the pedestrian images may be saved in a pedestrian data folder and processed at intervals. For example, a day's real-time video is acquired and processed into pedestrian images of the same place for that day; the images are saved in the pedestrian data folder and marked with the date, and processing the images in the folder yields the statistics of that day's pedestrian traffic.
In some of these embodiments, the trained pedestrian recognition model is the trained feature extraction network in the trained target detection model.
In this embodiment, referring to FIG. 5, the network architecture of the pedestrian recognition model is the feature extraction network without the regression module, i.e. the backbone network of the target detection model, which is not described again here. The global average pooling layer and the fully connected layer of FIG. 2 are used as classifiers to obtain predicted values of the pedestrian attributes at different levels, and each attribute is voted on to obtain the classification result. Specifically, in the pedestrian recognition model, the labels can be set by a multi-label recognition method to male teenager, young man, elderly man, female teenager, young woman and elderly woman. In the label text, the first column may be the path of the pedestrian image in the pedestrian data folder, the second column the male label, the third the female label, the fourth the teenager label, the fifth the young-adult label and the sixth the elderly label. For each pedestrian image it is judged whether the pedestrian possesses each attribute: if so, the label is 1, otherwise 0, and classification finally yields the gender attribute and age attribute of each pedestrian image. A sketch of such a multi-label head follows.
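A hedged sketch of such a multi-label head: global average pooling plus a fully connected layer yield one independent score per label, and the 0.5 decision threshold is an assumption, not a value fixed by this embodiment:

```python
import torch
import torch.nn as nn

LABELS = ['male', 'female', 'teenager', 'young adult', 'elderly']

class AttributeHead(nn.Module):
    """GAP + FC classifier over backbone features, one sigmoid score per label."""
    def __init__(self, in_channels, num_labels=len(LABELS)):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_channels, num_labels)

    def forward(self, features):
        x = self.pool(features).flatten(1)
        return torch.sigmoid(self.fc(x))     # independent probability per label

def classify(probs, threshold=0.5):
    """Turn per-label probabilities into the attribute labels of one pedestrian."""
    return [LABELS[i] for i, p in enumerate(probs) if p > threshold]
```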
Step 103, obtaining the number and proportion of pedestrians of different gender and age attributes according to the classification result.
For example, the pedestrian traffic of each attribute may be counted once a day, and the counted result is taken as that day's pedestrian traffic for the different attributes. Corresponding to the pedestrian data folder, a pedestrian traffic text may be generated daily and named by the date.
In this embodiment, pedestrian traffic information for the different attributes is obtained by counting the number of pedestrians of each attribute, and by monitoring this information, traffic flow and the scope of business services can be planned reasonably.
In some embodiments, obtaining the number and proportion of pedestrians of different gender and age attributes according to the classification result comprises: storing the classification results in a pedestrian attribute text and counting the word frequencies of the classification results in the text to obtain the number and proportion of pedestrians of different gender and age attributes, where the gender attribute comprises male and female, and the age attribute comprises teenager, young adult and elderly.
In this embodiment, the classification results are saved in the pedestrian attribute text and a statistical chart is drawn from the word-frequency statistics; for example, the readability of the statistics is improved with a bar chart, a line chart or a pie chart. Illustratively, a histogram is drawn from the word-frequency statistics, with the attribute on one axis and its frequency of appearance in the text on the other, so that the crowds popular at the sightseeing spot can be inferred from the frequency counts of each attribute. A plotting sketch follows.
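For illustration, the bar chart can be drawn from the word-frequency counts roughly as follows (matplotlib assumed; axis labels are illustrative):

```python
import matplotlib.pyplot as plt

def plot_statistics(counts):
    """Draw a bar chart of pedestrian counts per attribute."""
    attrs = list(counts)
    plt.bar(attrs, [counts[a] for a in attrs])   # one bar per attribute
    plt.xlabel('pedestrian attribute')
    plt.ylabel('word frequency in the attribute text')
    plt.title('Pedestrian traffic by attribute')
    plt.show()
```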
Through steps 101 to 103, the invention provides a pedestrian attribute identification method in which, in the feature extraction part, the network can be deepened and widened according to the resolution of the input image. Deepening the network abstracts features layer by layer, continuously refining and extracting knowledge; widening each layer lets the network learn richer features, such as texture features of different orientations and frequencies. After the primary feature extraction, adjacent features of multiple scales are fused so that the penultimate feature fusion unit aggregates local context information of three scales, obtaining more semantic information while retaining more detail information, which improves the feature extraction precision of the model. Compared with the prior art, the invention combines a feature-map/channel attention mechanism and a convolution kernel attention mechanism in a single residual block: the former comprises a channel attention module and a feature map attention module that learn and exploit the correlation between channels and screen out channel-wise attention, while the latter produces different effects on targets of different scales (distance and size) by using receptive fields (convolution kernels) of different sizes. Depth-separable convolution kernels of different sizes are used in the convolution kernel attention mechanism, which not only reduces the floating-point operation count but also yields receptive fields of different sizes, strengthening the feature extraction capability of the network so that pedestrians can be detected in video images. After the primary feature extraction, features are fused through the cross bidirectional feature pyramid, so that even small targets under surveillance video can be detected without being submerged in the context background as the network deepens, which improves target detection accuracy. According to the invention, pedestrian images can be extracted from real-time video and their attributes classified to obtain pedestrian flow information for different attributes; by monitoring this information, traffic flow and the scope of business services can be planned reasonably. In addition, pedestrian flows of different attributes indirectly display the popularity and popular visitor groups of sightseeing spots, making it possible to allocate management and maintenance personnel effectively and take precautions against emergencies in areas with larger flows.
Based on the same technical concept, fig. 6 exemplarily shows a pedestrian attribute recognition system provided by an embodiment of the present invention, including:
the acquiring module 20 is configured to extract an image frame from the real-time video, and input the image frame to the trained target detection model to obtain a pedestrian image output by the trained target detection model.
The recognition module 22 is configured to input a pedestrian image to the trained pedestrian recognition model, and obtain a classification result output by the trained pedestrian recognition model according to a preset pedestrian attribute; the preset pedestrian attributes comprise gender attributes and age attributes.
The counting module 24 is used to obtain the number and proportion of pedestrians of different gender and age attributes according to the classification result.
The present embodiment also provides an electronic device comprising a memory 304 and a processor 302, the memory 304 having stored therein a computer program, the processor 302 being arranged to run the computer program to perform the steps of any of the method embodiments described above.
In particular, the processor 302 may include a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or may be configured to implement one or more integrated circuits of embodiments of the present application.
Memory 304 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory 304 may comprise a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 304 may include removable or non-removable (or fixed) media, where appropriate, and may be internal or external to the data processing apparatus. In a particular embodiment, the memory 304 is Non-Volatile memory. In particular embodiments, memory 304 includes Read-Only Memory (ROM) and Random Access Memory (RAM). Where appropriate, the ROM may be a mask-programmed ROM, a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), an electrically rewritable ROM (EAROM) or FLASH memory, or a combination of two or more of these. Where appropriate, the RAM may be Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), and the DRAM may be Fast Page Mode DRAM (FPM DRAM), Extended Data Out DRAM (EDO DRAM), Synchronous Dynamic Random Access Memory (SDRAM), etc.
Memory 304 may be used to store or cache various data files that need to be processed and/or communicated, as well as possible computer program instructions for execution by processor 302.
The processor 302 implements the method of identifying any pedestrian attribute in the above-described embodiments by reading and executing the computer program instructions stored in the memory 304.
Optionally, the electronic apparatus may further include a transmission device 306 and an input/output device 308, where the transmission device 306 is connected to the processor 302, and the input/output device 308 is connected to the processor 302.
The transmission device 306 may be used to receive or transmit data via a network. Specific examples of the network described above may include a wired or wireless network provided by a communication provider of the electronic device. In one example, the transmission device includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through the base station to communicate with the internet. In one example, the transmission device 306 may be a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
The input-output device 308 is used to input or output information. For example, the input/output device may be a display screen, a speaker, a microphone, a mouse, a keyboard or other devices. In this embodiment, the input information may be real-time video, and the output information may be classification results, statistical charts and the like.
Alternatively, in the present embodiment, the above-mentioned processor 302 may be configured to execute the following steps by a computer program:
s101, extracting an image frame from a real-time video, and inputting the image frame into a trained target detection model to obtain a pedestrian image output by the trained target detection model;
s102, inputting the pedestrian image into a trained pedestrian recognition model, and obtaining a classification result output by the trained pedestrian recognition model according to preset pedestrian attributes; the preset pedestrian attributes comprise gender attributes and age attributes;
s103, obtaining the number of pedestrians and the proportion of pedestrians with different attribute of different ages according to the classification result.
It should be noted that, specific examples in this embodiment may refer to examples described in the foregoing embodiments and alternative implementations, and this embodiment is not repeated herein.
In addition, in combination with the pedestrian attribute identification method in the above embodiment, the embodiment of the application may provide a storage medium to be implemented. The storage medium has a computer program stored thereon; the computer program, when executed by a processor, implements the method of identifying any pedestrian attribute in the above-described embodiments.
It should be understood by those skilled in the art that the technical features of the above embodiments may be combined in any manner, and for brevity, all of the possible combinations of the technical features of the above embodiments are not described, however, they should be considered as being within the scope of the description provided herein, as long as there is no contradiction between the combinations of the technical features.
The foregoing examples merely represent several embodiments of the present application; their description is specific and detailed, but should not for that reason be construed as limiting the scope of the application. It should be noted that those skilled in the art may make various modifications and improvements without departing from the spirit of the present application, and such modifications and improvements fall within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (9)

1. A method of identifying pedestrian attributes, the method comprising:
acquiring a real-time video, extracting an image frame from the real-time video, inputting the image frame into a trained target detection model, and obtaining a pedestrian image output by the trained target detection model; the trained target detection model is a neural network model for pedestrian target detection obtained after training with a pedestrian image sample set;
inputting the pedestrian image into a trained pedestrian recognition model, and obtaining a classification result output by the trained pedestrian recognition model according to preset pedestrian attributes; wherein the preset pedestrian attribute comprises a gender attribute and an age attribute;
obtaining the number and proportion of pedestrians with different gender attributes and different age attributes according to the classification result;
wherein the trained target detection model comprises a feature extraction network and a prediction network;
acquiring a real-time video, extracting an image frame from the real-time video, inputting the image frame into a trained target detection model, and obtaining a pedestrian image output by the trained target detection model comprises the following steps:
acquiring a real-time video;
obtaining images to be detected of the same location over a continuous period of time according to the real-time video;
inputting the image to be detected into the feature extraction network, and obtaining a shallow feature map, a middle feature map and a deep feature map of the image to be detected through a plurality of residual modules in the feature extraction network; each residual module comprises at least one residual block, attention for each channel is screened out within the residual block by learning and utilizing the correlation among feature map channels, and the output of the residual block is spliced with the feature map of a bypass connection branch to serve as the input feature map of the next residual block;
and inputting the shallow layer feature map, the middle layer feature map and the deep layer feature map into a prediction network for fusion to obtain one or more pedestrian images in the image to be detected.
2. The pedestrian attribute identification method according to claim 1, wherein screening out attention for each channel within the residual block by learning and utilizing the correlation between feature map channels, and splicing the output of the residual block with the feature map of the bypass connection branch as the input feature map of the next residual block, comprises:
performing a 1×1 convolution on the image to be detected, then performing a mixed depthwise separable convolution for feature extraction, and outputting a feature map;
inputting the feature map into a channel attention module and a feature map attention module respectively;
in the channel attention module, pooling, reshaping, dimension-raising and feature-compressing the feature map, multiplying the output by the input of the channel attention module, and performing a dimension-reducing convolution;
in the feature map attention module, grouping the feature maps, performing feature extraction through mixed depthwise separable convolutions, splicing the outputs of the groups, and performing a dimension-reducing convolution;
performing an element-level addition on the results of the channel attention module and the feature map attention module, and splicing the output of the residual block with the feature map of the bypass connection branch as the input feature map of the next residual block.
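By way of illustration only, a minimal PyTorch sketch of the residual block of claim 2 follows. Only the overall wiring (1×1 convolution, mixed depthwise separable convolution, the two attention branches, element-level addition, and splicing with the bypass branch) is taken from the claim; the kernel set, reduction ratio, group count and the final 1×1 convolution that restores the channel count are assumptions.

import torch
import torch.nn as nn

class MixedDepthwiseConv(nn.Module):
    """Splits the channels and applies depthwise convolutions with mixed
    kernel sizes; the kernel set (3, 5) is an assumption."""
    def __init__(self, ch, kernels=(3, 5)):
        super().__init__()
        self.splits = [ch // len(kernels)] * len(kernels)
        self.splits[0] += ch - sum(self.splits)
        self.convs = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2, groups=c)
            for c, k in zip(self.splits, kernels))

    def forward(self, x):
        parts = torch.split(x, self.splits, dim=1)
        return torch.cat([conv(p) for conv, p in zip(self.convs, parts)], dim=1)

class DualAttentionResidualBlock(nn.Module):
    def __init__(self, ch, reduction=4, groups=2):
        super().__init__()
        assert ch % groups == 0
        self.pre = nn.Conv2d(ch, ch, 1)             # 1x1 convolution
        self.mix = MixedDepthwiseConv(ch)           # mixed depthwise separable conv
        # Channel attention branch: pool, compress, raise dimension, gate.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())
        self.channel_reduce = nn.Conv2d(ch, ch, 1)  # dimension-reducing conv
        # Feature map attention branch: group, convolve per group, splice.
        self.groups = groups
        self.group_convs = nn.ModuleList(
            MixedDepthwiseConv(ch // groups) for _ in range(groups))
        self.group_reduce = nn.Conv2d(ch, ch, 1)    # dimension-reducing conv
        # Reduces the splice of block output and bypass branch (assumption).
        self.out_reduce = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, x):
        f = self.mix(self.pre(x))
        # Channel attention: gate the module input by its learned weights.
        ca = self.channel_reduce(f * self.channel_att(f))
        # Feature map attention: per-group feature extraction, then splice.
        parts = torch.chunk(f, self.groups, dim=1)
        fa = torch.cat([c(p) for c, p in zip(self.group_convs, parts)], dim=1)
        fa = self.group_reduce(fa)
        y = ca + fa                                 # element-level addition
        return self.out_reduce(torch.cat([y, x], dim=1))  # splice with bypass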
3. The pedestrian attribute identification method according to claim 1, wherein the prediction network is a cross bidirectional feature pyramid module.
4. The pedestrian attribute identification method according to claim 3, wherein the image to be detected is input into the trained feature extraction network; wherein the image to be detected is a three-channel image;
the three-channel image is input into a residual network after a 3×3 convolution, wherein the residual network comprises, from the input end to the output end, a first residual module, a second residual module, a third residual module, a fourth residual module, a fifth residual module, a sixth residual module and a seventh residual module, and the numbers of corresponding residual blocks in these residual modules are 1, 2, 3, 4 and 1;
a shallow feature map is obtained from the fourth residual module, a middle feature map from the fifth residual module, and a deep feature map from the sixth residual module;
three fusion units are arranged at the outputs of the third residual module and the seventh residual module to fuse the features of two or three adjacent layers; seven fusion units are arranged on the fourth, fifth and sixth residual modules, where the resolutions within each layer are equal and the feature maps are fused together by the penultimate fusion units of the fourth, fifth and sixth residual modules; the fusion method of the fusion units is upsampling or downsampling;
and a head prediction module is connected after the fusion units of the fourth, fifth and sixth residual modules respectively; through the head prediction module, the position of the target to be detected in the image to be detected, the size of its bounding box, and the confidence are obtained.
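By way of illustration only, one fusion unit of such a pyramid might be sketched as follows; the 1×1 smoothing convolution, the nearest-neighbour resizing and the additive fusion are assumptions, since claim 4 fixes only that the fusion method is upsampling or downsampling.

import torch.nn as nn
import torch.nn.functional as F

class FusionUnit(nn.Module):
    """Fuses two or three adjacent pyramid levels at one target resolution."""
    def __init__(self, ch):
        super().__init__()
        self.proj = nn.Conv2d(ch, ch, 1)  # smooths the fused map (assumption)

    def forward(self, feature_maps, target_hw):
        # Upsample coarser maps and downsample finer maps to the target
        # resolution, then fuse by element-wise addition.
        resized = [F.interpolate(m, size=target_hw, mode="nearest")
                   for m in feature_maps]
        return self.proj(sum(resized))

# Usage, e.g. fusing shallow, middle and deep maps at the middle resolution:
#   fused = FusionUnit(256)([shallow, mid, deep], tuple(mid.shape[-2:]))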
5. The pedestrian attribute identification method according to claim 1, wherein the trained pedestrian recognition model is the trained feature extraction network in the trained target detection model.
6. The pedestrian attribute identification method according to claim 1, wherein obtaining the number and proportion of pedestrians with different gender attributes and different age attributes according to the classification result comprises:
storing the classification results into a pedestrian attribute text, and counting word frequencies of the classification results in the pedestrian attribute text to obtain the number and proportion of pedestrians with different gender attributes and different age attributes; wherein the gender attribute comprises male and female, and the age attribute comprises teenager, young adult and elderly.
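By way of illustration only, the word-frequency statistics of claim 6 might look like the sketch below; the one-result-per-line file format and the file name are assumptions.

from collections import Counter

def attribute_statistics(path="pedestrian_attributes.txt"):
    # The pedestrian attribute text is assumed to hold one classification
    # result per line, e.g. "male teenager".
    with open(path, encoding="utf-8") as f:
        counts = Counter(line.strip() for line in f if line.strip())
    total = sum(counts.values())
    if total == 0:
        return {}
    # Returns {label: (count, proportion)} for each attribute combination.
    return {label: (n, n / total) for label, n in counts.items()}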
7. A pedestrian attribute identification system, comprising:
the acquisition module is used for acquiring a real-time video, extracting an image frame from the real-time video, inputting the image frame into a trained target detection model, and obtaining a pedestrian image output by the trained target detection model;
the recognition module is used for inputting the pedestrian image into a trained pedestrian recognition model and obtaining a classification result output by the trained pedestrian recognition model according to preset pedestrian attributes; wherein the preset pedestrian attribute comprises a gender attribute and an age attribute;
the counting module is used for obtaining the number and proportion of pedestrians with different gender attributes and different age attributes according to the classification result;
wherein the trained target detection model comprises a feature extraction network and a prediction network;
acquiring a real-time video, extracting an image frame from the real-time video, inputting the image frame into a trained target detection model, and obtaining a pedestrian image output by the trained target detection model comprises the following steps:
acquiring a real-time video;
obtaining images to be detected of the same location over a continuous period of time according to the real-time video;
inputting the image to be detected into the feature extraction network, and obtaining a shallow feature map, a middle feature map and a deep feature map of the image to be detected through a plurality of residual modules in the feature extraction network; each residual module comprises at least one residual block, attention for each channel is screened out within the residual block by learning and utilizing the correlation among feature map channels, and the output of the residual block is spliced with the feature map of a bypass connection branch to serve as the input feature map of the next residual block;
and inputting the shallow layer feature map, the middle layer feature map and the deep layer feature map into a prediction network for fusion to obtain one or more pedestrian images in the image to be detected.
8. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the method of identifying a pedestrian attribute as claimed in any one of claims 1 to 6.
9. A storage medium, characterized in that the storage medium has stored therein a computer program, wherein the computer program, when run, is arranged to perform the method of identifying a pedestrian attribute according to any one of claims 1 to 6.
CN202011124766.2A 2020-10-20 2020-10-20 Pedestrian attribute identification method, system, computer equipment and storage medium Active CN112232231B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011124766.2A CN112232231B (en) 2020-10-20 2020-10-20 Pedestrian attribute identification method, system, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011124766.2A CN112232231B (en) 2020-10-20 2020-10-20 Pedestrian attribute identification method, system, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112232231A CN112232231A (en) 2021-01-15
CN112232231B true CN112232231B (en) 2024-02-02

Family

ID=74118148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011124766.2A Active CN112232231B (en) 2020-10-20 2020-10-20 Pedestrian attribute identification method, system, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112232231B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528970A (en) * 2021-02-07 2021-03-19 禾多科技(北京)有限公司 Guideboard detection method, device, equipment and computer readable medium
CN112944611A (en) * 2021-03-19 2021-06-11 珠海格力电器股份有限公司 Control method and device of air conditioner, storage medium and processor
CN113313668B (en) * 2021-04-19 2022-09-27 石家庄铁道大学 Subway tunnel surface disease feature extraction method
CN113158881B (en) * 2021-04-19 2022-06-14 电子科技大学 Cross-domain pedestrian re-identification method based on attention mechanism
CN112906680A (en) * 2021-05-08 2021-06-04 深圳市安软科技股份有限公司 Pedestrian attribute identification method and device and electronic equipment
CN113255759B (en) * 2021-05-20 2023-08-22 广州广电运通金融电子股份有限公司 In-target feature detection system, method and storage medium based on attention mechanism
CN113283343A (en) * 2021-05-26 2021-08-20 上海商汤智能科技有限公司 Crowd positioning method and device, electronic equipment and storage medium
CN113536965B (en) * 2021-06-25 2024-04-09 深圳数联天下智能科技有限公司 Method and related device for training face shielding recognition model
CN113283549B (en) * 2021-07-22 2021-12-03 深圳市安软科技股份有限公司 Training method and system of vehicle attribute recognition model and related equipment
CN113469144B (en) * 2021-08-31 2021-11-09 北京文安智能技术股份有限公司 Video-based pedestrian gender and age identification method and model
CN114092525B (en) * 2022-01-20 2022-05-10 深圳爱莫科技有限公司 Passenger flow attribute analysis method and system based on spatial distribution voting
CN115544259B (en) * 2022-11-29 2023-02-17 城云科技(中国)有限公司 Long text classification preprocessing model and construction method, device and application thereof
CN115862151B (en) * 2023-02-14 2023-05-26 福建中医药大学 Data processing system and method for predicting response capability of old people based on game
CN117274953A (en) * 2023-09-28 2023-12-22 深圳市厚朴科技开发有限公司 Vehicle and pedestrian attribute identification method system, device and medium


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020114118A1 (en) * 2018-12-07 2020-06-11 深圳光启空间技术有限公司 Facial attribute identification method and device, storage medium and processor
CN110516670A (en) * 2019-08-26 2019-11-29 广西师范大学 Suggested based on scene grade and region from the object detection method for paying attention to module

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-label Recognition of Pedestrian Attributes Based on Deep Learning; Li Yapeng; Wan Suiren; Chinese Journal of Biomedical Engineering (Issue 04); full text *
Pedestrian Attribute Recognition Based on Deep Learning; Chen Ping; Yang Hongbo; Information & Communications (Issue 04); full text *

Also Published As

Publication number Publication date
CN112232231A (en) 2021-01-15

Similar Documents

Publication Publication Date Title
CN112232231B (en) Pedestrian attribute identification method, system, computer equipment and storage medium
CN112232232B (en) Target detection method
CN112232237B (en) Method, system, computer device and storage medium for monitoring vehicle flow
CN110175580B (en) Video behavior identification method based on time sequence causal convolutional network
Zhang et al. Multi-resolution attention convolutional neural network for crowd counting
Brust et al. Towards automated visual monitoring of individual gorillas in the wild
Li et al. Domain adaptation from daytime to nighttime: A situation-sensitive vehicle detection and traffic flow parameter estimation framework
Li et al. Adaptive deep convolutional neural networks for scene-specific object detection
Jiang et al. Hyperspectral image classification with spatial consistence using fully convolutional spatial propagation network
Pereira et al. Assessing flood severity from crowdsourced social media photos with deep neural networks
CN111274926B (en) Image data screening method, device, computer equipment and storage medium
Cui et al. Convolutional neural network for recognizing highway traffic congestion
Xia et al. A deep Siamese postclassification fusion network for semantic change detection
Li et al. Weaklier supervised semantic segmentation with only one image level annotation per category
CN112232236B (en) Pedestrian flow monitoring method, system, computer equipment and storage medium
Hoxha et al. Change captioning: A new paradigm for multitemporal remote sensing image analysis
CN113706481A (en) Sperm quality detection method, sperm quality detection device, computer equipment and storage medium
CN111339892B (en) Swimming pool drowning detection method based on end-to-end 3D convolutional neural network
Li et al. Multi-view crowd congestion monitoring system based on an ensemble of convolutional neural network classifiers
Yin et al. Attention-guided siamese networks for change detection in high resolution remote sensing images
Aldhaheri et al. MACC Net: Multi-task attention crowd counting network
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
Rohith et al. Remote sensing signature classification of agriculture detection using deep convolution network models
Ghali et al. CT-Fire: a CNN-Transformer for wildfire classification on ground and aerial images
EP4332910A1 (en) Behavior detection method, electronic device, and computer readable storage medium

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant