WO2023082196A1 - Pedestrian attribute recognition system, training method therefor, and pedestrian attribute recognition method - Google Patents
Pedestrian attribute recognition system, training method therefor, and pedestrian attribute recognition method
- Publication number: WO2023082196A1
- Application: PCT/CN2021/130421
- Authority: WO (WIPO, PCT)
- Prior art keywords: pedestrian; attribute; module; feature information; recognition
Classifications
- G06V10/7715 — Feature extraction, e.g. by transforming the feature space; mappings, e.g. subspace methods
- G06F18/00 — Pattern recognition
- G06N3/04 — Neural networks; architecture, e.g. interconnection topology
- G06V10/44 — Local feature extraction by analysis of parts of the pattern, e.g. edges, contours, loops, corners, strokes or intersections
- G06V10/454 — Integrating filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/771 — Feature selection, e.g. selecting representative features from a multi-dimensional feature space
- G06V10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/806 — Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06V40/103 — Static body considered as a whole, e.g. static pedestrian or occupant recognition
Definitions
- the present disclosure relates to the technical field of intelligent identification, and in particular to a pedestrian attribute identification system, a training method thereof, and a pedestrian attribute identification method.
- Pedestrian Attribute Recognition refers to the use of computers to predict and analyze various attribute information related to pedestrians in images.
- Common pedestrian attribute recognition includes identifying macro attributes such as gender, skin color, age, and body posture of pedestrians, as well as specific character attributes such as backpack type, clothing type and color, pants type and color, and current actions.
- Accurate and efficient pedestrian attribute recognition methods support all kinds of analysis work based on pedestrian images, and pedestrian attribute recognition is being applied ever more widely.
- Pedestrian attribute recognition can be applied to many fields such as video surveillance, smart cities, public safety, and targeted advertising, and it has become an important research direction in the field of computer vision.
- In one aspect, a pedestrian attribute recognition system includes at least one attribute location module, where each attribute location module corresponds to a plurality of pedestrian attributes. The attribute location module includes a space transformation unit and an attribute recognition unit. The space transformation unit is used to extract the feature information in a discriminable area from the feature information input to the space transformation unit, where the discriminable area is related to the plurality of pedestrian attributes corresponding to the attribute location module. The attribute recognition unit is used to obtain the recognition results of the plurality of pedestrian attributes corresponding to the attribute location module according to the feature information in the discriminable area.
- In some embodiments, the space transformation unit is specifically configured to determine the transformation parameters of the discriminable area according to the feature information input to the space transformation unit, and to extract the feature information in the discriminable area according to those transformation parameters; the transformation parameters include scaling transformation parameters in the horizontal direction, scaling transformation parameters in the vertical direction, translation transformation parameters in the horizontal direction, and translation transformation parameters in the vertical direction.
- In some embodiments, the attribute location module also includes a channel attention unit, which is used to calibrate the feature information input to the channel attention unit, obtain calibrated feature information, and use the calibrated feature information as the feature information input to the space transformation unit.
- In some embodiments, the channel attention unit is specifically used to pass the feature information input to the channel attention unit through a global average pooling layer, a 1×1 convolution layer, a ReLU activation layer, a 1×1 convolution layer and a Sigmoid activation layer to obtain a first calibration vector; multiply the first calibration vector and the feature information input to the channel attention unit channel by channel to obtain a second calibration vector; and add the second calibration vector and the feature information input to the channel attention unit element by element to obtain the calibrated feature information.
- In some embodiments, the at least one attribute location module includes a first attribute location module and/or a second attribute location module; the first attribute location module is used to identify a plurality of pedestrian attributes related to human body parts, and the second attribute location module is used to identify a plurality of pedestrian attributes related to the global human body.
- In some embodiments, the first attribute location module includes one or more of a head attribute location module, an upper body attribute location module or a lower body attribute location module; the head attribute location module is used to identify a plurality of pedestrian attributes related to the human head, the upper body attribute location module is used to identify a plurality of pedestrian attributes related to the upper body of the human body, and the lower body attribute location module is used to identify a plurality of pedestrian attributes related to the lower body of the human body.
- In some embodiments, the pedestrian attribute recognition system further includes a feature extraction module; the feature extraction module is used to extract feature information from pedestrian images input into the pedestrian attribute recognition system.
- In some embodiments, the feature extraction module includes P feature extraction layers, where P is an integer greater than 1; the feature extraction module is specifically used to pass the pedestrian image through the P feature extraction layers in turn to extract P pieces of feature information of different levels, one piece of feature information corresponding to each feature extraction layer.
- In some embodiments, the pedestrian attribute recognition system further includes a feature fusion module; the feature fusion module is used to fuse the P pieces of feature information of different levels extracted by the feature extraction module to obtain P pieces of fused feature information.
- In some embodiments, the at least one attribute location module is divided into P groups of attribute location modules, where one group of attribute location modules corresponds to one piece of fused feature information and each group includes K attribute location modules; K is an integer greater than 1 and less than M, and M is an integer greater than 1. A group of attribute location modules is used to output a first pedestrian attribute prediction vector according to the corresponding fused feature information, and the first pedestrian attribute prediction vector includes the recognition results of M pedestrian attributes.
- In some embodiments, the pedestrian attribute recognition system also includes a feature recognition module; the feature recognition module is used to output a second pedestrian attribute prediction vector according to the highest-level feature information extracted by the feature extraction module, and the second pedestrian attribute prediction vector includes the recognition results of the M pedestrian attributes.
- In some embodiments, the pedestrian attribute recognition system also includes a result output module; the result output module is used to determine the final recognition results of the M pedestrian attributes according to the first pedestrian attribute prediction vector output by each group of attribute location modules in the P groups and the second pedestrian attribute prediction vector output by the feature recognition module.
- In another aspect, a training method for the pedestrian attribute recognition system described in any of the above embodiments includes: obtaining a training sample set, where the training sample set includes a plurality of sample pedestrian images and each sample pedestrian image has a corresponding attribute label indicating the pedestrian attributes present in that image; and training the pedestrian attribute recognition system according to the training sample set to obtain a trained pedestrian attribute recognition system.
- In another aspect, a pedestrian attribute recognition method includes: acquiring an image of a pedestrian to be recognized; and inputting the image into a pedestrian attribute recognition system to obtain a recognition result for the image.
- In another aspect, a training device includes: an acquisition module, configured to acquire a training sample set, where the training sample set includes a plurality of sample pedestrian images and each sample pedestrian image has a corresponding attribute label indicating the pedestrian attributes present in that image; and a training module, used to train the pedestrian attribute recognition system according to the training sample set to obtain a trained pedestrian attribute recognition system, where the pedestrian attribute recognition system is the one described in any of the above embodiments.
- In another aspect, an identification device includes: an acquisition module, configured to acquire an image of a pedestrian to be identified; and an identification module, configured to input the image into a pedestrian attribute recognition system and obtain a recognition result for the image, where the pedestrian attribute recognition system is the one described in any of the above embodiments.
- In yet another aspect, a training device includes a memory and a processor coupled to each other; the memory is used to store computer program code including computer instructions, and when the processor executes the computer instructions, the device performs the training method provided in the above embodiments.
- In yet another aspect, an identification device includes a memory and a processor coupled to each other; the memory is used to store computer program code including computer instructions, and when the processor executes the computer instructions, the device performs the pedestrian attribute recognition method provided in the above embodiments.
- In yet another aspect, a non-transitory computer-readable storage medium stores a computer program; when the computer program runs on a training device, it causes the training device to implement the training method provided in the above embodiments, and when it runs on an identification device, it causes the identification device to implement the pedestrian attribute recognition method provided in the above embodiments.
- Fig. 1 is a structural diagram of an ALM according to some embodiments.
- Fig. 2 is another structural diagram of an ALM according to some embodiments.
- Fig. 3 is a structural diagram of a pedestrian attribute recognition system according to some embodiments.
- Fig. 4 is another structural diagram of a pedestrian attribute recognition system according to some embodiments.
- Fig. 5 is another structural diagram of a pedestrian attribute recognition system according to some embodiments.
- Fig. 6 is another structural diagram of a pedestrian attribute recognition system according to some embodiments.
- Fig. 7 is a schematic diagram of a pedestrian attribute recognition process according to some embodiments.
- Fig. 8 is a flowchart of a training method according to some embodiments.
- Fig. 9 is a flowchart of an identification method according to some embodiments.
- Fig. 10 is a block diagram of a training device according to some embodiments.
- Fig. 11 is a block diagram of a training device according to some embodiments.
- Fig. 12 is a structural diagram of an identification device according to some embodiments.
- Fig. 13 is a structural diagram of an identification device according to some embodiments.
- The terms "first" and "second" are used for descriptive purposes only, and cannot be understood as indicating or implying relative importance or implicitly specifying the quantity of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of these features. In the description of the embodiments of the present disclosure, unless otherwise specified, "plurality" means two or more.
- the expressions “coupled” and “connected” and their derivatives may be used.
- the term “connected” may be used in describing some embodiments to indicate that two or more elements are in direct physical or electrical contact with each other.
- the term “coupled” may be used when describing some embodiments to indicate that two or more elements are in direct physical or electrical contact.
- the terms “coupled” or “communicatively coupled” may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other.
- the embodiments disclosed herein are not necessarily limited by the context herein.
- "A and/or B" includes the following three combinations: A only, B only, and a combination of A and B.
- the term “if” is optionally interpreted to mean “when” or “at” or “in response to determining” or “in response to detecting,” depending on the context.
- The phrases "if it is determined that…" or "if [the stated condition or event] is detected" are optionally construed to mean "upon determining…", "in response to determining…", "upon detection of [the stated condition or event]" or "in response to detection of [the stated condition or event]", depending on the context.
- Deep learning (DL): a general term for a class of pattern analysis methods. It enables machines to analyze and learn like humans: by learning the internal laws and representation levels of sample data, a model can then recognize data such as text, images, and sounds.
- Convolutional neural network (CNN): a kind of feedforward neural network with convolution calculations and a deep structure, and one of the representative algorithms of deep learning. Convolutional neural networks can be applied to computer vision tasks such as image classification, object recognition, action recognition, pose estimation, and neural style transfer, and also to natural language processing (NLP).
- a convolutional neural network includes an input layer, a hidden layer, and an output layer.
- the input layer of the convolutional neural network can process multi-dimensional data.
- the input layer can receive the pixel values (three-dimensional array) of the image, that is, the two-dimensional pixel points on the plane and the values of the RGB channels.
- the hidden layer of a convolutional neural network includes one or more convolutional layers, one or more pooling layers, and one or more fully-connected layers.
- the function of the convolutional layer is to extract features from the input data.
- A pooling layer generally follows a convolutional layer: after feature extraction by the convolutional layer, the output data is passed to the pooling layer for feature selection and information filtering.
- Each node of the fully connected layer is connected to all the nodes of the previous layer to combine the acquired features, and the fully connected layer acts as a "classifier" in the entire convolutional neural network.
- the output layer of a convolutional neural network has the same structure and working principle as the output of a traditional feedforward neural network.
- The output layer uses a logistic function or a normalized exponential function (softmax function) to output classification labels, such as people, scenes, and objects.
- In a convolutional neural network for pedestrian attribute recognition, the output layer can be designed to output the pedestrian attributes of pedestrian images.
- Spatial transformer network (STN): one of the most basic models in the field of affine transformation. Through the STN model, the input original image can undergo translation, scaling, rotation and other distortion transformations, so that the original image is converted into a preset form for better recognition.
- the STN model consists of three parts: parameter prediction module, coordinate mapping module and pixel acquisition module.
- the parameter prediction module is used to calculate the affine transformation parameters representing the space transformation between the original image and the transformed image.
- the coordinate mapping module is used to determine the coordinate points of the original image and the coordinate points of the transformed image.
- the pixel acquisition module is used to determine the transformed image.
- The STN model is usually placed at the front of a recognition pipeline to improve classification accuracy.
- The STN model can transform the received image by the following formula (1):

  $$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{pmatrix} s_x & 0 & t_x \\ 0 & s_y & t_y \end{pmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \qquad (1)$$

  where $s_x$, $s_y$ are scaling parameters, $t_x$, $t_y$ are translation parameters, $(x_i^s, y_i^s)$ are the original coordinates of the i-th pixel in the received image, and $(x_i^t, y_i^t)$ are the coordinates of the i-th pixel in the transformed image.
- SE Net (squeeze-and-excitation network): essentially a channel-based attention model that models the dependencies between channels. It can assign a different weight to each feature channel according to its importance, and can adaptively increase or decrease the channel weights for different task purposes. In practice, features carrying a large amount of information can be selectively enhanced through the SE Net model, so that subsequent processing can make full use of them, while useless features are suppressed.
- Activation function: used to give an artificial neural network nonlinear modeling ability.
- Without an activation function, a network can only express a linear mapping, and the entire network is equivalent to a single-layer neural network. Only when activation functions are added does the neural network gain the ability to learn hierarchical nonlinear mappings.
- A linearly separable classification problem can be solved with a linear classifier; in practice, however, the data is often not linearly separable, and an activation function can then be introduced to handle the classification of nonlinear data.
- The sigmoid function is the most widely used type of activation function. It has an exponential shape and, in a physical sense, is the closest to a biological neuron; it is the common S-shaped function in biology, also known as the S-shaped growth curve, and is widely used in logistic regression and artificial neural networks.
- The formula of the sigmoid function is the following formula (2):

  $$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (2)$$
- Rectified linear unit (ReLU): a commonly used activation function in artificial neural networks, usually referring to the nonlinear functions represented by the ramp function and its variants. From a biological point of view, it models the activation of brain neurons receiving signals more accurately.
- The formula of the ramp function is the following formula (3):

  $$f(x) = \max(0, x) \qquad (3)$$
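As a quick illustration (not part of the patent text), formulas (2) and (3) can be written directly as Python functions:

```python
import math

def sigmoid(x: float) -> float:
    # Formula (2): sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

def relu(x: float) -> float:
    # Formula (3): f(x) = max(0, x)
    return max(0.0, x)

print(sigmoid(0.0), relu(-2.0), relu(3.0))  # 0.5 0.0 3.0
```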
- the "integral" algorithm regards the problem of pedestrian attribute recognition as a multi-label classification problem, usually using a convolutional neural network to extract the required features from the entire pedestrian picture, and using a fully connected layer at the top of the network to predict based on the extracted features.
- Pedestrian attributes usually using a convolutional neural network to extract the required features from the entire pedestrian picture, and using a fully connected layer at the top of the network to predict based on the extracted features.
- the "local type” algorithm pays more attention to some local areas in the input image that are important for classification prediction, pre-trains the human pose estimation model to predict the key points of the human body in the input image, and then roughly locates the head of the human body based on these key points , upper body, lower body and other local regions, so as to separate pedestrian images into images of different local regions (such as head region, upper body region and lower body region). Images of different local areas are input into a pre-trained pedestrian recognition attribute model corresponding to the local area, so as to identify pedestrian attributes corresponding to the image of the local area.
- In the related art, one attribute localization module (ALM) corresponds to one pedestrian attribute. The ALM can adaptively locate the local features corresponding to that pedestrian attribute in the pedestrian image and identify the attribute from those local features, thereby improving the accuracy and efficiency of pedestrian attribute recognition.
- Because one ALM provided by the related art corresponds to only one pedestrian attribute, if there are M pedestrian attributes to be identified in the pedestrian image, the pedestrian attribute recognition system of the related art needs to include at least M ALMs. For example, if hairstyle, gender and clothing are to be identified, the system must include at least an ALM for hairstyle recognition, an ALM for gender recognition, and an ALM for clothing recognition. The system therefore contains many ALMs, which leads to a long running time for the whole system.
- an embodiment of the present disclosure provides a pedestrian attribute recognition system, including at least one ALM.
- each ALM corresponds to multiple pedestrian attributes.
- In this way, the number of ALMs required by the pedestrian attribute recognition system can be effectively reduced, reducing the running time of the system and making it efficient enough to be applied across production and everyday-life scenarios.
- In some embodiments, the M pedestrian attributes can be divided into K classes, where one class can include multiple pedestrian attributes, M is an integer greater than 1, and K is an integer greater than 1 and less than M. In the pedestrian attribute recognition system, one ALM then corresponds to one class of pedestrian attributes.
- the M pedestrian attributes can be divided into pedestrian attributes related to human body parts and pedestrian attributes related to the global human body.
- Pedestrian attributes related to human body parts may include whether glasses are worn, hairstyle, clothing, and the like. Pedestrian attributes related to the global human body may include age, height, gender, etc.
- the pedestrian attribute recognition system may include a first ALM and a second ALM. Wherein, the first ALM is used to identify multiple pedestrian attributes related to human body parts. The second ALM is used to identify multiple pedestrian attributes globally related to the human body.
- pedestrian attributes related to human body parts can be further subdivided, for example, can be divided into: pedestrian attributes related to head, pedestrian attributes related to upper body and/or pedestrian attributes related to lower body.
- the pedestrian attributes related to the head may include whether to wear glasses, hairstyle and so on.
- Pedestrian attributes related to the upper body can include the type of clothing on the upper body, whether to wear a backpack, etc.
- Pedestrian attributes related to the lower body may include the type of lower body clothing, the type of boots, and the like.
- the first ALM may include one or more items of the head ALM, the upper body ALM, and the lower body ALM.
- the head ALM is used to identify multiple pedestrian attributes related to the human head.
- the upper body ALM is used to identify multiple pedestrian attributes related to the upper body of the human body.
- the lower body ALM is used to identify multiple pedestrian attributes related to the lower body of the human body.
- the pedestrian attributes related to body parts can be further refined, for example, can also be divided into: pedestrian attributes related to head, pedestrian attributes related to hands, pedestrian attributes related to torso and/or pedestrian attributes related to legs.
- the first ALM may include one or more items of the head ALM, the hand ALM, the torso ALM or the leg ALM.
- the head ALM is used to identify multiple pedestrian attributes related to the human head.
- Hand ALM is used to identify multiple pedestrian attributes related to human hands.
- the torso part ALM is used to identify multiple pedestrian attributes related to the human torso part.
- Leg ALM is used to identify multiple pedestrian attributes related to human legs.
- the M pedestrian attributes can be divided into clothing-related pedestrian attributes, action-related pedestrian attributes, body appearance-related pedestrian attributes, and the like.
- pedestrian attributes related to clothing may include type of upper body clothing, type of lower body clothing, type of shoes, color of upper body clothing, color of lower body clothing, color of shoes, and the like.
- Action-related attributes of pedestrians may include actions such as running, jumping, and walking.
- Appearance-related attributes of pedestrians can include height, age, gender, etc.
- the pedestrian attribute recognition system provided by the embodiments of the present disclosure may include a clothing attribute locating module, a body appearance attribute locating module, an action attribute locating module, and the like.
- the clothing attribute positioning module is used to identify multiple pedestrian attributes related to clothing.
- the body appearance attribute location module is used to identify multiple pedestrian attributes related to body appearance.
- the action attribute localization module is used to identify multiple pedestrian attributes related to actions.
- the division of M pedestrian attributes may be implemented using a clustering algorithm or a deep learning method. Therefore, multiple pedestrian attributes with implicit associations can be classified into the same class. In this way, even if one ALM is used to identify a category of pedestrian attributes, the purpose of reducing the amount of computation and improving the recognition speed can be achieved while ensuring the recognition accuracy.
- the ALM provided by the embodiments of the present disclosure includes a space transformation unit and an attribute recognition unit.
- the space transformation unit is connected to the attribute recognition unit, and the output of the space transformation unit is the input of the attribute recognition unit.
- the space transformation unit is configured to extract feature information in the discriminable region from the feature information input to the space transformation unit.
- the discriminable area is related to multiple pedestrian attributes corresponding to ALM.
- the attribute identification unit is configured to output identification results of a plurality of pedestrian attributes corresponding to the ALM according to the feature information in the discriminable area.
- the identification result of the attribute of the pedestrian may be a predicted probability value of the attribute of the pedestrian.
- the predicted probability value of the pedestrian attribute can be the probability value of the presence of the pedestrian attribute in the pedestrian image.
- the predicted probability value of the pedestrian attribute A is 65%, which means that there is a 65% possibility that the pedestrian attribute A exists in the pedestrian image.
- the identification result of pedestrian attributes may be the label value of pedestrian attributes.
- For example, a label value of 1 for pedestrian attribute A indicates that pedestrian attribute A exists in the pedestrian image, and a label value of 0 indicates that pedestrian attribute A does not exist in the pedestrian image. It should be understood that the meaning represented by the label value of a pedestrian attribute may be determined according to the actual situation, which is not limited in the embodiments of the present disclosure.
- For example, the attribute recognition unit can determine the predicted probability value of each of the multiple pedestrian attributes corresponding to the ALM according to the feature information in the discriminable area; after that, the predicted probability value of each pedestrian attribute is compared with the probability threshold corresponding to that attribute, and the label value of each pedestrian attribute is determined based on the comparison result.
- For example, the attribute recognition unit can determine the predicted probability value of pedestrian attribute A as 65% according to the feature information in the discriminable area; it then compares this 65% predicted probability value with the corresponding 50% probability threshold, determines that the predicted probability value is greater than the threshold, and thus sets the label value of pedestrian attribute A to the value indicating the presence of pedestrian attribute A in the pedestrian image.
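As an illustration of this probability-to-label step, here is a minimal Python sketch; the function name and the dictionary-based interface are assumptions for illustration, not part of the patent:

```python
def labels_from_probabilities(probs: dict[str, float],
                              thresholds: dict[str, float]) -> dict[str, int]:
    # A label value of 1 indicates the attribute is predicted present, 0 absent
    # (the patent notes the meaning of label values may be defined differently).
    return {name: int(p >= thresholds[name]) for name, p in probs.items()}

# Worked example from the text: attribute A predicted at 65% with a 50% threshold,
# attribute B at 40% with a 65% threshold.
print(labels_from_probabilities({"A": 0.65, "B": 0.40},
                                {"A": 0.50, "B": 0.65}))  # {'A': 1, 'B': 0}
```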
- the probability thresholds corresponding to different pedestrian attributes may be different.
- the probability threshold corresponding to pedestrian attribute A is 50%
- the probability threshold corresponding to pedestrian attribute B is 65%.
- the probability threshold corresponding to the attributes of pedestrians can be determined according to algorithms such as deep learning, which will not be described in detail here.
- The spatial transformation unit can spatially transform various deformed data, converting the input into the form expected by the next network layer, and can also be trained to automatically select the region features of interest during recognition (that is, the feature information in the discriminable area mentioned above).
- The discriminable area is the semantic area of the multiple pedestrian attributes in the pedestrian image, and the feature information in the discriminable area is the feature information useful for identifying those attributes. Therefore, compared with algorithms that identify pedestrian attributes from global features, the ALM in the embodiments of the present disclosure can identify pedestrian attributes from local features (that is, the feature information in the discriminable area), thereby reducing the amount of calculation and improving the accuracy and efficiency of pedestrian attribute recognition.
- the spatial transformation unit may adopt STN technology.
- The space transformation unit is specifically used to determine the transformation parameters of the discriminable area according to the feature information input to the space transformation unit, and to extract the feature information in the discriminable area from that input according to the transformation parameters.
- The transformation parameters include scaling transformation parameters in the horizontal direction, scaling transformation parameters in the vertical direction, translation transformation parameters in the horizontal direction, and translation transformation parameters in the vertical direction.
- Together, the horizontal and vertical scaling transformation parameters and the horizontal and vertical translation transformation parameters determine a rectangular bounding box, that is, the boundary of the discriminable area.
- the spatial transformation unit using the STN technology may include a first fully connected layer and a sampler.
- The feature information input to the space transformation unit passes through the first fully connected layer to obtain the transformation parameters; after that, the matrix R formed by the transformation parameters and the feature information input to the space transformation unit pass through the sampler to obtain the feature information in the discriminable area.
- For example, the sampler performs a Kronecker product operation on the matrix R and the feature information to obtain the feature information in the discriminable area.
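For concreteness, here is a minimal PyTorch sketch of such a space transformation unit. It predicts the four transformation parameters with one fully connected layer and then samples the discriminable area; the sampling uses the standard `affine_grid`/`grid_sample` pair rather than the Kronecker-product formulation described above, and the identity initialization is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpaceTransformationUnit(nn.Module):
    """STN-style unit: one fully connected layer predicts (s_x, s_y, t_x, t_y),
    then a sampler extracts the discriminable area from the input features."""

    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        self.fc = nn.Linear(channels * height * width, 4)
        # Start from the identity transform: s_x = s_y = 1, t_x = t_y = 0,
        # i.e. "use the whole feature map" before training refines the region.
        nn.init.zeros_(self.fc.weight)
        self.fc.bias.data = torch.tensor([1.0, 1.0, 0.0, 0.0])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = x.size(0)
        sx, sy, tx, ty = self.fc(x.flatten(1)).unbind(dim=1)
        zero = torch.zeros_like(sx)
        # 2x3 affine matrix per formula (1): scaling on the diagonal,
        # translation in the last column, no rotation or shear.
        theta = torch.stack([sx, zero, tx, zero, sy, ty], dim=1).view(n, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```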
- the attribute recognition unit can be constructed with a second fully connected layer and a classification function (not shown in FIG. 1 ).
- the number of output neurons in the second fully connected layer is the same as the number of pedestrian attributes corresponding to the attribute recognition unit.
- The attribute recognition unit is specifically used to input the feature information in the discriminable area into the second fully connected layer to obtain the feature vector output by the second fully connected layer, and then to input that feature vector into the classification function to obtain the recognition results of the multiple pedestrian attributes output by the classification function.
- When a pedestrian attribute is a binary attribute, the classification function may be a sigmoid function; when a pedestrian attribute is a multi-class attribute, the classification function may be a softmax function.
- The pedestrian attributes in the embodiments of the present disclosure are described as binary pedestrian attributes; it should be understood that a multi-class pedestrian attribute can also be converted into multiple binary pedestrian attributes for processing.
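For illustration, a minimal PyTorch sketch of such an attribute recognition unit for binary attributes; average-pooling the region features into a vector before the fully connected layer is an assumption here (the text does not say how the region features are flattened):

```python
import torch
import torch.nn as nn

class AttributeRecognitionUnit(nn.Module):
    """Second fully connected layer plus a sigmoid classification function,
    with one output neuron per binary pedestrian attribute of this ALM."""

    def __init__(self, in_features: int, num_attributes: int):
        super().__init__()
        self.fc = nn.Linear(in_features, num_attributes)

    def forward(self, region_features: torch.Tensor) -> torch.Tensor:
        # Average-pool the (N, C, H, W) region features to (N, C), then classify.
        pooled = region_features.mean(dim=(2, 3))
        return torch.sigmoid(self.fc(pooled))  # predicted probability per attribute
```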
- the ALM provided by the embodiments of the present disclosure may further include a channel attention unit.
- the channel attention unit is connected with the spatial transformation unit.
- the output of the channel attention unit is the input of the spatial transformation unit.
- the channel attention unit is configured to use the channel attention mechanism to calibrate the feature information input to the channel attention unit to obtain calibrated feature information.
- the attention mechanism is a mechanism that focuses on local information, for example, a certain image area in an image. But as the purpose of the task changes, the area of attention tends to change as well.
- Closely related to the attention mechanism is salient object detection: its input is an image and its output is a probability map, where a higher probability indicates a higher likelihood that an important target is present at that location in the image.
- One dimension of feature information is the scale space of the image, that is, its length and width; the other dimension is the feature dimension of the image, i.e. the feature channels. The channel attention mechanism can therefore learn the importance of each feature channel automatically, and then use this importance to enhance useful features and suppress features that are of little use to the current task (that is, to calibrate the feature information).
- The ALM adds a channel attention unit, which calibrates the feature information input to the space transformation unit, so as to enhance the part of the feature information that is useful for identifying the multiple pedestrian attributes corresponding to the ALM and suppress the part that is useless for identifying them, thereby improving the accuracy of pedestrian attribute recognition.
- For example, the channel attention unit can adopt SE Net.
- The channel attention unit may include a global average pooling layer, a 1×1 convolutional layer, a ReLU activation layer, a 1×1 convolutional layer, a Sigmoid activation layer, a multiplier, and an adder.
- The channel attention unit is specifically used to pass the feature information input to the channel attention unit through the global average pooling layer, 1×1 convolution layer, ReLU activation layer, 1×1 convolution layer and Sigmoid activation layer to obtain the first calibration vector; multiply the feature information input to the channel attention unit by the first calibration vector channel by channel to obtain the second calibration vector; and add the feature information input to the channel attention unit to the second calibration vector element by element to output the calibrated feature information.
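This calibration pipeline maps directly onto an SE-style block. A minimal PyTorch sketch follows; the channel reduction ratio of 16 is a conventional SE Net default and an assumption here, not a value from the patent:

```python
import torch
import torch.nn as nn

class ChannelAttentionUnit(nn.Module):
    """GAP -> 1x1 conv -> ReLU -> 1x1 conv -> Sigmoid, then channel-wise
    multiplication and element-wise addition, as described above."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # global average pooling
            nn.Conv2d(channels, channels // reduction, 1),  # 1x1 convolution
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # 1x1 convolution
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        first_calibration = self.gate(x)             # one weight per feature channel
        second_calibration = x * first_calibration   # channel-by-channel multiply
        return x + second_calibration                # element-by-element addition
```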
- the pedestrian attribute recognition system provided in the embodiments of the present disclosure further includes a feature extraction module.
- the feature extraction module is used to extract feature information from pedestrian images input into the pedestrian attribute recognition system.
- the feature extraction module may include P feature extraction layers, where P is a positive integer greater than 1.
- the feature extraction module is specifically configured to sequentially pass the pedestrian image through P feature extraction layers to extract P feature information of different levels, and one feature information corresponds to one feature extraction layer.
- different feature extraction layers are used to extract feature information of different levels. Wherein, the level of feature information extracted by the feature extraction layer closer to the input of the pedestrian attribute recognition system is lower, and the level of feature information extracted by the feature extraction layer closer to the output of the pedestrian attribute recognition system is higher.
- For example, the feature extraction module may adopt a batch normalization (BN)-inception architecture, or other CNN architectures.
- each feature extraction layer in the feature extraction module may include at least one inception block.
- For the specific structure of the inception block, reference may be made to the related art, which will not be repeated here.
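To make the multi-level extraction concrete, here is a hedged sketch of a P = 3 feature extraction module; plain convolution stages stand in for the inception blocks, and the channel count is illustrative:

```python
import torch
import torch.nn as nn

class FeatureExtractionModule(nn.Module):
    """Backbone with P = 3 feature extraction layers; each stage halves the
    spatial resolution and returns its feature map."""

    def __init__(self, channels: int = 256):
        super().__init__()
        def stage(c_in: int, c_out: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )
        self.stage1 = stage(3, channels)
        self.stage2 = stage(channels, channels)
        self.stage3 = stage(channels, channels)

    def forward(self, image: torch.Tensor) -> list[torch.Tensor]:
        x1 = self.stage1(image)  # lowest-level feature information
        x2 = self.stage2(x1)
        x3 = self.stage3(x2)     # highest-level feature information
        return [x1, x2, x3]      # one feature map per extraction layer
```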
- the feature information extracted by the feature extraction module can be directly used as feature information input to the ALM. Based on this, as shown in FIG. 3 , when the feature extraction module includes P feature extraction layers, at least one ALM included in the pedestrian attribute recognition system can be divided into P groups of ALMs, and each group of ALMs includes multiple ALMs. A set of ALMs corresponds to a feature extraction layer.
- the pedestrian attribute recognition system includes 12 ALMs, and the 12 ALMs can be divided into 3 groups.
- the first group of ALMs includes ALM1-1, ALM1-2, ALM1-3 and ALM1-4.
- the second group of ALMs includes ALM2-1, ALM2-2, ALM2-3, and ALM2-4.
- The third group of ALMs includes ALM3-1, ALM3-2, ALM3-3 and ALM3-4.
- ALM1-1, ALM2-1 and ALM3-1 may all be ALMs for identifying multiple pedestrian attributes related to human heads.
- ALM1-2, ALM2-2 and ALM3-2 may all be ALMs for identifying multiple pedestrian attributes related to the upper body of a human body.
- ALM1-3, ALM2-3, and ALM3-3 may all be ALMs for identifying multiple pedestrian attributes related to the lower body of a human body.
- ALM1-4, ALM2-4, and ALM3-4 can all be ALMs for identifying multiple pedestrian attributes globally related to the human body.
- The feature information input to a group of ALMs can be the feature information extracted by the feature extraction layer corresponding to that group of ALMs.
- the group of ALMs is used to output a first pedestrian attribute prediction vector, and the first pedestrian attribute prediction vector includes recognition results of M pedestrian attributes of the pedestrian image.
- In other embodiments, the feature information extracted by the feature extraction module first undergoes a series of processing (such as fusion processing), and the processed feature information is then used as the feature information input to the ALMs.
- the pedestrian attribute recognition system provided by the embodiments of the present disclosure further includes a feature fusion module.
- the feature fusion module is used to fuse the P pieces of feature information of different levels extracted by the feature extraction module, and output the P pieces of feature information after fusion processing.
- the feature extraction module and the feature fusion module may adopt a feature pyramid network architecture.
- The feature fusion module is specifically used to directly use the feature information extracted by the P-th feature extraction layer as the P-th fused feature information; for the remaining P−1 layers, the feature information extracted by the i-th feature extraction layer is fused with the (i+1)-th fused feature information to obtain the i-th fused feature information, where i is an integer greater than or equal to 1 and less than or equal to P−1.
- The fusion processing may include the following operations: upsampling the (i+1)-th fused feature information to obtain upsampled feature information, and then splicing the upsampled feature information with the feature information extracted by the i-th feature extraction layer according to the number of channels, to obtain the i-th fused feature information.
- The magnification factor used for upsampling mainly depends on the resolution of the feature information extracted by the i-th feature extraction layer and that of the (i+1)-th layer. For example, if the resolution of the feature information extracted by the i-th feature extraction layer is 16×8 and that of the (i+1)-th layer is 8×4, the upsampling magnification factor is 2.
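Under those definitions, the top-down fusion can be sketched as follows (a minimal illustration, assuming nearest-neighbor upsampling; the patent does not specify the interpolation mode, and a real system would typically also rein in the growing channel count with 1×1 convolutions):

```python
import torch
import torch.nn.functional as F

def fuse_features(extracted: list[torch.Tensor]) -> list[torch.Tensor]:
    """Y_P = X_P; Y_i = concat(upsample(Y_{i+1}), X_i) along the channel axis."""
    fused: list[torch.Tensor] = [None] * len(extracted)
    fused[-1] = extracted[-1]  # the P-th feature map is used directly
    for i in range(len(extracted) - 2, -1, -1):
        upsampled = F.interpolate(fused[i + 1],
                                  size=extracted[i].shape[2:],  # e.g. 8x4 -> 16x8
                                  mode="nearest")
        fused[i] = torch.cat([upsampled, extracted[i]], dim=1)  # splice by channels
    return fused
```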
- At least one ALM included in the pedestrian attribute recognition system can be divided into P groups of ALMs, each group of ALMs includes multiple ALMs.
- the feature information input into the i-th group of ALMs is the feature information after the i-th fusion process.
- the group of ALMs is used to output a first pedestrian attribute prediction vector, and the first pedestrian attribute prediction vector includes recognition results of M pedestrian attributes of the pedestrian image.
- low-level feature information has relatively rich detail information
- high-level feature information has relatively rich semantic information.
- Low-level feature information and high-level feature information are complementary.
- the feature information input to ALM is the feature information after fusion processing, which is beneficial for ALM to use the advantages of high-level feature information and low-level feature information to improve the accuracy of pedestrian attribute recognition.
- the pedestrian attribute recognition system may further include a feature recognition module.
- the feature recognition module is connected with the feature extraction module.
- the feature recognition module is used to obtain the second pedestrian attribute prediction vector according to the feature information of the highest level extracted by the feature extraction module.
- the second pedestrian attribute prediction vector includes recognition results of M pedestrian attributes of the pedestrian image.
- the feature recognition module can be constructed with a third fully connected layer and a classification function.
- The feature recognition module is specifically used to input the highest-level feature information extracted by the feature extraction module into the third fully connected layer to obtain the feature vector output by the third fully connected layer, and then to input that feature vector into the classification function to obtain the second pedestrian attribute prediction vector output by the classification function.
- the feature recognition module uses global features to identify pedestrian attributes, while the ALM uses local features to identify pedestrian attributes.
- the pedestrian attribute recognition system provided by the embodiments of the present disclosure can make full use of the advantages of the two recognition methods to improve the accuracy of pedestrian attribute recognition.
- the pedestrian attribute recognition system provided by the embodiment of the present disclosure may also include the result output module.
- the result output module is specifically used to output the final recognition results of M pedestrian attributes in the pedestrian image according to the first pedestrian attribute prediction vector output by each group of ALMs and the second pedestrian attribute prediction vector output by the feature recognition module.
- the final recognition result of pedestrian attributes may be the final predicted probability value of pedestrian attributes.
- the final identification result of pedestrian attributes may be the final label value of pedestrian attributes.
- The following takes the case where the recognition results contained in the prediction vectors are predicted probability values of pedestrian attributes, and the final recognition result is the final label value, as an example to illustrate the processing flow of the result output module.
- For a target pedestrian attribute, the result output module selects the largest value from among the predicted probability values of that attribute contained in each first pedestrian attribute prediction vector and in the second pedestrian attribute prediction vector, and uses it as the final predicted probability value of the target pedestrian attribute.
- the attribute of the target pedestrian may be any one of the M pedestrian attributes that need to be identified in the pedestrian image.
- The result output module then judges whether the final predicted probability value of the target pedestrian attribute is greater than or equal to the probability threshold corresponding to that attribute. If it is, the result output module sets the final label value of the target pedestrian attribute to the value indicating that the attribute is present in the pedestrian image; otherwise, it sets the final label value to the value indicating that the attribute is absent.
- the pedestrian attribute recognition system includes 3 groups of ALMs and a feature recognition module.
- for pedestrian attribute A: the predicted probability value of attribute A contained in the first pedestrian attribute prediction vector output by the first group of ALMs is 60%; that output by the second group of ALMs is 62%; that output by the third group of ALMs is 65%; and the predicted probability value of attribute A contained in the second pedestrian attribute prediction vector output by the feature recognition module is 40%. Based on this, the result output module can determine that the final predicted probability value of pedestrian attribute A is 65%. Assuming the probability threshold corresponding to attribute A is 50%, the result output module determines that the pedestrian image contains pedestrian attribute A.
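A compact sketch of this probability branch of the result output module follows, reproducing the numbers from the example above. The function name and the NumPy representation are choices made for this illustration, not part of the disclosure.

```python
import numpy as np

def fuse_probabilities(prediction_vectors, thresholds):
    """For each attribute, keep the largest predicted probability across all
    prediction vectors, then compare it with that attribute's probability
    threshold to obtain the final label value."""
    probs = np.max(np.stack(prediction_vectors), axis=0)   # final predicted probabilities, (M,)
    labels = (probs >= thresholds).astype(int)             # 1 = attribute present
    return probs, labels

# Attribute A from the example: 60%, 62%, 65% from the three ALM groups,
# 40% from the feature recognition module, threshold 50%.
probs, labels = fuse_probabilities(
    [np.array([0.60]), np.array([0.62]), np.array([0.65]), np.array([0.40])],
    thresholds=np.array([0.50]),
)
print(probs, labels)  # [0.65] [1]
```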
- the following illustrates the processing flow of the result output module by taking, as an example, the case where the recognition result contained in each pedestrian attribute prediction vector is the label value of a pedestrian attribute and the final recognition result is the final label value of the pedestrian attribute.
- for a target pedestrian attribute, the result output module counts, according to the label value of the target pedestrian attribute in each first pedestrian attribute prediction vector and the label value of the target pedestrian attribute in the second pedestrian attribute prediction vector, the number of times each label value of the target pedestrian attribute occurs. The result output module then selects the label value with the most occurrences as the final label value of the target pedestrian attribute.
- the pedestrian attribute recognition system includes 3 groups of ALMs and a feature recognition module.
- for pedestrian attribute A: the label value of attribute A contained in the first pedestrian attribute prediction vector output by the first group of ALMs is 1; that output by the second group of ALMs is 1; that output by the third group of ALMs is 1; and the label value of attribute A contained in the second pedestrian attribute prediction vector output by the feature recognition module is 0. Therefore, the result output module can determine that the final label value of pedestrian attribute A is 1.
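The label branch can be sketched the same way; majority voting over binary label values is shown below, with ties resolved toward 0 as an assumption of this sketch (tie handling is not specified in the text).

```python
import numpy as np

def fuse_labels(label_vectors):
    """Count, per attribute, how often each label value occurs across all
    prediction vectors and keep the most frequent one."""
    stacked = np.stack(label_vectors)            # (num_vectors, M)
    ones = stacked.sum(axis=0)                   # occurrences of label value 1
    return (ones * 2 > stacked.shape[0]).astype(int)

# Attribute A from the example: three ALM groups output 1, the feature
# recognition module outputs 0, so the final label value is 1.
print(fuse_labels([np.array([1]), np.array([1]), np.array([1]), np.array([0])]))  # [1]
```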
- the pedestrian attribute recognition system may include not only attribute location modules for recognizing multiple pedestrian attributes, but also attribute location modules for recognizing a single pedestrian attribute; this is not limited here.
- the pedestrian attribute recognition system provided by the embodiments of the present disclosure may take as input a pedestrian image with a preset resolution.
- the preset resolution may be 256 ⁇ 128.
- a pedestrian image with the preset resolution passes through the three feature extraction layers of the feature extraction module in turn to obtain three different levels of feature information. Exemplarily, the feature information output by the first feature extraction layer may have a resolution of 32×16, the feature information output by the second feature extraction layer may have a resolution of 16×8, and the feature information output by the third feature extraction layer may have a resolution of 8×4. The number of channels of all three levels of feature information may be 256.
- after processing by the feature fusion module, three fused pieces of feature information X1, X2 and X3 are obtained. X3 is the feature information output by the third feature extraction layer; X2 is obtained by splicing the upsampled X3 with the feature information output by the second feature extraction layer according to the number of channels; X1 is obtained by splicing the upsampled X2 with the feature information output by the first feature extraction layer according to the number of channels.
- the resolution of feature information X3 is 8×4, and the number of channels is 256; the resolution of feature information X2 is 16×8, and the number of channels is 512; the resolution of feature information X1 is 32×16, and the number of channels is 768.
- the feature information X1 is input into the first group of ALMs to obtain a first pedestrian attribute prediction vector; the feature information X2 is input into the second group of ALMs to obtain a first pedestrian attribute prediction vector; the feature information X3 is input into the third group of ALMs to obtain a first pedestrian attribute prediction vector.
- the three groups of ALMs all include K ALMs.
- each of the above first pedestrian attribute prediction vectors includes the recognition results of M pedestrian attributes.
- the pedestrian attribute recognition system inputs the three first pedestrian attribute prediction vectors, together with the second pedestrian attribute prediction vector output by the feature recognition module, into the result output module to obtain the recognition result of the pedestrian image, that is, the final recognition results of the M pedestrian attributes.
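The fusion flow of this working example can be sketched as follows; the tensor shapes follow the resolutions and channel counts given above, while the use of nearest-neighbour upsampling is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def fuse(f1, f2, f3):
    """f1, f2, f3: outputs of the three feature extraction layers, each with
    256 channels at 32x16, 16x8 and 8x4 respectively (N, C, H, W)."""
    x3 = f3                                                           # 8x4, 256 channels
    x2 = torch.cat([F.interpolate(x3, scale_factor=2.0), f2], dim=1)  # 16x8, 512 channels
    x1 = torch.cat([F.interpolate(x2, scale_factor=2.0), f1], dim=1)  # 32x16, 768 channels
    return x1, x2, x3

f1 = torch.randn(1, 256, 32, 16)
f2 = torch.randn(1, 256, 16, 8)
f3 = torch.randn(1, 256, 8, 4)
x1, x2, x3 = fuse(f1, f2, f3)
print(x1.shape, x2.shape, x3.shape)
# torch.Size([1, 768, 32, 16]) torch.Size([1, 512, 16, 8]) torch.Size([1, 256, 8, 4])
```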
- the above description of the pedestrian attribute recognition system is merely exemplary; the pedestrian attribute recognition system may include more or fewer modules. Moreover, some modules in the above pedestrian attribute recognition system can be integrated together, and some modules can be further divided into more modules.
- the pedestrian attribute recognition system provided by the embodiments of the present disclosure may be implemented by software, hardware, or a combination of software and hardware.
- in the case where the pedestrian attribute recognition system is implemented solely in software, it can be called a pedestrian attribute recognition model.
- in the case where the pedestrian attribute recognition system is implemented in hardware, or in a combination of hardware and software, it can be implemented by a processor.
- the processor can be a general-purpose logic operation device with data processing capability and/or program execution capability, such as a central processing unit (CPU), a graphics processing unit (GPU) or a microcontroller unit (MCU); the processor executes the computer instructions of the corresponding functions to realize those functions.
- Computer instructions include one or more processor operations defined by an instruction set architecture corresponding to the processor, and these computer instructions may be logically embodied and represented by one or more computer programs.
- the processor may be a hardware entity that can be programmed and adjusted to perform corresponding functions, such as a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
- the processor may be a hardware circuit specially designed to perform corresponding functions, such as a tensor processing unit (TPU) or a neural-network processing unit (NPU).
- an embodiment of the present disclosure provides a training method for training the aforementioned pedestrian attribute recognition system.
- the method includes the following steps:
- the training device acquires a training sample set.
- the training sample set includes at least one sample pedestrian image.
- Each sample pedestrian image is pre-set with the corresponding attribute label.
- optionally, the attribute label may take the form y = [y1, y2, …, yM], where M is the number of pedestrian attributes that need to be recognized, and ym ∈ {0, 1} for m ∈ {1, 2, …, M}.
- ym = 0 indicates that the pedestrian image does not contain the m-th pedestrian attribute.
- ym = 1 indicates that the pedestrian image contains the m-th pedestrian attribute.
- Table 1 shows a specific example of the pedestrian attributes that need to be recognized in pedestrian images. It should be understood that, in actual use, the pedestrian attributes that need to be recognized can be more or fewer than the examples in Table 1.

Table 1

No. | Pedestrian attribute
---|---
1 | Black hair
2 | Blue hair
3 | Brown hair
4 | White hair
5 | Male
6 | Female
7 | Wearing glasses
8 | T-shirt on the upper body
9 | Jeans on the lower body
10 | Carrying a backpack

- with reference to Table 1, if the attribute label of a pedestrian image is [1,0,0,0,1,0,1,1,1,1], the pedestrian attributes contained in the pedestrian image are: black hair, male, wearing glasses, a T-shirt on the upper body, jeans on the lower body, and carrying a backpack.
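Decoding such a label vector is straightforward; the snippet below uses the Table 1 ordering, with the attribute names rendered in English.

```python
ATTRIBUTES = [
    "black hair", "blue hair", "brown hair", "white hair", "male",
    "female", "wearing glasses", "T-shirt on the upper body",
    "jeans on the lower body", "carrying a backpack",
]

label = [1, 0, 0, 0, 1, 0, 1, 1, 1, 1]
present = [name for name, y in zip(ATTRIBUTES, label) if y == 1]
print(present)
# ['black hair', 'male', 'wearing glasses', 'T-shirt on the upper body',
#  'jeans on the lower body', 'carrying a backpack']
```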
- the training device trains the pedestrian attribute recognition system according to the training sample set, so as to obtain a trained pedestrian attribute recognition system.
- the training device can input each sample pedestrian image in the training sample set into the pedestrian attribute recognition system, so as to train each module (such as the ALMs, the feature extraction module and the feature recognition module) in the pedestrian attribute recognition system.
- specifically, the training device may determine the loss value corresponding to the pedestrian attribute recognition system by using a preset loss function, according to the recognition results of the pedestrian attribute recognition system on the sample pedestrian images and the attribute labels of the sample pedestrian images. Afterwards, the training device uses a gradient descent algorithm to update the pedestrian attribute recognition system according to its loss value. It should be understood that updating the pedestrian attribute recognition system specifically refers to updating the parameters (such as weight values and bias values) in the pedestrian attribute recognition system.
- the preset loss function may be a binary cross-entropy loss function.
- with reference to the working example above, the loss value of the pedestrian attribute recognition system may be the sum of the loss values corresponding to the groups of ALMs and the loss value corresponding to the feature recognition module, that is, L = L1 + L2 + L3 + L4, where L1, L2 and L3 are the loss values corresponding to the first, second and third groups of ALMs, and L4 is the loss value corresponding to the feature recognition module.
- L1, L2, L3, and L4 can all satisfy a weighted binary cross-entropy loss, for example of the form:

  Li = −Σ (m = 1 … M) ωm · [ ym·log(pm) + (1 − ym)·log(1 − pm) ]

- M is the number of pedestrian attributes that need to be recognized.
- ym is the label value of the mth pedestrian attribute.
- pm is the predicted probability value of the mth pedestrian attribute.
- ωm is the weight of the mth pedestrian attribute.
- σ is a preset parameter (used in computing the weights ωm).
- after the pedestrian attribute recognition system has been iteratively trained to convergence, the training device can obtain the trained pedestrian attribute recognition system.
- the training device may also verify the trained pedestrian attribute recognition system, so as to avoid overfitting of the trained pedestrian attribute recognition system.
- based on the above, the training process only requires the attribute labels of the sample pedestrian images, without annotating the regions corresponding to pedestrian attributes in the sample pedestrian images. That is, the embodiments of the present disclosure can complete the training of the pedestrian attribute recognition system in a weakly supervised manner, reducing the complexity of the training.
- the trained pedestrian attribute recognition system includes at least one attribute location module, and one attribute location module can correspond to multiple pedestrian attributes. That is, one attribute location module can determine the recognition results of multiple pedestrian attributes through one calculation. This can effectively reduce the overall calculation amount of the pedestrian attribute recognition system, thereby reducing the time consumed to obtain the recognition result corresponding to the pedestrian image.
- an embodiment of the present disclosure provides a pedestrian attribute recognition method based on the aforementioned pedestrian attribute recognition system; the method includes the following steps:
- the recognition device acquires a pedestrian image to be recognized.
- the pedestrian image to be recognized may be a frame image in a video.
- the video may be a surveillance video captured by a security camera.
- the recognition device inputs the pedestrian image to be recognized into the pedestrian attribute recognition system, and obtains a recognition result corresponding to the pedestrian image to be recognized.
- the recognition result corresponding to the pedestrian image is used to indicate the predicted probability values of M pedestrian attributes in the pedestrian image. It should be understood that, in the case that the pedestrian attribute recognition system includes a result output module, the predicted probability value indicated by the recognition result is the final predicted probability value output by the result output module.
- the recognition result corresponding to the pedestrian image is used to indicate the pedestrian attributes existing in the pedestrian image.
- the recognition result corresponding to the pedestrian image includes the label values of M pedestrian attributes. It should be understood that, in the case where the pedestrian attribute recognition system includes a result output module, the label value of each pedestrian attribute included in the recognition result is the final label value output by the result output module.
- the pedestrian image to be recognized can first be preprocessed, so that the preprocessed pedestrian image meets the input requirements of the pedestrian attribute recognition system (such as the required image size); after that, the preprocessed pedestrian image is input into the pedestrian attribute recognition system.
- the preprocessing may include size normalization processing and the like.
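Preprocessing can be as simple as the following sketch; the 256×128 target resolution matches the preset input resolution used earlier, while the scaling to [0, 1] and the channel layout are assumptions of this illustration.

```python
import numpy as np
import cv2  # OpenCV is an illustrative choice for the resizing step

def preprocess(image: np.ndarray) -> np.ndarray:
    """Size-normalize an H x W x 3 image to the system's preset input
    resolution (height 256, width 128) and scale pixel values to [0, 1]."""
    resized = cv2.resize(image, (128, 256))   # cv2 expects (width, height)
    chw = resized.transpose(2, 0, 1)          # channels-first for CNN input
    return chw.astype(np.float32) / 255.0

frame = np.random.randint(0, 256, size=(480, 270, 3), dtype=np.uint8)
tensor = preprocess(frame)
print(tensor.shape)  # (3, 256, 128)
```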
- one attribute location module may correspond to multiple pedestrian attributes. Therefore, based on the recognition method of the pedestrian attribute recognition system provided by the embodiment of the present disclosure, the amount of calculation for pedestrian attribute recognition for pedestrian images can be reduced, thereby reducing the time consumed to obtain the recognition results corresponding to pedestrian images.
- the recognition device and the training device may be two independent devices, or may be integrated into one device. In the case that the recognition device and the training device are two independent devices, the recognition device can obtain the trained pedestrian attribute recognition system from the training device.
- the identification device and the training device may be servers or terminal devices.
- the server may be a device with data processing capability and data storage capability.
- the server may be a server, or a server cluster composed of multiple servers, or a cloud computing service center, which is not limited.
- the terminal device can be a mobile phone, a tablet computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), an augmented reality (AR) device or a virtual reality (VR) device.
- An embodiment of the present disclosure also provides a training device.
- the training device 1000 includes an acquisition unit 1001 and a training unit 1002 .
- the acquiring unit 1001 is configured to acquire a training sample set, the training sample set includes a plurality of sample pedestrian images, each sample pedestrian image has a corresponding attribute label, and the attribute label is used to indicate the pedestrian attribute existing in the corresponding sample pedestrian image.
- the training unit 1002 is configured to train the pedestrian attribute recognition system according to the training sample set to obtain a trained pedestrian attribute recognition system; wherein, the pedestrian attribute recognition system is any pedestrian attribute recognition system provided in the above implementation.
- the training device 1100 includes a memory 1101 and a processor 1102; the memory 1101 and the processor 1102 are coupled; the memory 1101 is used to store computer program codes, and the computer program codes include computer instructions.
- when the processor 1102 executes the computer instructions, the training device 1100 is caused to execute the steps performed by the training device in the method flows shown in the above method embodiments.
- an embodiment of the present disclosure also provides an identification device.
- the identification device 2000 includes an acquisition unit 2001 and an identification unit 2002 .
- the acquiring unit 2001 is configured to acquire images of pedestrians to be identified.
- the recognition unit 2002 is configured to input the image of the pedestrian to be recognized into the pedestrian attribute recognition system to obtain the recognition result of the pedestrian image to be recognized; wherein the pedestrian attribute recognition system is any pedestrian attribute recognition system provided in the above implementation.
- the identification device 2100 includes a memory 2101 and a processor 2102; the memory 2101 and the processor 2102 are coupled; the memory 2101 is used to store computer program codes, and the computer program codes include computer instructions.
- when the processor 2102 executes the computer instructions, the recognition device 2100 is caused to execute the steps performed by the recognition device in the method flows shown in the above method embodiments.
- Some embodiments of the present disclosure also provide a computer-readable storage medium (for example, a non-transitory computer-readable storage medium) in which a computer program is stored; when the computer program is run on a processor, the processor is caused to execute one or more steps of the training method described in the above method embodiments, or one or more steps of the recognition method described in the above method embodiments.
- the above-mentioned computer-readable storage medium may include, but is not limited to: magnetic storage devices (for example, hard disks, floppy disks or magnetic tapes), optical disks (for example, compact disks (CDs) and digital versatile disks (DVDs)), smart cards and flash memory devices (for example, erasable programmable read-only memories (EPROMs), cards, sticks or key drives).
- Various computer-readable storage media described in this disclosure can represent one or more devices and/or other machine-readable storage media for storing information.
- the term "machine-readable storage medium" may include, but is not limited to, wireless channels and various other media capable of storing, containing and/or carrying instructions and/or data.
- Some embodiments of the present disclosure also provide a computer program product.
- the computer program product includes a computer program; when the computer program is executed on the training device, the processor is caused to execute one or more steps of the training method described in the above method embodiments; or, when the computer program is executed on the recognition device, the processor is caused to execute one or more steps of the recognition method described in the above method embodiments.
Abstract
A pedestrian attribute recognition system, including: at least one attribute location module, each attribute location module corresponding to a plurality of pedestrian attributes; the attribute location module includes a spatial transformation unit and an attribute recognition unit; the spatial transformation unit is used to extract feature information within a discriminable region from the feature information input to the spatial transformation unit, where the discriminable region is related to the plurality of pedestrian attributes corresponding to the attribute location module; the attribute recognition unit is used to obtain, according to the feature information within the discriminable region, the recognition results of the plurality of pedestrian attributes corresponding to the attribute location module.
Claims (18)
- A pedestrian attribute recognition system, characterized by comprising at least one attribute location module, each attribute location module corresponding to a plurality of pedestrian attributes; the attribute location module comprises a spatial transformation unit and an attribute recognition unit; the spatial transformation unit is configured to extract feature information within a discriminable region from the feature information input to the spatial transformation unit, the discriminable region being related to the plurality of pedestrian attributes corresponding to the attribute location module; the attribute recognition unit is configured to obtain, according to the feature information within the discriminable region, recognition results of the plurality of pedestrian attributes corresponding to the attribute location module.
- The system according to claim 1, characterized in that the spatial transformation unit is specifically configured to determine transformation parameters of the discriminable region according to the feature information input to the spatial transformation unit, and to extract the feature information within the discriminable region from the feature information input to the spatial transformation unit according to the transformation parameters of the discriminable region; wherein the transformation parameters include a scaling transformation parameter in the horizontal direction, a scaling transformation parameter in the vertical direction, a translation transformation parameter in the horizontal direction and a translation transformation parameter in the vertical direction.
- The system according to claim 1 or 2, characterized in that the attribute location module further comprises a channel attention unit; the channel attention unit is configured to calibrate the feature information input to the channel attention unit to obtain calibrated feature information, and the calibrated feature information serves as the feature information input to the spatial transformation unit.
- The system according to claim 3, characterized in that the channel attention unit is specifically configured to pass the feature information input to the channel attention unit through a global average pooling layer, a 1×1 convolutional layer, a ReLU activation layer, a 1×1 convolutional layer and a Sigmoid activation layer in sequence to obtain a first calibration vector; multiply the first calibration vector with the feature information input to the channel attention unit channel by channel to obtain a second correction vector; and add the second correction vector to the feature information input to the channel attention unit element by element to obtain the calibrated feature information.
- The system according to any one of claims 1 to 4, characterized in that the at least one attribute location module comprises a first attribute location module and/or a second attribute location module, the first attribute location module being configured to recognize a plurality of pedestrian attributes related to human body parts, and the second attribute location module being configured to recognize a plurality of pedestrian attributes related to the human body as a whole.
- The system according to claim 5, characterized in that the first attribute location module comprises one or more of a head attribute location module, an upper-body attribute location module or a lower-body attribute location module; wherein the head attribute location module is configured to recognize a plurality of pedestrian attributes related to the human head, the upper-body attribute location module is configured to recognize a plurality of pedestrian attributes related to the human upper body, and the lower-body attribute location module is configured to recognize a plurality of pedestrian attributes related to the human lower body.
- The system according to any one of claims 1 to 6, characterized in that the pedestrian attribute recognition system further comprises a feature extraction module, the feature extraction module comprising P feature extraction layers, P being an integer greater than 1; the feature extraction module is specifically configured to pass a pedestrian image through the P feature extraction layers in sequence to extract P pieces of feature information of different levels, one piece of feature information corresponding to one feature extraction layer.
- The system according to claim 7, characterized in that the pedestrian attribute recognition system further comprises a feature fusion module; the feature fusion module is configured to perform fusion processing on the P pieces of feature information of different levels extracted by the feature extraction module to obtain P pieces of fused feature information.
- The system according to claim 8, characterized in that the at least one attribute location module is divided into P groups of attribute location modules, one group of attribute location modules corresponding to one piece of fused feature information, and each group of attribute location modules comprising K attribute location modules, K being an integer greater than 1 and less than M, and M being an integer greater than 1; a group of attribute location modules is configured to output a first pedestrian attribute prediction vector according to the corresponding fused feature information, the first pedestrian attribute prediction vector including recognition results of M pedestrian attributes.
- The system according to claim 9, characterized in that the pedestrian attribute recognition system further comprises a feature recognition module; the feature recognition module is configured to output a second pedestrian attribute prediction vector according to the highest-level feature information extracted by the feature extraction module, the second pedestrian attribute prediction vector including recognition results of the M pedestrian attributes.
- The system according to claim 10, characterized in that the pedestrian attribute recognition system further comprises a result output module; the result output module is configured to determine final recognition results of the M pedestrian attributes according to the first pedestrian attribute prediction vectors output by each group of the P groups of attribute location modules and the second pedestrian attribute prediction vector output by the feature recognition module.
- A method for training the pedestrian attribute recognition system according to any one of claims 1 to 11, characterized in that the method comprises: acquiring a training sample set, the training sample set including a plurality of sample pedestrian images, each sample pedestrian image having a corresponding attribute label, the attribute label being used to indicate the pedestrian attributes present in the corresponding sample pedestrian image; and training the pedestrian attribute recognition system according to the training sample set to obtain a trained pedestrian attribute recognition system.
- A pedestrian attribute recognition method based on the pedestrian attribute recognition system according to any one of claims 1 to 11, characterized in that the method comprises: acquiring a pedestrian image to be recognized; and inputting the pedestrian image to be recognized into the pedestrian attribute recognition system to obtain a recognition result of the pedestrian image to be recognized.
- A training device, characterized by comprising: an acquisition module configured to acquire a training sample set, the training sample set including a plurality of sample pedestrian images, each sample pedestrian image having a corresponding attribute label, the attribute label being used to indicate the pedestrian attributes present in the corresponding sample pedestrian image; and a training module configured to train a pedestrian attribute recognition system according to the training sample set to obtain a trained pedestrian attribute recognition system; wherein the pedestrian attribute recognition system is the pedestrian attribute recognition system according to any one of claims 1 to 11.
- A recognition device, characterized by comprising: an acquisition module configured to acquire a pedestrian image to be recognized; and a recognition module configured to input the pedestrian image to be recognized into a pedestrian attribute recognition system to obtain a recognition result of the pedestrian image to be recognized; wherein the pedestrian attribute recognition system is the pedestrian attribute recognition system according to any one of claims 1 to 11.
- A training device, characterized in that the device comprises a memory and a processor; the memory is coupled to the processor; the memory is configured to store computer program code, the computer program code including computer instructions; wherein, when the processor executes the computer instructions, the device is caused to perform the training method according to claim 12.
- A recognition device, characterized in that the device comprises a memory and a processor; the memory is coupled to the processor; the memory is configured to store computer program code, the computer program code including computer instructions; wherein, when the processor executes the computer instructions, the device is caused to perform the pedestrian attribute recognition method according to claim 13.
- A non-transitory computer-readable storage medium storing a computer program; wherein, when the computer program runs on a training device, the training device is caused to implement the training method according to claim 12; or, when the computer program runs on a recognition device, the recognition device is caused to implement the pedestrian attribute recognition method according to claim 13.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202180003361.XA CN116438586A (zh) | 2021-11-12 | 2021-11-12 | Pedestrian attribute recognition system and training method therefor, and pedestrian attribute recognition method
US18/005,379 US20240249547A1 (en) | 2021-11-12 | 2021-11-12 | Pedestrian attribute recognition method based on a pedestrian attribute recognition system and method for training the same |
PCT/CN2021/130421 WO2023082196A1 (zh) | 2021-11-12 | 2021-11-12 | Pedestrian attribute recognition system and training method therefor, and pedestrian attribute recognition method
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2021/130421 WO2023082196A1 (zh) | 2021-11-12 | 2021-11-12 | Pedestrian attribute recognition system and training method therefor, and pedestrian attribute recognition method
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023082196A1 true WO2023082196A1 (zh) | 2023-05-19 |
Family
ID=86334858
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/130421 WO2023082196A1 (zh) | 2021-11-12 | 2021-11-12 | Pedestrian attribute recognition system and training method therefor, and pedestrian attribute recognition method
Country Status (3)
Country | Link |
---|---|
US (1) | US20240249547A1 (zh) |
CN (1) | CN116438586A (zh) |
WO (1) | WO2023082196A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118072359B * | 2024-04-18 | 2024-07-23 | Zhejiang Shenxiang Intelligent Technology Co., Ltd. | Pedestrian clothing recognition method, device and equipment
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110705474A (zh) * | 2019-09-30 | 2020-01-17 | Tsinghua University | Pedestrian attribute recognition method and device
US20200272902A1 (en) * | 2017-09-04 | 2020-08-27 | Huawei Technologies Co., Ltd. | Pedestrian attribute identification and positioning method and convolutional neural network system |
CN111738074A (zh) * | 2020-05-18 | 2020-10-02 | Shanghai Jiao Tong University | Pedestrian attribute recognition method, system and device based on weakly supervised learning
CN112507978A (zh) * | 2021-01-29 | 2021-03-16 | Changsha Hisense Intelligent System Research Institute Co., Ltd. | Person attribute recognition method, device, equipment and medium
CN113239820A (zh) * | 2021-05-18 | 2021-08-10 | Institute of Automation, Chinese Academy of Sciences | Pedestrian attribute recognition method and system based on attribute localization and association
Also Published As
Publication number | Publication date |
---|---|
US20240249547A1 (en) | 2024-07-25 |
CN116438586A (zh) | 2023-07-14 |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | WWE | WIPO information: entry into national phase | Ref document number: 18005379; Country of ref document: US
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21963644; Country of ref document: EP; Kind code of ref document: A1
 | NENP | Non-entry into the national phase | Ref country code: DE