WO2020015752A1 - Object attribute recognition method, apparatus, computing device and system - Google Patents
Object attribute recognition method, apparatus, computing device and system
- Publication number
- WO2020015752A1 (PCT/CN2019/096873; CN2019096873W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- parts
- attribute recognition
- feature
- attribute
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 68
- 238000000605 extraction Methods 0.000 claims description 47
- 230000015654 memory Effects 0.000 claims description 41
- 230000009466 transformation Effects 0.000 claims description 25
- 238000004364 calculation method Methods 0.000 claims description 15
- 238000005070 sampling Methods 0.000 claims description 10
- 238000004590 computer program Methods 0.000 claims description 6
- 238000013215 result calculation Methods 0.000 claims description 5
- 230000004927 fusion Effects 0.000 claims description 3
- 239000000284 extract Substances 0.000 abstract description 12
- 238000005516 engineering process Methods 0.000 abstract description 7
- 238000013473 artificial intelligence Methods 0.000 abstract 1
- 238000011176 pooling Methods 0.000 description 43
- 238000013527 convolutional neural network Methods 0.000 description 34
- 239000011159 matrix material Substances 0.000 description 30
- 238000012549 training Methods 0.000 description 17
- 239000013598 vector Substances 0.000 description 15
- 230000008569 process Effects 0.000 description 12
- 238000010586 diagram Methods 0.000 description 11
- 230000004913 activation Effects 0.000 description 10
- 230000006870 function Effects 0.000 description 10
- 238000012545 processing Methods 0.000 description 10
- 230000004807 localization Effects 0.000 description 8
- 238000013519 translation Methods 0.000 description 7
- 238000013528 artificial neural network Methods 0.000 description 6
- 210000003423 ankle Anatomy 0.000 description 5
- 210000002683 foot Anatomy 0.000 description 5
- 238000012706 support-vector machine Methods 0.000 description 5
- 238000005457 optimization Methods 0.000 description 4
- 239000000872 buffer Substances 0.000 description 3
- 210000002569 neuron Anatomy 0.000 description 3
- 238000012546 transfer Methods 0.000 description 3
- 210000000707 wrist Anatomy 0.000 description 3
- 230000000694 effects Effects 0.000 description 2
- 210000003127 knee Anatomy 0.000 description 2
- 230000009021 linear effect Effects 0.000 description 2
- 230000001537 neural effect Effects 0.000 description 2
- 238000010606 normalization Methods 0.000 description 2
- 238000012935 Averaging Methods 0.000 description 1
- 241001465754 Metazoa Species 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 210000000988 bone and bone Anatomy 0.000 description 1
- 238000010224 classification analysis Methods 0.000 description 1
- 239000003086 colorant Substances 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000005284 excitation Effects 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000012886 linear function Methods 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000012544 monitoring process Methods 0.000 description 1
- 230000009022 nonlinear effect Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 238000007781 pre-processing Methods 0.000 description 1
- 238000007430 reference method Methods 0.000 description 1
- 238000000611 regression analysis Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
Definitions
- the present invention relates to the field of image processing technology, and in particular, to a method, a device, a computing device, and a system for identifying object attributes.
- Recent attribute recognition models based on deep convolutional neural networks usually operate on the overall image; that is, a pedestrian image is input into a deep convolutional neural network to extract features shared by the pedestrian attributes, and an attribute classifier is then learned for each attribute individually.
- However, pedestrian images usually show a variety of poses; for example, some pedestrians in an image are standing, some are sitting, and some are riding bicycles. It is difficult for a rigid deep convolutional neural network to cope with such changes in pedestrian pose, so the recognition of attributes is inaccurate and poorly robust.
- The technical problem to be solved by the embodiments of the present invention is to provide a method, a device, a computing device, and a system for recognizing object attributes. Recognizing attributes based on M part feature maps determined by M pose keypoints can overcome the effect of the target object's pose on the recognition result, making the attribute recognition of the object more accurate.
- an embodiment of the present invention provides a method for identifying object attributes.
- The method includes: a computing device extracts the features of M parts in a first image according to M pose keypoints to obtain M part feature maps, and further inputs the M part feature maps into a first attribute recognition model to obtain a first attribute recognition result of the target object.
- the first image is an original image or an original feature map extracted from the original image.
- the original image includes a target object, and the target object includes M parts.
- the M pose key points correspond to the M parts one by one.
- The M parts and the M part feature maps correspond one-to-one; each pose keypoint is used to determine the position of the part corresponding to that pose keypoint, and M is a positive integer.
- In this way, the feature maps corresponding to the M parts are extracted from the first image according to the M pose keypoints; that is, the first image is decomposed into M part feature maps that are independent of the pose of the target object.
- These pose-independent part feature maps of the M parts are input to the first attribute recognition model for model training and recognition, so as to overcome the influence of the pose of the target object on the recognition result, making the recognition of object attributes more accurate and robust.
- the computing device extracts the features of M parts in the first image according to the M pose keypoints, and an implementation manner of obtaining the feature map of the M parts may be:
- The computing device inputs the first image to the part localization model and obtains the localization parameters of the parts corresponding to the M pose keypoints.
- The localization parameter of the part corresponding to the first pose keypoint is used to determine the region, in the first image, where the part corresponding to the first pose keypoint is located.
- Then, according to the localization parameters of the parts corresponding to the M pose keypoints, the part feature maps corresponding to the M parts are extracted from the first image by interpolation sampling.
- the first attitude key point is any one of the M attitude key points.
- The part localization model determines part positions according to the pose keypoints, so that no matter what pose the target object is in, each part of the target object can be accurately located and the corresponding part feature map extracted; the first attribute recognition model can then recognize the attributes of the target object based on these part feature maps.
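- As a purely illustrative sketch of the flow described above (the names recognize_attributes, part_localization_model, extract_part_feature and first_attribute_model are hypothetical placeholders, not names used in this application), the pipeline could look as follows:

```python
def recognize_attributes(first_image, pose_keypoints,
                         part_localization_model, extract_part_feature,
                         first_attribute_model):
    """Hypothetical sketch of the part-based recognition flow.

    first_image:    the original image or the original feature map
    pose_keypoints: M pose keypoints, one per part
    """
    # 1. The part localization model predicts one set of localization
    #    (affine transformation) parameters per pose keypoint.
    localization_params = part_localization_model(first_image, pose_keypoints)  # M sets

    # 2. A part feature map is extracted from the first image for each part
    #    by interpolation sampling inside the localized region.
    part_feature_maps = [extract_part_feature(first_image, theta)
                         for theta in localization_params]  # M part feature maps

    # 3. The M part feature maps are fed into the first attribute
    #    recognition model to obtain the first attribute recognition result.
    return first_attribute_model(part_feature_maps)
```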
- The localization parameter of part k corresponding to pose keypoint k is an affine transformation parameter between a first position coordinate and a second position coordinate,
- where the first position coordinate is the position coordinate of part k in the first image,
- and the second position coordinate is the position coordinate in the part feature map corresponding to part k.
- The part feature map corresponding to part k is calculated by the following formula:
- V_k(i) = Σ_{m=1}^{H} Σ_{n=1}^{W} F(m, n) · max(0, 1 − |x_i^k · H − m|) · max(0, 1 − |y_i^k · W − n|)
- where k is the index of the part, k is a positive integer and k ≤ M; F is the first image; V_k is the part feature map corresponding to part k; i is the index of a coordinate position in the part feature map corresponding to part k; H and W are the height and width of the first image; (m, n) is a coordinate position in the first image; and (x_i^k, y_i^k) are the normalized coordinates, in the first image, obtained by applying the affine transformation to the normalized coordinates of coordinate position i in the part feature map corresponding to part k.
- In this way, the affine transformation parameter is used to determine the region where the part corresponding to the first pose keypoint is located in the first image, and the part feature map is determined by interpolation sampling, so that the feature map of each part is extracted from the first image.
- The first attribute recognition model includes M depth feature extraction models and a region-based feature learning model, where the M depth feature extraction models correspond one-to-one with the M parts; the computing device inputs the feature maps of the M parts into the first attribute recognition model, and an implementation manner of obtaining the first attribute recognition result of the target object may be:
- The computing device inputs the M part feature maps into the M depth feature extraction models to obtain M depth part feature maps, where the M depth part feature maps correspond to the M parts one-to-one;
- the depth feature extraction model corresponding to the first part is used to extract the depth part feature map corresponding to the first part from the part feature map corresponding to the first part, where the first part is any one of the M parts;
- the M depth part feature maps are stitched, and the stitched depth part feature maps are input to the region-based feature learning model to obtain the first attribute recognition result of the target object.
- The second attribute recognition result based on the global image (that is, the first image) and the first attribute recognition result based on local regions (that is, the M part feature maps) are fused, so that the resulting third attribute recognition result takes both global and local information into account, further improving the accuracy and robustness of attribute recognition.
- an embodiment of the present application further provides an attribute recognition device, which includes a module or a unit for executing the object attribute recognition method provided by the first aspect or any possible implementation manner of the first aspect.
- an embodiment of the present application further provides a computing device.
- the computing device includes a processor and a memory coupled to the processor.
- The memory is used to store program code, and the processor is used to call the program code stored in the memory to execute the object attribute recognition method provided by the first aspect or any possible implementation manner of the first aspect.
- An embodiment of the present application further provides a computer storage medium, where the computer storage medium is used to store computer software instructions, and the computer software instructions, when executed by a computer, cause the computer to execute the object attribute recognition method described in the first aspect or any possible implementation manner of the first aspect.
- An embodiment of the present application further provides a computer program, where the computer program includes computer software instructions, and when the computer software instructions are executed by a computer, the computer executes the object attribute recognition method described in the first aspect or any possible implementation manner of the first aspect.
- An embodiment of the present application further provides a chip, where the chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory and executes the object attribute recognition method provided by the first aspect or any possible implementation manner of the first aspect.
- The chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor is configured to execute the object attribute recognition method provided in the first aspect or any possible implementation manner of the first aspect.
- a computing device which includes the attribute recognition device in the second aspect described above.
- FIG. 1 is a schematic framework diagram of a convolutional neural network according to an embodiment of the present application.
- FIG. 2 is a schematic diagram of a framework of an object attribute recognition system according to an embodiment of the present application.
- FIG. 3 is a schematic flowchart of an object attribute recognition method according to an embodiment of the present application.
- FIG. 4 is a schematic diagram of another object attribute recognition system framework provided by an embodiment of the present application.
- FIG. 5 is a schematic flowchart of another object attribute recognition method according to an embodiment of the present application.
- FIG. 6 is a schematic structural diagram of an attribute recognition device according to an embodiment of the present invention.
- FIG. 7 is a schematic structural diagram of another attribute recognition device according to an embodiment of the present invention.
- FIG. 8 is a schematic structural diagram of still another computing device according to an embodiment of the present application.
- FIG. 9 is a schematic diagram of a hardware structure of a chip according to an embodiment of the present application.
- Convolutional neural network is a deep neural network with a convolutional structure.
- Convolutional neural networks include a feature extractor consisting of a convolutional layer and a sub-sampling layer.
- the feature extractor can be regarded as a filter, and the convolution process can be regarded as a convolution using a trainable filter and an input image or a convolution feature map.
- a convolution layer refers to a neuron layer in a convolutional neural network that performs convolution processing on an input signal.
- a neuron can be connected to only some of the neighboring layer neurons.
- a convolution layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units.
- Neural units in the same feature plane share weights, and the weights shared here are convolution kernels. Sharing weights can be understood as the way of extracting image information is independent of location. The underlying principle is that the statistical information of one part of the image is the same as the other parts. That means that the image information learned in one part can also be used in another part. So for all locations on the image, the same learned image information can be used. In the same convolution layer, multiple convolution kernels can be used to extract different image information. Generally, the more the number of convolution kernels, the richer the image information reflected by the convolution operation.
- the convolution kernel can be initialized in the form of a random-sized matrix. During the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
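- As an illustration only, a minimal numpy sketch of a convolution layer with shared weights is shown below: each of the K weight matrices has the same depth as the input and produces one output channel, and the K outputs are stacked to form the depth dimension of the convolution output. The function name and array shapes are assumptions made for illustration.

```python
import numpy as np

def conv2d(image, kernels, stride=1):
    """image: (H, W, C); kernels: (K, k, k, C) -- K weight matrices whose
    depth dimension matches the depth of the input."""
    H, W, C = image.shape
    K, k, _, _ = kernels.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w, K))
    for idx in range(K):                      # each kernel produces one output channel
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i*stride:i*stride+k, j*stride:j*stride+k, :]
                out[i, j, idx] = np.sum(patch * kernels[idx])  # shared weights
    return out                                # output depth = number of kernels

# Example: a 32x32 RGB image convolved with 8 random 3x3 kernels
feature_map = conv2d(np.random.rand(32, 32, 3), np.random.rand(8, 3, 3, 3))
print(feature_map.shape)  # (30, 30, 8)
```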
- A convolutional neural network (CNN) 100 may include an input layer 110, convolutional layers / activation layers / pooling layers 120, and a fully connected layer 130, where the activation layers and the pooling layers are both optional.
- the convolutional neural network 100 may include multiple convolutional layers, and any one of the convolutional layers may be connected to an activation layer and / or a pooling layer.
- For example, layers 121, 123, and 125 are convolutional layers and layers 122, 124, and 126 are pooling layers;
- alternatively, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer.
- the output of the convolution layer can be used as the input of the subsequent pooling layer, or it can be used as the input of another convolution layer to continue the convolution operation.
- the input layer 110 mainly performs preprocessing on the input image, including de-averaging and normalization.
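- A minimal sketch of such preprocessing (per-channel mean subtraction and normalization; the exact statistics used are an assumption for illustration):

```python
import numpy as np

def preprocess(image):
    """Mean subtraction ("de-averaging") followed by normalization."""
    image = image.astype(np.float32)
    image -= image.mean(axis=(0, 1), keepdims=True)   # remove per-channel mean
    std = image.std(axis=(0, 1), keepdims=True) + 1e-8
    return image / std                                 # unit variance per channel
```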
- the convolutional layer 121 will be taken as an example to introduce the inner working principle of a convolutional layer.
- the convolution layer 121 can include many convolution kernels. Its role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
- The convolution kernel is essentially a weight matrix, which is usually predefined. During the convolution operation on the image, the weight matrix is slid across the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) in the horizontal direction, to complete the task of extracting a specific feature from the image.
- The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends over the entire depth of the input image.
- Convolving with a single weight matrix produces a convolution output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same shape, are applied.
- the output of each weight matrix is stacked to form the depth dimension of the convolution image.
- the dimensions here can be understood as determined by the "multiple" described above.
- Different weight matrices can be used to extract different features from the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and yet another weight matrix is used to blur out unwanted noise in the image, and so on.
- The multiple weight matrices have the same size (rows × columns), so the feature maps extracted by these weight matrices also have the same size; the extracted feature maps of the same size are then combined to form the output of the convolution operation.
- The weight values in these weight matrices need to be obtained through a large amount of training in practical applications.
- Each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 100 can make correct predictions.
- The initial convolutional layers (such as 121) often extract more general features, which can also be called low-level features; as the depth of the convolutional neural network increases,
- the features extracted by subsequent convolutional layers (such as 126) become more and more complex, for example high-level semantic features.
- An activation layer can be applied after a convolutional layer to introduce non-linear factors into the model, increasing the non-linear properties of the model and the entire convolutional neural network.
- the activation function may include a Tanh function, a ReLU function, a Leaky ReLU function, a Maxout function, and the like.
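- For illustration, minimal numpy definitions of the activation functions mentioned above (the two-piece form of Maxout is one common variant, shown only as an example):

```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def maxout(x, w1, b1, w2, b2):
    # Maxout with two linear pieces: max(x·w1 + b1, x·w2 + b2)
    return np.maximum(x @ w1 + b1, x @ w2 + b2)
```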
- The layers 121 to 126 shown in 120 in Figure 1 may be one convolutional layer followed by one pooling layer,
- or multiple convolutional layers followed by one or more pooling layers.
- the sole purpose of the pooling layer is to reduce the spatial size of the image.
- the pooling layer may include an average pooling operator and / or a maximum pooling operator for sampling the input image to obtain a smaller-sized image.
- the average pooling operator can calculate the pixel values in the image within a specific range to produce an average value as the result of the average pooling.
- the maximum pooling operator can take the pixel with the largest value in the range in a specific range as the result of the maximum pooling.
- the operators in the pooling layer should also be related to the size of the image.
- the size of the output image processed by the pooling layer may be smaller than the size of the image of the input pooling layer.
- Each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding subregion of the image of the input pooling layer.
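- A minimal numpy sketch of the pooling operators described above, in which each output pixel is the maximum or the average of the corresponding sub-region of the input (window size and stride are assumptions for illustration):

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Max or average pooling over strided windows."""
    H, W, C = feature_map.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w, C))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size, :]
            out[i, j] = window.max(axis=(0, 1)) if mode == "max" else window.mean(axis=(0, 1))
    return out
```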
- After processing by the convolutional layers / activation layers / pooling layers 120, the convolutional neural network 100 is not yet able to output the required output information, because, as described above, the convolutional layers / pooling layers 120 only extract features and reduce the parameters brought by the input image. In order to generate the final output information (the required class information or other related information), the convolutional neural network 100 needs to use the fully connected layer 130 to generate one output or a set of outputs of the required number of classes. Therefore, the fully connected layer 130 may include multiple hidden layers (such as 131, 132 to 13n shown in FIG. 1) and an output layer 140. The parameters included in the multiple hidden layers may be obtained by pre-training on relevant training data for the specific task type. In the embodiments of the present application, for the part localization model, the task types are high-level attribute recognition and pose keypoint regression; for the first attribute recognition model or the second attribute recognition model, the task type is high-level attribute recognition.
- After the multiple hidden layers in the fully connected layer 130, the last layer of the entire convolutional neural network 100 is the output layer 140, which has a loss function similar to the classification cross-entropy and is specifically used to calculate the prediction error.
- Once the forward propagation of the entire convolutional neural network 100 (the propagation from 110 to 140 in Fig. 1) is completed,
- the backward propagation (the propagation from 140 to 110 in Fig. 1) starts to update the weight values and biases of the layers mentioned earlier, so as to reduce the loss of the convolutional neural network 100, that is, the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
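- Purely as an illustration of forward propagation, loss computation and backward weight update, the following sketch shows a single fully connected output layer with a sigmoid cross-entropy loss; a real convolutional neural network has many layers, but the principle is the same:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(W, b, x, y, lr=0.1):
    """One forward + backward pass for a single fully connected output layer."""
    # forward propagation
    z = W @ x + b
    p = sigmoid(z)
    loss = -(y * np.log(p + 1e-8) + (1 - y) * np.log(1 - p + 1e-8)).sum()
    # backward propagation: gradient of the cross-entropy w.r.t. z is (p - y)
    dz = p - y
    W -= lr * np.outer(dz, x)   # update weights to reduce the loss
    b -= lr * dz                # update biases
    return W, b, loss
```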
- the convolutional neural network 100 shown in FIG. 1 is only used as an example of a convolutional neural network. In specific applications, the convolutional neural network may also exist in the form of other network models.
- Bottom-level features are features extracted directly from the original image.
- Middle-level features, which are between the bottom-level features and the semantic features, are extracted through the convolutional layers / pooling layers and are the features of a certain layer in the convolutional neural network.
- Semantic features, which have a direct semantic meaning or are directly related to semantics, are referred to as attributes in the embodiments of the present application.
- A support vector machine is a supervised learning model with associated learning algorithms that analyze data and recognize patterns; it can be used for pattern recognition, classification, and regression analysis.
- FIG. 2 is a schematic diagram of an object attribute recognition system framework provided by an embodiment of the present application.
- The object attribute recognition system may include a first attribute recognition model, a second attribute recognition model, a part localization model, a part feature map extraction module, and the like, where:
- the first image is an image to be identified, and may be an original image or an original feature map extracted from the original image, where the original image includes a target object and the target object includes M parts.
- the original feature map is a middle-level feature obtained by extracting the original image through one or more convolutional / pooling layers of the second attribute recognition model.
- the embodiment of the present invention is described by using the first image as an original feature map. It can be understood that the embodiment of the present application may not include the second attribute recognition model, and the first image is the original image.
- The part localization model can be a convolutional neural network, which is used to obtain the localization parameters of the M parts according to the input first image; it usually includes an input layer, one or more convolutional layers, one or more pooling layers, a fully connected layer, and so on.
- the positioning parameter of the part is used to determine a region of the part corresponding to the key point of the posture in the first image.
- the positioning parameters of the part can be affine transformation parameters, including translation parameters and transformation parameters.
- The translation parameters include a horizontal translation parameter and a vertical translation parameter; the coordinates determined by the horizontal and vertical translation parameters are the position coordinates, in the first image, of the pose keypoint obtained by the part localization model.
- The first image is input to the part localization model, and the M pose keypoints and the localization parameters of the parts corresponding to the M pose keypoints are obtained. It can be understood that the part localization model outputs M sets of localization parameters, and each set of localization parameters is used to determine one part.
- The part feature map extraction module is used to determine, according to the input M sets of localization parameters and the first image, the regions in the first image where the parts corresponding to the M pose keypoints are located, so as to obtain M part feature maps; the M parts and the M part feature maps correspond one-to-one.
- M positioning parameters are input to a feature map extraction module, and the feature map extraction module extracts M feature maps corresponding to M locations from the first image through interpolation sampling.
- the first attribute recognition model is used to extract the first attribute recognition result of each of the L attributes of the target object from the M part feature maps input to the model, where M and L are positive integers.
- the second attribute recognition model is used to extract a second attribute recognition result of each of the L attributes of the target object from the original image input to the model.
- the second attribute recognition model may be a convolutional neural network, which may include an input layer, one or more convolutional layers, one or more pooling layers, and a fully connected layer. It can be understood that the second attribute recognition model performs attribute recognition based on the entirety of the original image.
- The first attribute recognition model may include M depth feature extraction models, a first stitching module, and a region-based feature learning model.
- M depth feature extraction models correspond to M parts one by one
- The depth feature extraction model corresponding to part j is used to extract the depth part feature map corresponding to part j from the part feature map corresponding to part j, where j is the index of the part, j is a positive integer, and j ≤ M.
- the depth feature extraction model may include one or more convolutional layers, one or more pooling layers, fully connected layers, etc., to extract the depth features of the parts corresponding to the part feature maps from the input part feature maps. For example, a part feature map corresponding to the part j is input into a depth feature extraction model corresponding to the part j to extract a depth part feature map for the part j from the part feature map corresponding to the part j.
- the stitching module stitches the feature maps of the depth parts corresponding to the M parts output by the M depth feature extraction models.
- the stitched deep part feature maps are input to a regional feature-based learning model to obtain a first attribute recognition result for each of the L attributes of the object.
- the regional feature-based learning model may include one or more convolutional layers, pooling layers, fully connected layers, and the like. In another embodiment of the present application, the region-based feature learning model may also include only a fully connected layer.
- the first attribute recognition system may further include a second stitching module, and the second stitching module is configured to stitch the M part feature maps.
- the M part feature maps are input to the first attribute recognition model.
- the first attribute recognition model may include one or more convolutional layers, one or more pooling layers, a fully connected layer, and the like.
- The first attribute recognition model extracts the first attribute recognition result of each of the L attributes of the object from the stitched M part feature maps. It can be understood that this first attribute recognition model is a model learned based on the M part feature maps.
- The attribute recognition system may further include a result fusion module, configured to fuse the first attribute recognition result of each of the L attributes of the object obtained by the first attribute recognition model with the second attribute recognition result of each of the L attributes obtained by the second attribute recognition model,
- and to calculate the third attribute recognition result of each of the L attributes.
- the third attribute recognition result may also be converted into an attribute recognition probability through a Sigmoid function to indicate the predicted probability of the attribute.
- each model is a trained model.
- the first attribute recognition model and the part positioning model may be trained together.
- the areas related to the key points of different poses can share the feature learning network of the front end, and learn the affine transformation parameters of the respective related areas.
- The part localization model is supervised through two tasks: one is high-level attribute recognition, and the other is pose keypoint regression.
- High-level attribute recognition can be optimized using cross-entropy.
- For the attribute recognition task, the gradient information is propagated from the back-end region-based feature learning model, passed through the M depth feature extraction models, and finally passed to the part localization model.
- For the pose keypoint regression task, a Euclidean loss can be used.
- In that case the gradient information is transmitted directly to the part localization model.
- The gradient information from the attribute recognition optimization target and the gradient information from the pose keypoint regression optimization target are used together to update the parameters of the part localization model. It should be noted that the pose keypoint regression loss is used so that, for each pose keypoint, the part region related to that pose keypoint can be better learned.
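- A minimal sketch of how the two supervision signals could be combined (the simple weighted sum with a coefficient lam is an assumption for illustration, not necessarily the exact scheme used here):

```python
import numpy as np

def attribute_cross_entropy(scores, labels):
    """Binary cross-entropy over L attribute scores (labels in {0, 1})."""
    p = 1.0 / (1.0 + np.exp(-scores))
    return -(labels * np.log(p + 1e-8) + (1 - labels) * np.log(1 - p + 1e-8)).mean()

def keypoint_euclidean_loss(pred_keypoints, gt_keypoints):
    """Euclidean (L2) regression loss on the M predicted pose keypoints."""
    return 0.5 * np.sum((pred_keypoints - gt_keypoints) ** 2)

def total_loss(scores, labels, pred_kpts, gt_kpts, lam=1.0):
    # gradients of both terms flow back into the part localization model parameters
    return attribute_cross_entropy(scores, labels) + lam * keypoint_euclidean_loss(pred_kpts, gt_kpts)
```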
- the first attribute recognition model, the part positioning model, and the second attribute recognition model may be separately trained.
- The first attribute recognition model is trained based on M part feature maps, and the M part feature maps are acquired from the first image based on the localization parameters of the M parts obtained by inputting the first image into the trained part localization model; meanwhile, the second attribute recognition model is trained based on the original image or the first image.
- the computing device can be a terminal device or a server.
- the terminal device may be a mobile phone, a desktop computer, a portable computer, a tablet computer, or other electronic devices that can perform part or all of the processes of the object attribute recognition method in this application, which is not limited in this application.
- The first attribute recognition model, the second attribute recognition model, and the part localization model may be machine learning models such as a neural network, a convolutional neural network, or a support vector machine, which is not limited in the embodiments of the present application.
- The object attribute recognition system described in the embodiments of the present application can be applied to fields such as attribute-based object retrieval and analysis.
- For example, pedestrian attribute recognition uses computer vision technology to intelligently analyze pedestrian images and determine various fine-grained attributes of a pedestrian, such as gender, age, the color and type of clothing, and whether a backpack is carried; these attributes can further be applied to attribute-description-based pedestrian search and the like, so as to quickly find a target pedestrian.
- the object attribute recognition method in the embodiment of the present application will be described below with reference to the frame diagram of the object attribute recognition system in FIG. 2 and the schematic flowchart of the object attribute recognition method shown in FIG. 3.
- the execution subject of the object attribute identification method in this application may be a computing device, an attribute identification device, a processor in the computing device, or a distributed computer system.
- the embodiment of the present application is described with a computing device as an example.
- The object attribute recognition method may include the following steps:
- S1: The computing device extracts the features of the M parts in the first image according to the M pose keypoints, and obtains the feature maps of the M parts.
- the first image is an original image or an original feature map extracted from the original image.
- the original image includes a target object, and the target object includes M parts.
- The M pose keypoints correspond to the M parts one-to-one, the M parts correspond to the M part feature maps one-to-one, and M is a positive integer.
- S2: The feature maps of the M parts are input into a first attribute recognition model to obtain a first attribute recognition result of the target object.
- the first image may be an original image, and the original image includes a target object, and the target object may be a person, an animal, or an object, such as a car, a bicycle, or the like, which is not limited in the embodiment of the present application.
- the first image may also be an original feature map obtained by extracting middle-level features of the original image.
- the posture key point is a position point on the target object in the first image, and is used to determine the position of the position corresponding to the posture key point in the target object. It can be understood that the key points of the target object, part and attitude in the original image can be mapped to the original feature map.
- Part positioning refers to extracting a part region of a pedestrian, such as a head region or a foot region, from a feature map (referred to as a first image in this application) including an entire object (such as a pedestrian).
- the part is an area related to the key points of the pedestrian posture.
- the posture of the target object can be determined based on the key points of the posture in the first image, and the position of the part corresponding to the key points of the posture can be determined.
- An attitude key point is a position point on the first image.
- the physical meaning of the key points of the posture in the original image can be the key points of the human skeleton, for example, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, Left hip, left knee, left ankle, top of head, neck, etc.
- For example, if the pose keypoint is the right ankle,
- the part is centered on the right-ankle keypoint and covers the entire foot area, so that the attributes related to the foot can be better recognized. Because a pedestrian's pose is constantly changing, determining the parts of the pedestrian based on pose keypoints is more accurate.
- Taking a pedestrian as the target object as an example:
- if the pose keypoint is the right ankle, the corresponding part is the right foot;
- if the pose keypoint is the left wrist, the corresponding part is the left hand;
- if the pose keypoint is the left elbow, the corresponding part is the left arm, and so on.
- the object may also be a car, and the part may also be a wheel, a window, a door, etc., which is not limited in the embodiment of the present application.
- the first attribute recognition result includes recognition results of a plurality of attributes of the target object, and specifically includes a predicted score of each of the plurality of attributes.
- For example, the plurality of attributes are L attributes, where L is a positive integer.
- the first attribute is any one of the L attributes.
- the prediction result of the first attribute can be mapped to obtain the attribute recognition result of the first attribute.
- the prediction score may be a probability value, used to indicate a probability that the target object includes the first attribute, for example, a probability that the target object is female.
- L is a set value.
- Object attributes have semantic characteristics.
- the target object is a pedestrian.
- The attributes of the target object can be gender (male, female), age (such as juvenile, middle-aged, elderly), ethnicity (such as Han, Uyghur), figure (fat, thin, standard), top style (short sleeve, long sleeve), top color (black, red, blue, green, white, yellow), and so on.
- the convolutional feature extractor of the middle layer may be one or more convolutional layers and pooling layers in the second attribute recognition model.
- The first image being the original feature map, that is, the middle-level feature extracted from the original image I by the convolutional feature extractor, is used as an example for illustration. It can be understood that the first image may also be the original image itself, which is not limited here.
- the first attribute recognition model performs attribute recognition based on M part feature maps determined by M pose key points, which can overcome the influence of the target object ’s pose on the recognition result, and make the attribute recognition of the object more accurate, and Robustness is good.
- The computing device extracts the features of M parts in the first image according to the M pose keypoints and obtains the feature maps of the M parts; that is, a possible implementation of step S1 may include the following steps:
- The first image is input to the part localization model, and the localization parameters of the parts corresponding to the M pose keypoints are obtained.
- The first pose keypoint is any one of the M pose keypoints,
- and the localization parameter of the part corresponding to the first pose keypoint is used to determine the region where the part corresponding to the first pose keypoint is located in the original feature map.
- For example, M is a preset value, such as 14, and the M pose keypoints can be 14 human skeleton keypoints.
- The part localization model can consist of convolutional layers, activation layers, pooling layers, and fully connected layers.
- Pose keypoint k corresponds to part k.
- The localization parameter of part k corresponding to pose keypoint k is an affine transformation parameter between a first position coordinate and a second position coordinate,
- where the first position coordinate is the position coordinate of part k in the first image,
- and the second position coordinate is the position coordinate in the part feature map corresponding to part k.
- The transformation relationship between the first position coordinate and the second position coordinate is:
- (x_i^k, y_i^k)^T = A_k · (x̂_i^k, ŷ_i^k)^T + (t_x^k, t_y^k)^T
- where k is the index of the part and i is the index of a coordinate position in the part feature map corresponding to part k; (x̂_i^k, ŷ_i^k) are the normalized coordinates of coordinate position i in the part feature map corresponding to part k (the second position coordinate), and (x_i^k, y_i^k) are the normalized coordinates in the first image after the affine transformation (the first position coordinate). The localization parameter of part k, that is, the affine transformation parameter between the first position coordinate and the second position coordinate, consists of t_x^k and t_y^k, which are the horizontal and vertical translation parameters respectively and are also the coordinates of the pose keypoint corresponding to part k, and A_k, which is the transformation parameter.
- The position coordinates here may be normalized coordinates, so (t_x^k, t_y^k) is also the normalized coordinate value of pose keypoint k.
- The part feature map corresponding to part k is determined by the following formula:
- V_k(i) = Σ_{m=1}^{H} Σ_{n=1}^{W} F(m, n) · max(0, 1 − |x_i^k · H − m|) · max(0, 1 − |y_i^k · W − n|)
- where k is the index of the part, k is a positive integer and k ≤ M; F is the first image; V_k is the part feature map corresponding to part k; i is the index of a coordinate position in the part feature map corresponding to part k; H is the height of the first image, that is, the number of vertical pixels in the first image; W is the width of the first image, that is, the number of horizontal pixels in the first image; (m, n) is a coordinate position in the first image; and (x_i^k, y_i^k) are the normalized coordinates, in the first image, obtained by applying the affine transformation to coordinate position i in the part feature map corresponding to part k.
- The positions contributing to the sum are determined by the max function, so that the neighboring pixels are sampled by interpolation to determine the value at coordinate position i in the part feature map corresponding to part k.
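- The interpolation sampling above can be sketched as follows in numpy; the layout of the localization parameter theta_k as (s_x, s_y, t_x, t_y) and the pairing of the two normalized coordinates with the row and column axes are assumptions made only for illustration:

```python
import numpy as np

def extract_part_feature(F, theta_k, out_h, out_w):
    """Interpolation sampling sketch for one part k.

    F:        first image / feature map, shape (H, W, C)
    theta_k:  (s_x, s_y, t_x, t_y) -- assumed layout of the transformation
              and translation parameters; (t_x, t_y) are the normalized
              coordinates of pose keypoint k
    """
    H, W, C = F.shape
    sx, sy, tx, ty = theta_k
    Vk = np.zeros((out_h, out_w, C))
    for a in range(out_h):
        for b in range(out_w):
            # normalized coordinates of position i = (a, b) in the part feature map
            xh = a / max(out_h - 1, 1)
            yh = b / max(out_w - 1, 1)
            # affine transformation into normalized first-image coordinates
            x = sx * xh + tx
            y = sy * yh + ty
            # bilinear interpolation: max(0, 1 - |.|) weights over neighboring pixels
            for m in range(H):          # m runs 1..H in the formula, 0-based here
                wm = max(0.0, 1.0 - abs(x * H - (m + 1)))
                if wm == 0.0:
                    continue
                for n in range(W):
                    wn = max(0.0, 1.0 - abs(y * W - (n + 1)))
                    if wn > 0.0:
                        Vk[a, b] += F[m, n] * wm * wn
    return Vk
```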
- The target object includes M pose keypoints, that is, M parts; therefore, through the above step S12, M part feature maps, V_1 to V_M, can be generated.
- Step S2, that is, the computing device inputting the M part feature maps into the first attribute recognition model to obtain the first attribute recognition result of the target object, may include but is not limited to the following two implementations.
- the architecture of the first attribute recognition model may be the first attribute recognition model shown in FIG. 2, and may include M depth feature extraction models corresponding to M parts one-to-one, a first stitching module, and a region-based feature learning model.
- the depth feature extraction model may include one or more convolutional layers, one or more pooling layers, fully connected layers, etc., to extract the depth features of the part corresponding to the part feature map from the input part feature map.
- the M feature maps are respectively input to the M depth feature extraction models to obtain M depth feature maps.
- M depth part feature maps correspond to M parts one by one
- The depth feature extraction model corresponding to part j is used to extract the depth part feature map corresponding to part j from the part feature map corresponding to part j, where j is the index of the part,
- j is a positive integer, and j ≤ M. The computing device stitches the extracted M depth part feature maps and inputs the stitched depth part feature maps into the region-based feature learning model to obtain the first attribute recognition result of the target object.
- Each depth feature extraction model may include one or more convolutional layers, one or more pooling layers, and a fully connected layer, and is used to extract, from the input part feature map, the depth features of the corresponding part.
- the first stitching module may use the horizontal stitching or the vertical stitching to stitch the feature maps of the M depth parts.
- the embodiment of the present application uses vertical stitching as an example for illustration.
- the first attribute recognition model can be obtained through separate training, that is, the M deep feature extraction models and the region-based learning model can be trained as a whole.
- the weights of the feature maps of each depth part can be determined through training, and the stitched feature maps of the depth parts are input to the trained region-based feature learning model to obtain the first attribute recognition result of the target object.
- The first attribute recognition result of the target object includes the recognition results of L attributes, and the recognition result of attribute j can be expressed as:
- Y1_j = W_j^T · [f_local-1(V_1); f_local-2(V_2); …; f_local-M(V_M)]
- where j is the index of an attribute, j is a positive integer and j ≤ L; Y1_j is the recognition result of attribute j of the target object; f_local-k denotes the depth feature extraction model of part k, so that f_local-k(V_k) is the depth part feature map corresponding to part k extracted by the depth feature extraction model corresponding to part k; [· ; · ; …] denotes the stitching of the M depth part feature maps; and W_j^T is a weight matrix obtained through training, which represents the weights of the M depth part feature maps.
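- A sketch of this expression, assuming (for illustration only) that each depth feature extraction model returns an array that is flattened before stitching:

```python
import numpy as np

def first_attribute_results(part_feature_maps, depth_extractors, W):
    """part_feature_maps: the M part feature maps V_1..V_M;
    depth_extractors:  the M depth feature extraction models f_local-1..f_local-M
                       (any callables returning arrays);
    W:                 (L, D) weight matrix learned through training, where D is
                       the total length of the stitched depth part features."""
    deep_parts = [f(V).ravel() for f, V in zip(depth_extractors, part_feature_maps)]
    stitched = np.concatenate(deep_parts)   # stitching of the M depth part feature maps
    return W @ stitched                     # row j gives Y1_j = W_j^T · stitched
```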
- The object attribute recognition system includes a part localization model, a part feature map extraction module, a second stitching module, and a first attribute recognition model, where:
- the part localization model is used to determine the localization parameters of the M parts; it is any one of the part localization models described in the above method or system embodiments, and for its specific implementation, reference may be made to the related descriptions of FIG. 2 or FIG. 3 above, which are not repeated here.
- the component feature map extraction module is used for extracting M feature maps corresponding to the M parts from the first image according to the positioning parameters of the corresponding positions of the M pose key points respectively.
- the second stitching module is used for stitching the features of M parts.
- the first attribute recognition model may include one or more convolutional layers, one or more pooling layers, a fully connected layer, an output layer, and the like.
- the stitched M part feature maps are input to a first attribute recognition model.
- the first attribute recognition model extracts the first attribute recognition result of the target object from the M part feature maps after stitching.
- the first attribute recognition model can be obtained through separate training. By inputting the spliced M part feature maps into the first attribute recognition model, the first attribute recognition result of the target object can be obtained.
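- A minimal sketch of this second implementation, in which the M part feature maps are stitched first and a single first attribute recognition model operates on the stitched result (all names are placeholders):

```python
import numpy as np

def recognize_with_stitched_parts(part_feature_maps, first_attribute_model):
    # stitch the M part feature maps (vertical stitching along the first spatial axis)
    stitched = np.concatenate(part_feature_maps, axis=0)
    # the single first attribute recognition model, trained on stitched part
    # feature maps, outputs the first attribute recognition result (L scores)
    return first_attribute_model(stitched)
```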
- FIG. 5 is a schematic flowchart of another object attribute recognition method according to an embodiment of the present application.
- the object attribute recognition method may include the following steps in addition to steps S1 and S2 described in FIG. 3. :
- S3: The first image is input into the second attribute recognition model, and the second attribute recognition result of the target object is obtained.
- the second attribute recognition result includes recognition results of a plurality of attributes of the target object, and specifically includes a prediction score of each of the plurality of attributes. For example, it includes L attributes, L is a positive integer, and the first attribute is any one of the L attributes.
- the prediction result of the first attribute can be mapped to obtain the recognition result of the first attribute.
- the prediction score may be a probability value, used to indicate a probability that the target object includes the first attribute, for example, a probability that the target object is female.
- the second attribute recognition model is used to extract a second attribute recognition result of the target object according to the first image input to the model.
- the second attribute recognition model may be a convolutional neural network, which may include an input layer, one or more convolutional layers, an activation layer, a pooling layer, and a fully connected layer. It can be understood that the second attribute recognition model performs attribute recognition based on the entirety of the first image including the target object.
- In this case, the first image input to the second attribute recognition model is the original image including the object, and the original feature map input to the part localization model is a feature extracted by one or more convolutional layers of the second attribute recognition model.
- The second attribute recognition result of the target object includes the recognition results of L attributes, and the recognition result of attribute j can be expressed as:
- Y2_j = w_j^T · f_global(I)
- where Y2_j is the second attribute recognition result of attribute j of the target object; f_global is a global depth feature extractor learned on the basis of sample images; I is the image input to the second attribute recognition model; and w_j is the parameter of attribute j, which is obtained through learning.
- The third attribute recognition result of object attribute j may be a linear weighted sum of the first attribute recognition result of object attribute j and the second attribute recognition result of object attribute j, where j is the index of the attribute, j is a positive integer, and j ≤ L, that is:
- Y3_j = α · Y1_j + β · Y2_j
- where α and β are constants greater than 0.
- Y1 j is the recognition result of attribute j obtained by the first attribute recognition model
- Y2 j is the recognition result of attribute j obtained by the second attribute recognition model
- Y3 j is the third attribute recognition result of the target object attribute j
- For example, α may be 0.8 and β may be 0.5.
- Steps S3 and S4 and steps S1 and S2 can be performed in any order, that is, steps S3 and S4 can be performed before steps S1 or S2, can be performed after steps S1 or S2, and can be performed simultaneously with steps S1 or S2.
- This is not limited in the embodiments of the present application.
- In this way, for each attribute, the first attribute recognition model based on the part feature maps of the first image and
- the second attribute recognition model based on the global first image are used to obtain a first attribute recognition result and a second attribute recognition result, respectively; a weighted sum of the first attribute recognition result and the second attribute recognition result is then computed to obtain a third attribute recognition result for the attribute, and the third attribute recognition result is used as the final score of the attribute, improving the accuracy of object attribute recognition.
- first attribute recognition result, the second attribute recognition result, or the third attribute recognition result may be converted into a predicted probability of the attribute.
- For example, the third attribute recognition result is converted into an attribute recognition probability through a Sigmoid function to indicate the predicted probability of the attribute:
- P_j = 1 / (1 + e^(−Y3_j))
- where j is the index of the attribute, j is a positive integer and j ≤ L; P_j denotes the predicted probability of attribute j; and Y3_j is the third attribute recognition result of object attribute j.
- the probability that the age of the subject is middle age is 0.88
- the probability of juveniles is 0.21
- the probability of old age is 0.1.
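- Putting the fusion and the probability conversion together in one sketch (α = 0.8 and β = 0.5 are the example values given above):

```python
import numpy as np

def fuse_and_convert(Y1, Y2, alpha=0.8, beta=0.5):
    """Y1, Y2: first and second attribute recognition results for the L attributes."""
    Y3 = alpha * np.asarray(Y1) + beta * np.asarray(Y2)   # third attribute recognition result
    return 1.0 / (1.0 + np.exp(-Y3))                      # Sigmoid -> predicted probability per attribute
```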
- the attribute recognition device 60 may include a part feature extraction unit 601 and a first attribute recognition unit 602, where:
- A part feature extraction unit 601 is configured to extract the features of M parts in the first image according to the M pose keypoints to obtain M part feature maps, where the first image is an original image or an original feature map extracted from the original image;
- the original image includes a target object, the target object includes the M parts, the M pose keypoints correspond to the M parts one-to-one, and the M parts correspond to the M part feature maps one-to-one; each pose keypoint is used to determine the position of the part corresponding to that pose keypoint, and M is a positive integer;
- a first attribute recognition unit 602 is configured to input the feature maps of the M parts into a first attribute recognition model to obtain a first attribute recognition result of the target object.
- the part feature extraction unit 601 is specifically configured to:
- the first image is input to a part localization model, and the localization parameters of the parts corresponding to the M pose keypoints are obtained,
- where the localization parameter of the part corresponding to the first pose keypoint is used to determine the region where the part corresponding to the first pose keypoint is located in the first image;
- according to the localization parameters of the parts corresponding to the M pose keypoints, the part feature maps corresponding to the M parts are extracted from the first image by interpolation sampling.
- the localization parameter of part k corresponding to pose keypoint k is an affine transformation parameter between a first position coordinate and a second position coordinate,
- where the first position coordinate is the position coordinate of part k in the first image,
- and the second position coordinate is the position coordinate in the part feature map corresponding to part k;
- the part feature map corresponding to part k is calculated by the following formula:
- V_k(i) = Σ_{m=1}^{H} Σ_{n=1}^{W} F(m, n) · max(0, 1 − |x_i^k · H − m|) · max(0, 1 − |y_i^k · W − n|)
- where k is the index of the part, k is a positive integer and k ≤ M; F is the first image; V_k is the part feature map corresponding to part k; i is the index of a coordinate position in the part feature map corresponding to part k; and (x_i^k, y_i^k) are the normalized coordinates, in the first image, obtained by applying the affine transformation to coordinate position i in the part feature map corresponding to part k.
- the first attribute recognition model includes M depth feature extraction models and a region-based feature learning model, where the M depth feature extraction models correspond to the M parts one-to-one, and
- the first attribute recognition unit 602 is specifically configured to:
- the M part feature maps are input into the M depth feature extraction models to obtain M depth part feature maps, where the M depth part feature maps correspond to the M parts one-to-one, and the depth feature extraction model corresponding to the first part is used to extract the depth part feature map corresponding to the first part from the part feature map corresponding to the first part, where the first part is any one of the M parts;
- the M depth part feature maps are stitched, and the stitched depth part feature maps are input into the region-based feature learning model to obtain the first attribute recognition result of the target object.
- The attribute recognition apparatus 70 may further include a second attribute recognition unit 603, configured to input the first image into a second attribute recognition model and obtain a second attribute recognition result of the target object;
- For the implementation of each unit, reference may also be made to the corresponding descriptions of the method embodiments above, which are not repeated in this embodiment of the present application.
- FIG. 8 is a schematic structural diagram of another computing device according to an embodiment of the present application.
- the computing device 80 may include, but is not limited to, a processor 801 and a memory 802, where the processor 801 is connected to the memory 802 through a bus 803.
- the memory 802 may be a read-only memory (ROM), a random access memory (RAM), or another type of memory.
- the memory 802 is configured to store data, such as the original image, the original feature map, the part feature maps, or the depth feature maps, and various software programs, such as the object attribute recognition program in this application.
- optionally, the computing device 80 may further include at least one communication interface 804, which is used to implement data exchange between the computing device 80 and a terminal, a server, or another computing device.
- the processor 801 may be a central processing unit (CPU); the processor 801 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
- a general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
- the processor 801 is configured to call the data and program code stored in the memory 802 to execute the following:
- extract features of M parts from the first image according to M pose keypoints to obtain M part feature maps, where the first image is an original image or an original feature map extracted from the original image, the original image includes a target object, the target object includes the M parts, the M pose keypoints correspond one-to-one to the M parts, and the M parts correspond one-to-one to the M part feature maps;
- each pose keypoint is used to determine the position of the part corresponding to that pose keypoint, and M is a positive integer; and
- input the M part feature maps into a first attribute recognition model to obtain a first attribute recognition result of the target object.
- when extracting the features of the M parts from the first image according to the M pose keypoints to obtain the M part feature maps, the processor 801 specifically executes the following:
- input the first image into a part localization model to obtain localization parameters of the parts corresponding to the M pose keypoints, where the localization parameter of the part corresponding to a first pose keypoint is used to determine, in the first image, the region in which the part corresponding to the first pose keypoint is located, and the first pose keypoint is any one of the M pose keypoints; and
- extract, by interpolation sampling and according to the localization parameters of the parts corresponding to the M pose keypoints, the M part feature maps corresponding to the M parts from the first image.
- the localization parameter of the part k corresponding to pose keypoint k is an affine transformation parameter between a first position coordinate and a second position coordinate, where the first position coordinate is the position coordinate of the part k in the first image, and the second position coordinate is a position coordinate in the part feature map corresponding to the part k; the part feature map corresponding to the part k is calculated by an interpolation-sampling formula in which: k is the index of a part, k is a positive integer and k ≤ M; F is the first image; V_k is the part feature map corresponding to the part k; i is the index of a coordinate position in the part feature map corresponding to the part k; H is the height of the first image; W is the width of the first image; the normalized coordinates, in the first image, of the coordinate position i in the part feature map corresponding to the part k are obtained by applying the affine transformation; and (m, n) is a coordinate position in the first image.
- the first attribute recognition model includes M depth feature extraction models and a region-based feature learning model, where the M depth feature extraction models correspond one-to-one to the M parts, and
- when inputting the M part feature maps into the first attribute recognition model to obtain the first attribute recognition result of the target object, the processor 801 executes the following:
- input the M part feature maps into the M depth feature extraction models respectively to obtain M depth part feature maps, where the M depth part feature maps correspond one-to-one to the M parts, the depth feature extraction model corresponding to a first part is used to extract, from the part feature map corresponding to the first part, the depth part feature map corresponding to the first part, and the first part is any one of the M parts;
- concatenate the M extracted depth part feature maps; and
- input the concatenated depth part feature map into the region-based feature learning model to obtain the first attribute recognition result of the target object.
- the processor 801 is further configured to execute the following: input the first image into a second attribute recognition model to recognize a second attribute recognition result of the target object; and calculate a third recognition result of the target object according to the first recognition result and the second recognition result, where the third recognition result is calculated as Y3 = αY1 + βY2, α and β are constants greater than 0, Y1 is the first attribute recognition result, and Y2 is the second attribute recognition result.
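As a minimal illustration of this fusion step, here is a short sketch that computes the weighted sum and optionally passes it through a sigmoid to obtain a per-attribute probability; the weight values are example constants greater than 0 and the function name is illustrative, not mandated by the source.

```python
import torch

def fuse_and_score(y1, y2, alpha=0.8, beta=0.5):
    y3 = alpha * y1 + beta * y2          # Y3 = alpha * Y1 + beta * Y2
    return y3, torch.sigmoid(y3)         # fused result and per-attribute probability
```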
- for the implementation of each component, reference may also be made to the corresponding description in the foregoing method embodiments; details are not described again in the embodiments of the present application.
- FIG. 9 is a schematic diagram of a chip hardware structure according to an embodiment of the present application.
- the chip includes a neural network processor 90.
- the chip can be disposed in the attribute recognition apparatus shown in FIG. 6 and FIG. 7 to perform the computation of each unit in the attribute recognition apparatus.
- the chip can also be disposed in the computing device 80 shown in FIG. 8 to complete the object attribute recognition of the computing device and output the first attribute recognition result and the second attribute recognition result.
- the algorithms of the layers in the convolutional neural network shown in FIG. 1 can all be implemented in the chip shown in FIG. 9.
- the neural network processor 90 may be an NPU, a TPU, a GPU, or any other processor suitable for large-scale exclusive-OR operation processing. Taking the NPU as an example: the NPU can be mounted on a host CPU as a coprocessor, and the host CPU assigns tasks to it. The core part of the NPU is the arithmetic circuit 903; the controller 904 controls the arithmetic circuit 903 to fetch matrix data from the memories (901 and 902) and perform multiply-accumulate operations.
- in some implementations, the arithmetic circuit 903 internally includes a plurality of processing engines (PEs).
- in some implementations, the arithmetic circuit 903 is a two-dimensional systolic array; the arithmetic circuit 903 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition.
- in some implementations, the arithmetic circuit 903 is a general-purpose matrix processor.
- for example, suppose there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 903 fetches the weight data of the matrix B from the weight memory 902 and buffers it on each PE in the arithmetic circuit 903.
- the arithmetic circuit 903 fetches the input data of the matrix A from the input memory 901, performs a matrix operation on the input data of the matrix A and the weight data of the matrix B, and stores the obtained partial results or final result of the matrix in the accumulator 908.
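The accumulation of partial results can be illustrated in software. The following NumPy sketch computes C = A @ B by keeping one weight tile of B "resident", streaming the matching input tile of A, and accumulating partial products, which is the role the accumulator plays above; the tile size is an illustrative assumption, not a hardware value.

```python
import numpy as np

def tiled_matmul(A, B, tile=16):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))                      # accumulator: partial results are summed here
    for k0 in range(0, K, tile):
        B_tile = B[k0:k0 + tile, :]           # weight tile buffered for this step
        A_tile = A[:, k0:k0 + tile]           # matching input tile streamed in
        C += A_tile @ B_tile                  # partial result accumulated
    return C

A = np.random.rand(8, 64)
B = np.random.rand(64, 4)
assert np.allclose(tiled_matmul(A, B), A @ B)
```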
- the unified memory 906 is used to store input data and output data.
- the weight data is directly transferred to the weight memory 902 through a direct memory access controller (DMAC) 905.
- the input data is also transferred to the unified memory 906 through the DMAC.
- the bus interface unit (BIU) 910, also referred to as a data interface in this application, is used for the interaction between the DMAC and the instruction fetch buffer 909; the bus interface unit 910 is also used by the instruction fetch buffer 909 to obtain instructions from an external memory, and is further used by the storage unit access controller 905 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
- the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 906, to transfer the weight data to the weight memory 902, or to transfer the input data to the input memory 901.
- the vector calculation unit 907 includes a plurality of arithmetic processing units and, when necessary, performs further processing on the output of the arithmetic circuit 903, such as vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison.
- the vector calculation unit 907 is mainly used for the computation of non-convolutional layers or fully connected (FC) layers in a neural network, and can specifically handle computations such as pooling and normalization.
- for example, the vector calculation unit 907 may apply a non-linear function to the output of the arithmetic circuit 903, for example to a vector of accumulated values, to generate activation values.
- in some implementations, the vector calculation unit 907 generates normalized values, merged values, or both.
- in some implementations, the vector calculation unit 907 stores the processed vector in the unified memory 906.
- in some implementations, the vector processed by the vector calculation unit 907 can be used as an activation input to the arithmetic circuit 903, for example for use in a subsequent layer of the neural network; as shown in FIG. 1, if the current processing layer is hidden layer 1 (131), the vector processed by the vector calculation unit 907 can also be used in the computation of hidden layer 2 (132).
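The kind of post-processing handled by the vector calculation unit can be sketched briefly: applying a non-linear activation to an accumulator output and then a simple average pooling. The shapes, the ReLU choice, the 2x2 pooling window, and the function name are illustrative assumptions.

```python
import numpy as np

def vector_unit_postprocess(acc_out):
    """acc_out: 2-D array of accumulated matmul results."""
    act = np.maximum(acc_out, 0.0)                                   # non-linear activation (ReLU)
    h, w = act.shape
    act = act[:h - h % 2, :w - w % 2]                                # crop so 2x2 pooling divides evenly
    pooled = act.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))     # 2x2 average pooling
    return pooled
```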
- an instruction fetch buffer 909 connected to the controller 904 is used to store instructions used by the controller 904;
- the unified memory 906, the input memory 901, the weight memory 902, and the instruction fetch buffer 909 are all on-chip memories.
- the external memory is independent of the NPU hardware architecture.
- the operations of the layers in the convolutional neural network shown in FIG. 1 may be performed by the arithmetic circuit 903 or the vector calculation unit 907.
- An embodiment of the present application further provides a computing device, where the computing device includes the attribute recognition apparatus shown in FIG. 8 or FIG. 9.
- An embodiment of the present application further provides a computer storage medium storing computer software instructions.
- when the computer software instructions are executed by a computer, the computer is caused to perform the object attribute recognition method provided in FIG. 2 or FIG. 5.
- An embodiment of the present application further provides a computer program, where the computer program includes computer software instructions that, when executed by a computer, cause the computer to perform the object attribute recognition method provided in FIG. 2 or FIG. 5.
- a person of ordinary skill in the art may understand that all or part of the processes in the foregoing method embodiments may be completed by a computer program instructing related hardware.
- the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the foregoing method embodiments.
- the foregoing storage media include various media that can store program code, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
人工智能领域的计算机视觉技术领域中一种对象属性的识别方法、装置、计算设备及系统,该方法包括:计算设备根据M个姿态关键点在第一图像中提取M个部位的特征,得到M个部位特征图(S1),进而,将所述M个部位特征图输入第一属性识别模型,得到目标对象的第一属性识别结果(S2)。其中,第一图像为原始图像或根据原始图像提取得到的原始特征图,该原始图像包括目标对象,目标对象包括M个部位,M个姿态关键点与M个部位一一对应,M个部位与M个部位特征图一一对应。在第一属性识别模型对目标对象进行属性识别之前,将第一图像拆解出与目标对象姿势无关的M个部位特征图,进而克服目标对象的姿势对识别结果的影响,使得对对象的属性识别更加准确,且鲁棒性好。
Description
本发明涉及图像处理技术领域,尤其涉及一种对象属性识别方法、装置、计算设备及系统。
随着图像识别技术的快速发展,很多应用中采用了各种识别技术,例如,人脸识别技术和对象的属性识别技术。特别是在基于行人的检索领域,行人的属性识别至关重要。
早期的行人属性识别一般都是靠人工设计的特征,再基于支持向量机(SVM)进行分类。但是手工设计的特征很难处理实际监控场景下的各种复杂条件以及行人自身的各种变化,如姿态、视角等。最近的基于深度卷积神经网络的属性识别模型通常是基于整体图像,即,将行人图像输入到深度卷积神经网络中提取行人属性共享的特征,再对每一个属性单独学习属性分类器。然而,行人的图像通常具有各式各样的姿势,比如图像中行人有的是站着的、有的是坐着的、有的是骑自行车等,刚性的深度卷积神经网络很难克服行人姿态的变化,对行人属性的识别不准确、鲁棒性差。
发明内容
本发明实施例所要解决的技术问题在于,提供一种对象属性的识别方法、装置、计算设备及系统,基于M个姿态关键点确定的M个部位特征图进行属性识别,能克服目标对象的姿势对识别结果的影响,使得对对象的属性识别更加准确。
第一方面,本发明实施例提供了一种对象属性的识别方法,该方法包括:计算设备根据M个姿态关键点在第一图像中提取M个部位的特征,得到M个部位特征图,进而,将所述M个部位特征图输入第一属性识别模型,得到目标对象的第一属性识别结果。其中,第一图像为原始图像或根据原始图像提取得到的原始特征图,该原始图像包括目标对象,目标对象包括M个部位,M个姿态关键点与M个部位一一对应,M个部位与M个部位特征图一一对应;姿态关键点用于确定姿态关键点对应部件的位置,M为正整数。
通过执行上述方法,在第一属性识别模型对目标对象进行属性识别之前,根据M个姿态关键点在第一图像中提取M个部位分别对应的部位特征图,即将第一图像拆解出与目标对象的姿势无关的M个部位特征图,将与目标对象的姿势无关的M个部位特征图输入到第一属性识别模型进行模型训练和识别,进而克服目标对象的姿势对识别结果的影响,使得对对象属性的识别更加准确,且鲁棒性好。
在一种可能的实现方式中,计算设备根据M个姿态关键点在第一图像中提取M个部位的特征,得到M个部位特征图的一种实现方式可以是:
计算设备将第一图像输入到部位定位模型,得到M个姿态关键点分别对应部位的定位参数,第一姿态关键点对应部位的定位参数用于在第一图像中确定第一姿态关键点对应的部位所在的区域;根据所述M个姿态关键点分别对应部位的定位参数,通过插值采样从所述第一 图像中提取出所述M个部位分别对应的M个部位特征图。其中,第一姿态关键点为M个姿态关键点中任意一个姿态关键点。
通过执行上述方法,部件定位模型根据姿态关键点来确定部位,实现无论目标对象处于何种姿势,都能准确的定位到目标对象的各个部位,进而提取到各个部位对应的部位特征图,使得第一属性识别模型可以基于部位特征图实现对目标对象的属性识别。
在又一种可能的实现方式中,姿态关键点k对应部位k的定位参数为第一位置坐标与第二位置坐标之间的仿射变换参数,所述第一位置坐标为所述部位k在所述第一图像中的位置坐标,所述第二位置坐标为所述部位k对应的部位特征图中的位置坐标,所述部位k对应的部位特征图的通过下述公式计算:
其中,k是部位的索引,k为正整数且k≤M;F为所述第一图像;V
k为所述部位k对应的部位特征图;i为所述部位k对应的部位特征图中坐标位置的索引;H为所述第一图像的高;W为所述第一图像的宽;
为所述部位k对应的部位特征图中的坐标位置i经过仿射变换后在所述第一图像中的归一化坐标;(m,n)为所述第一图像中的坐标位置。
通过执行上述方法,通过仿射变换参数在第一图像中确定第一姿态关键点对应的部位所在的区域,通过插值采样确定部位特征图,以实现从第一图像中提取部件特征图。
在又一种可能的实现方式中,第一属性识别模型包括M个深度特征提取模型以及基于区域特征学习模型,其中,M个深度特征提取模型与M个部位一一对应,计算设备将所述M个部位特征图输入到第一属性识别模型,得到目标对象的第一属性识别结果的一种实现方式可以是:
计算设备将所述M个部位特征图分别输入到M个深度特征提取模型,得到M个深度部位特征图,其中,所述M个深度部位特征图与所述M个部位一一对应,第一部位对应的深度特征提取模型用于从所述第一部位对应的部位特征图中提取所述第一部位对应的深度部位特征图,所述第一部位为所述M个部位中任意一个部位;
将提取得到的所述M个部位分别对应的深度特征图进行拼接;
将拼接后的深度特征图输入到所述基于区域特征学习模型,得到所述目标对象的第一属性识别结果。
在又一种可能的实现方式中,该方法还可以包括:计算设备将所述第一图像输入到第二属性识别模型,识别出所述目标对象的第二属性识别结果;进而,根据所述第一识别结果和所述第二识别结果,计算所述目标对象的第三识别结果,其中,所述第三识别结果计算方法为:Y3=αY1+βY2;α、β为大于0的常数,Y1为所述第一属性识别结果,Y2为所述第二属性识别结果。
通过执行上述方法,将基于全局(即第一图像)的第二属性识别结果和基于局部(即M个部位特征图)的第一属性识别结果进行融合,以使得到的第三属性识别结果同时考虑了全 局和局部的影响,进一步提高属性识别的准确性和鲁棒性。
第二方面,本申请实施例还提供了一种属性识别装置,该装置包括用于执行第一方面或第一方面的任一种可能实现方式所提供的对象属性识别方法的模块或单元。
第三方面,本申请实施例还提供了一种计算设备,该计算设备包括处理器和耦合所述处理器的存储器,所述存储器用于存储程序代码,所述处理器用于调用所述存储器存储的程序代码执行第一方面或第一方面的任一种可能实现方式所提供的对象属性识别方法。
第四方面,本申请实施例还提供了一种计算机存储介质,所述计算机存储介质用于计算机软件指令,所述计算机软件指令当被计算机执行时使所述计算机执行如第一方面所述的任意一种对象属性识别方法。
第五方面,本申请实施例还提供了一种计算机程序,所述计算机程序包括计算机软件指令,所述计算机软件指令当被计算机执行时使所述计算机执行如第一方面所述的任意一种对象属性识别方法。
第六方面,本申请实施例还提供一种芯片,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,执行第一方面或第一方面的任一种可能实现方式所提供的对象属性识别方法。
可选地,作为一种实现方式,所述芯片还可以包括存储器,所述存储器中存储有指令,所述处理器用于执行所述存储器上存储的指令,当所述指令被执行时,所述处理器用于执行第一方面中或第一方面的任一种可能实现方式所提供的对象属性识别方法。
第七方面,提供一种计算设备,该计算设备包括上述第二方面中的属性识别装置。
为了更清楚地说明本发明实施例或背景技术中的技术方案,下面将对本发明实施例或背景技术中所需要使用的附图进行说明。
图1是本申请实施例提供的一种卷积神经网络的框架示意图;
图2是本申请实施例提供的一种对象属性识别系统框架示意图;
图3是本申请实施例提供的一种对象属性识别方法的流程示意图;
图4是本申请实施例提供的另一种对象属性识别系统框架示意图;
图5是本申请实施例提供的另一种对象属性识别方法的流程示意图;
图6是本发明实施例提供的一种属性识别装置的结构示意图;
图7是本发明实施例提供的另一种属性识别装置的结构示意图;
图8是本申请实施例提供的又一种计算设备的结构示意图;
图9是本申请实施例提供的一种芯片的硬件结构示意图。
下面对本发明各个实施例涉及的相关概念进行简要介绍:
卷积神经网络(convolutional neural network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
如图1所示,卷积神经网络(CNN)100可以包括输入层(input layer)110,卷积层(convolutional layer)/激活层(activation layer)/池化层(pooling layer)120,以及全连接层(fully connected layer)130。其中,激活层和池化层都为可选的。卷积神经网络100可以包括多个卷积层,任意一个卷积层后可以连接一个激活层和/或池化层。如图1所示121层为卷积层,122层为池化层,123层为卷积层,124层为池化层,125为卷积层,126为池化层;在另一种实现方式中,121、122为卷积层,123为池化层,124、125为卷积层,126为池化层。卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
输入层110主要对输入的图像进行预处理,其中,包括去均值和归一化等。
下面将以卷积层121为例,介绍一层卷积层的内部工作原理。
卷积层121可以包括很多个卷积核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积核本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用多个尺寸(行×列)相同的权重矩阵,即多个同型矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度,这里的维度可以理解为由上面所述的“多个”来决定。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化等。该多个权重矩阵尺寸(行×列)相同,经过该多个尺寸相同的权重矩阵提取后的特征图的尺寸也相同,再将提取到的多个尺寸相同的特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重 值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网络100进行正确的预测。
当卷积神经网络100有多个卷积层的时候,初始的卷积层(例如121)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络100深度的加深,越往后的卷积层(例如126)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
激活层:
在一个卷积层之后可以应用一个激活层,以将非线性因素引入到模型中,增加了模型和整个卷积神经网络的非线性属性。激活函数可以包括Tanh函数、ReLU函数、Leaky ReLU函数、Maxout函数等。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,在如图1中120所示例的121-126各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值作为平均池化的结果。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像尺寸相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
全连接层130:
在经过卷积层/激活层/池化层120的处理后,卷积神经网络100还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层120只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络100需要利用全连接层130来生成一个或者一组所需要的类的数量的输出。因此,在全连接层130中可以包括多层隐含层(如图1所示的131、132至13n)以及输出层140,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到。本申请实施例中,对于的部位定位模型来说,该任务类型为高层的属性识别和姿态关键点回归;对于第一属性识别模型或第二属性识别模型来说,该任务类型为高层的属性识别。
在全连接层130中的多层隐含层之后,也就是整个卷积神经网络100的最后层为输出层140,该输出层140具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络100的前向传播(如图1由110至140方向的传播为前向传播)完成,反向传播(如图1由140至110方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络100的损失,及卷积神经网络100通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图1所示的卷积神经网络100仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在。
底层特征,直接提取自原始图像的特征。
中层特征,介于底层特征和语义特征之间的,经过卷积层/池化层提取得到的,为卷积神 经网络中的某一层的特征。
语义特征,有直接的语义含义的,或者直接和语义相关的特征,本申请实施例中称为属性。
支持向量机(support vector machine,SVM)是与相关的学习算法有关的监督学习模型,可以分析数据,识别模式,用于模式识别、分类和回归分析等。
下面结合本发明实施例中的附图对本发明实施例进行描述。
请参阅图2,图2是本申请实施例提供的一种对象属性识别系统框架示意图。该对象属性识别系统可以包括:第一属性识别模型、第二属性识别模型、部位定位模型、部位特征图提取模块等。其中:
第一图像为待识别图像,可以是原始图像或根据原始图像提取得到的原始特征图,其中,原始图像包括目标对象,目标对象包括M个部位。可选地,原始特征图是原始图像经过第二属性识别模型的一个或多个卷积层/池化层提取得到的中层特征。本发明实施例以第一图像为原始特征图为例来说明,可以理解,本申请实施例也可以不包括第二属性识别模型,第一图像为原始图像。
部位定位模型可以是卷积神经网络,用于根据输入的第一图像得到M个部位的定位参数,通常包括输入层、一个或多个卷积层、一个或多个池化层、全连接层等。该部位的定位参数用于确定该姿态关键点对应的部位在第一图像中的区域。部位的定位参数可以是仿射变换参数,包括平移参数和变换参数,该平移参数包括水平平移参数和垂直平移参数,水平平移参数和垂直平移参数确定的坐标即为通过部位定位模型得到的姿态关键点在第一图像中的位置坐标。
将第一图像输入到部位定位模型,得到M个姿态关键点以及该M个姿态关键点分别对应部位的定位参数。可以理解,部位定位模型输出M组定位参数。每一组定位参数用于确定一个部位。
部位特征图提取模块用于根据输入的M组定位参数和第一图像,在第一图像中确定M个姿态关键点分别对应部位所在的区域,得到M个部位特征图,M个部位与M个部位特征图一一对应。本申请实施例的一种具体实现中,将M定位参数输入到部位特征图提取模块,部位特征图提取模块通过插值采样从第一图像中提取出M个部位分别对应的M个部位特征图。
第一属性识别模型用于从输入到该模型的M个部位特征图中提取目标对象的L个属性中每个属性的第一属性识别结果,M、L为正整数。
第二属性识别模型用于从输入到该模型的原始图像提取目标对象的L个属性中每个属性的第二属性识别结果。第二属性识别模型可以是卷积神经网络,可以包括输入层、一个或多个卷积层、一个或多个池化层以及全连接层等组成。可以理解,第二属性识别模型是基于原始图像的整体进行属性识别。
在本申请的一种实现中,第一属性识别模型可以包括与M个深度特征提取模型、第一拼接模块以及基于区域特征学习模型。其中,M个深度特征提取模型与M个部位一一对应,部位j对应的深度特征提取模型用于从部位j对应的部位特征图中提取该部位j对应的深度部位特征图,j为部位的索引,j为正整数且j≤M。
深度特征提取模型可以包括一层或多层卷积层、一层或多层池化层、全连接层等,以从 输入的部位特征图中提取该部位特征图对应的部位的深度特征。例如,部位j对应的部位特征图输入到该部位j对应的深度特征提取模型中,以从部位j对应的部位特征图中提取部位j的深度部件特征图。
拼接模块对M个深度特征提取模型输出的M个部位分别对应的深度部件特征图进行拼接。拼接后的深度部件特征图输入到基于区域特征学习模型,以得到对象L个属性中每个属性的第一属性识别结果。该基于区域特征学习模型可以包括一个或多个卷积层、池化层、全连接层等。在本申请另一种实施例中,该基于区域特征学习模型也可以仅包括全连接层。
在本申请的另一种实现中,该第一属性识别系统还可以包括第二拼接模块,该第二拼接模块用于对M个部位特征图进行拼接。拼接后的M个部位特征图输入到第一属性识别模型。此时,该第一属性识别模型可以包括一层或多层卷积层、一层或多层池化层、全连接层等。第一属性识别模型从拼接后的M个部位特征图提取出对象的L个属性中每个属性的第一属性识别结果,可以理解,该第一属性识别模型是基于M个部位特征图的学习模型。
应理解,第一属性识别模型和第二属性识别模型得到的L个属性类别相同,但各个属性的识别结果不同。可选地,该属性识别系统还可以包括结果融合模块,用于将第一属性识别模型得到的对象的L个属性中每个属性的第一属性识别结果和第二属性识别模型得到的对象的L个属性中每个属性的第二属性识别结果进行融合,计算得到L个属性中每个属性的第三属性识别结果。进一步地,还可以将第三属性识别结果通过Sigmoid函数转化为属性识别概率,以指示属性的预测概率。
在执行本申请实施例所述的对象属性识别方法之前,各个模型为训练好的模型,下面介绍本申请各个模型的训练方法:
本申请一实施例中,第一属性识别模型、部位定位模型可以一起训练。其中,在部位定位模型中,不同姿态关键点相关的区域可以共享前端的特征学习网络,并学习各自相关的区域的仿射变换参数。特别指出,在部位定位模型的训练过程中,我们通过两个任务来对部位定位模型进行监督训练,一个是高层的属性识别,另外一个是姿态关键点回归。高层的属性识别,可以采用交叉熵进行优化。在优化的过程中,梯度信息从后端的基于区域的特征学习模型,经过M个深度特征提取模型,最后传到部位定位模型。姿态关键点回归,可以采用欧式损失。在优化过程中,梯度信息直接传到部位定位模型。最后,我们利用来自属性识别优化目标的梯度信息和来自姿态关键点回归优化目标的梯度信息,来对部位定位模型进行参数更新。需要说明的是,姿态关键点回归的损失为了更好地让对每一个姿态关键点都学习各自姿态关键点相关的部位区域。
可以理解,在本申请的另一实施例中,第一属性识别模型、部位定位模型、第二属性识别模型可以单独训练。其中,在部位定位模型的训练过程中,我们通过部位的定位参数来对部位定位模型进行监督训练;在第一属性识别模型或第二属性识别模型的训练过程中,通过对属性的识别来对第一属性识别模型或第二属性识别模型进行监督训练,不同的是,第一属性识别模型和第二属性识别模型的样本数据不同。第一属性识别模型是基于M个部位特征图来训练,该M和部位特征图是基于第一图像输入到训练得到的部位定位模型得到的M个部位的定位参数在第一图像上采集得到的;而,第二属性识别模型是基于原始图像或第一图像来训练。
需要说明的是,上述各个模型或模块可以在一个计算设备中执行,也可以分布在多个计算设备中执行,比如分布式云计算系统。本申请不作限定。计算设备可以是终端设备、也可 以是服务器。终端设备可以是手机、台式计算机、便携式计算机、平板电脑或其他包括可执行本申请中对象属性识别方法的中部分或全部流程的电子设备,本申请不作限定。
需要说明的是,上述各个模型或模块的具体功能实现可以参照下述模型训练方法或对象属性识别方法实施例中相关描述,本申请实施例不再赘述。
第一属性识别模型、第二属性识别模型、部位定位模型等可以是神经网络、卷积神经网络、支持向量机等机器学习模型,本发送实施例不作限定。
本申请实施例所述的对象属性识别系统,可以应用于基于属性的对象的检索、分析等领域。例如,行人属性识别利用计算机视觉技术对行人图像进行智能分析,进而判断出该行人的各种细粒度属性,比如性别、年龄、衣服颜色和类型、背包等,进一步地,应用于行人基于属性描述的行人检索等,以快速查找到该行人。
下面结合图2对象属性识别系统框架图及图3所示的对象属性识别方法的流程示意图对本申请实施例中对象属性识别方法进行描述。本申请中对象属性识别方法的执行主体可以是计算设备、属性识别装置、计算设备中处理器或分布式计算机系统等,本申请实施例以计算设备为例来说明,该对象属性识别方法可以包括如下步骤:
S1:计算设备根据M个姿态关键点在第一图像中提取M个部位的特征,得到M个部位特征图。其中,第一图像为原始图像或根据原始图像提取得到的原始特征图,原始图像包括目标对象,目标对象包括M个部位,M个姿态关键点与M个部位一一对应,M个部位与M个部位特征图一一对应,M为正整数。
S2:将所述M个部位特征图输入到第一属性识别模型,以得到所述目标对象第一属性识别结果。
其中,第一图像可以是原始图像,该原始图像包括目标对象,目标对象可以是人、动物或物体,比如汽车、自行车等,本申请实施例不作限定。第一图像也可以是提取原始图像的中层特征所得到的原始特征图。姿态关键点为第一图像中目标对象上的位置点,用于确定目标对象中该姿态关键点对应部位的位置。可以理解,原始图像中目标对象、部位、姿态关键点可以映射到原始特征图。
部位定位是指从一个包括对象(比如行人)整体的特征图(本申请中称为第一图像)中,提取一个行人的部位区域,比如头部区域或脚部区域。在本发明中,部位是和行人姿态关键点相关的一个区域,可以基于第一图像中姿态关键点确定目标对象的姿态,以及确定该姿态关键点对应的部位的位置。姿态关键点为第一图像上的一个位置点。以目标对象为人为例,在原始图像中姿态关键点的物理意义可以是人体骨骼关键点,例如,右肩、右肘、右腕、左肩、左肘、左腕、右髋、右膝、右踝、左髋、左膝、左踝、头顶、脖子等。比如,对于姿态关键点为右踝,部位定位是以右踝这个关键点为中心,把整个脚部的区域给找出来,目的是为了更好地识别脚部相关的属性。由于行人的姿态不断变化,基于姿态关键点来确定行人的部位更加精确。
本申请中以目标对象为人为例,该姿态关键点可以为右踝,其对应的部位为右脚;又例如,姿态关键点为左腕,其对应的部位为左手;又例如,关键点为左肘,其对应的部位为左手臂等。可以理解的是,对象还可以是汽车,部位还可以是车轮、车窗、车门等,本申请实施例不作限定。
第一属性识别结果包括对该目标对象的多个属性的识别结果,具体包括多个属性中每个 属性的预测得分。例如L个属性,L为正整数,第一属性为L个属性中任意一种属性,通过第一属性的预测得分可以映射得到该第一属性的属性识别结果。在本申请的另一实施例中,该预测得分可以概率值,用于指示目标对象包括第一属性的概率,例如,目标对象为女的概率。
其中,L为设定的值。对象的属性为具有语义的特征,例如,目标对象为行人,目标对象的属性可以是性别(男、女)、年龄(比如青少年、中年、老年)、种族(汉族,维族)、身材(胖,瘦,标准)、上衣款式(短袖,长袖)、上衣颜色(黑,红,蓝,绿,白,黄)等。
可选地,原始特征图可以表示为:F=f
low(I),其中,I是输入的原始图像,f
low是中层的卷积特征提取器,该中层的卷积特征提取器由一层或多层卷积层、激励层、池化层等组成,用于从原始图像中提取该原始图像的中层特征。该中层的卷积特征提取器可以是第二属性识别模型中一个或多个卷积层、池化层。
本申请实施例以第一图像为原始特征图,即从原始图像I通过卷积特征提取器提取的中层特征为例来说明,可以理解,第一图像还可以是原始图像本身,本申请实施例不作限定。
本发明实施例中,第一属性识别模型基于M个姿态关键点确定的M个部位特征图进行属性识别,能克服目标对象的姿势对识别结果的影响,使得对对象的属性识别更加准确,且鲁棒性好。
请一并参阅图2、图3,计算设备根据M个姿态关键点在第一图像中提取M个部位的特征,得到M个部位特征图,即步骤S1的一种现实方式可以包括如下步骤:
S11:将第一图像输入到部位定位模型,得到M个姿态关键点分别对应部位的定位参数,其中,第一姿态关键点为M个姿态关键点中任意一个姿态关键点,第一姿态关键点对应部位的定位参数用于在原始特征图中确定所述第一姿态关键点对应的部位所在的区域。
其中,M为预先设定的数值,比如14,M个姿态关键点可以是14个人体骨骼关键点。
其中,部位定位模型表示为:θ=f
regression(F),其中,θ为M个部位的定位参数,可以表示为(θ
1,θ
2,…,θ
k,…,θ
M);部位定位模型可以由卷积层,激励层、池化层和全连接层等组成。θ
1
本申请一实施例中,姿态关键点k对应部位k,
姿态关键点k对应部位k的定位参数为第一位置坐标与第二位置坐标之间的仿射变换参数,第一位置坐标为部位k在所述第一图像中的位置坐标,第二位置坐标为部位k对应的部位特征图中的位置坐标,第一位置坐标与第二位置坐标之间变换关系为:
其中,k是部位的索引,i是部位k对应的部位特征图中坐标位置的索引,
为部位k对应的部位特征图中位置坐标i的归一化坐标,
是
经过仿射变换后在第一图像中的归一化坐标,
是部位k的定位参数,即第一位置坐标与第二位置坐标之间的仿射变换参。其中,
分别为水平平移参数和垂直平移参数,也是部位k对应的姿态关键点坐标。
为变换参数。
S12:根据所述M个姿态关键点分别对应部位的定位参数,通过插值采样从第一图像中 提取出M个部位分别对应的M个部位特征图。
确定部位k对应的部位特征图的方法为:
其中,k是部位的索引,k为正整数且k≤M;F为第一图像;和V
k为部位k对应的部位特征图;i为部位k对应的部位特征图中坐标位置的索引;H为第一图像的高,为第一图像纵向像素点的个数;W为第一图像的宽,为第一图像横向像素点的个数;(m,n)为第一图像中的坐标位置,
部位k对应的部位特征图中的坐标位置i经过仿射变换后在第一图像中的归一化坐标。
目标对象包括M个姿态关键点,也即M个部位,因此通过上述步骤S12可生成M个部位特征图,即V
1到V
M。
请一并参阅图2、图3,步骤S2,即计算设备提将M个部位特征图输入到第一属性识别模型,得到目标对象的第一属性识别结果,也即步骤S2,可以包括但不限于以下两种实现方式。
第一实现方式:
第一属性识别模型的架构可以如图2所示的第一属性识别模型,可以包括与M个部位一一对应的M个深度特征提取模型、第一拼接模块以及基于区域特征学习模型。深度特征提取模型可以包括一层或多层卷积层、一层或多层池化层、全连接层等,以从输入的部位特征图中提取该部位特征图对应的部位的深度特征。
具体地,将M个部位特征图分别输入到M个深度特征提取模型,得到M个深度部位特征图。其中,M个深度部位特征图与M个部位一一对应,部位j对应的深度特征提取模型用于从部位j对应的部位特征图中提取部位j对应的深度部位特征图,j为部位的索引,j为正整数且j≤M;计算设备将将提取得到的该M个深度部位特征图进行拼接,并将拼接后的M个深度部位特征图输入到基于区域特征学习模型,以得到目标对象的第一属性识别结果。
可以理解,每一个深度特征提取模型可以包括一层或多层卷积层、一层或多层池化层,以及全连接层等组成,用于从输入的部位特征图中提取该部位特征图对应的部位的深度特征。在得到M个深度部位特征图后,第一拼接模块对M个深度部位特征图进行拼接可以采用横向拼接或纵向拼接,本申请实施例以纵向拼接为例来说明。第一属性识别模型可以通过单独训练得到,即该M个深度特征提取模型和基于区域特征学习模型可以作为一个整体进行训练。通过训练可以确定各个深度部位特征图的权重,将拼接后的深度部位特征图输入到训练好的基于区域特征学习模型,得到目标对象的第一属性识别结果。
例如,目标对象的第一属性识别结果,包括L个属性的识别结果,属性j的识别结果可以表示为:
Y1
j=W
j
T[f
local-1(V
1),f
local-2(V
2),…,f
local-M(V
M)]
其中,j是属性的索引,j为正整数,j≤L,Y1
j为目标对象的属性j的识别结果。f
local-k代表部位k的深度特征提取模型。f
local-k(V
k)为部位k对应的部位特征图通过部位k对应的 深度特征提取模型提取的深度部位特征图。W
j
T为权重矩阵,通过训练得到,用于表示M个深度部位特征图的权重。
第二实现方式:
如图4所示本申请实施例提供的另一种对象属性识别系统的框架示意图,该对象属性识别系统包括部位定位模型、部件特征图提取模块、第二拼接模块以及第一属性识别模型。其中:
部位定位模型用于确定M个部件的定位参数,为上述方法或系统实施例所述的任意一种部位定位模型,其具体实现可以参见上述图2或图3中相关描述,本申请实施例不再赘述。
部件特征图提取模块用于根据M个姿态关键点分别对应部位的定位参数,通过插值采样从第一图像中提取出M个部位分别对应的M个部位特征图。其具体实现可以参见上述图1或图2中相关描述,本申请实施例不再赘述。
该第二拼接模块用于对M个部位特征进行拼接。
该第一属性识别模型可以包括一层或多层卷积层、一层或多层池化层、全连接层、输出层等。将拼接后的M个部位特征图输入到第一属性识别模型。第一属性识别模型从拼接后的M个部位特征图提取出目标对象的第一属性识别结果。第一属性识别模型可以通过单独训练得到。将拼接后的M个部位特征图输入到第一属性识别模型,可以得到目标对象的第一属性识别结果。
如图5所示,图5是本申请实施例提供的另一种对象属性识别方法的流程示意图,该对象属性识别方法除包括如图3所述的步骤S1、S2外,还可以包括如下步骤:
S3:将第一图像输入到第二属性识别模型,识别出目标对象的第二属性识别结果。
第二属性识别结果包括对该目标对象的多个属性的识别结果,具体包括多个属性中每个属性的预测得分。例如包括L个属性,L为正整数,第一属性为L个属性中任意一种属性,通过第一属性的预测得分可以映射得到该第一属性的识别结果。在本申请的另一实施例中,该预测得分可以概率值,用于指示目标对象包括第一属性的概率,例如,目标对象为女的概率。
第二属性识别模型用于根据输入到该模型的第一图像提取目标对象的第二属性识别结果。第二属性识别模型可以是卷积神经网络,可以包括输入层,一个或多个卷积层、激活层、池化层,以及全连接层等组成。可以理解,第二属性识别模型是基于包括目标对象的第一图像的整体进行属性识别。在本申请实施例的一种具体实现中,输入到第二属性识别模型的第一图像为包括对象的原始图像,输入到部位定位模型的原始特征图为通过第二属性识别模型的一个或多个卷积层提取的中层特征。
例如,目标对象的第二属性识别结果,包括L个属性的识别结果,属性j的识别结果可以表示为:
S4:根据第一属性识别结果和第二属性识别结果计算该目标对象的第三属性识别结果。
具体的,对象属性j的第三属性识别结果可以是对象属性j的第一属性识别结果与对象属性j第二属性识别结果的线性相加,j为属性的索引,j为正整数,j≤L,即:
Y3
j=αY1
j+βY2
j
其中,α、β为大于0的常数。
其中,Y1
j为通过第一属性识别模型得到的属性j的识别结果,Y2
j为通过第二属性识别模型得到的属性j的识别结果,Y3
j为目标对象属性j的第三属性识别结果,α、β为大于0的常数。可选地,α=0.8、β=0.5,α=1、β=1,或α、β为其他数值,本申请实施例不作限定。
可选地,对象属性j的第三属性识别结果可以是对象属性j的第一属性识别结果与对象属性j第二属性识别结果的加权求和,即α+β=1。
步骤S3、S4与步骤S1、S2的可以以任意次序执行,即步骤S3、S4可以在步骤S1或S2之前执行,可以在步骤S1或S2之后执行,也可以与步骤S1或S2同时执行,本申请实施例不作限定。
本申请实施例中,在给定一张需要测试的第一图像后,对于目标对象的每一个属性,分别通过基于第一图像的各个部位的第一属性识别模型和基于第一图像的全局的第二属性识别模型,得到该属性的第一属性识别结果和第二属性识别结果,进而,将第一属性识别结果和第二属性识别结果进行加权求和,得到该属性的第三属性识别结果,将该第三属性识别结果作为该属性的最终得分,提高对象属性识别的准确率。
进一步地,可以将第一属性识别结果、第二属性识别结果或第三属性识别结果转换为属性的预测概率。
例如,将第三属性识别结果通过Sigmoid函数转化为属性识别概率,以指示属性的预测概率。
其中,j为属性的索引,j为正整数,j≤L。P
j为属性j的预测概率,Y3
j为对象属性j的第三属性识别结果。
例如,预测得到对象的年龄为中年的概率为0.88,少年的概率为0.21、老年的概率为0.1。
下面介绍本申请实施例涉及的相关装置。
如图6所示的属性识别装置,该属性识别装置60可以包括部位特征提取单元601和第一属性识别单元602,其中:
部位特征提取单元601,用于根据M个姿态关键点在第一图像中提取M个部位的特征,得到M个部位特征图,其中,所述第一图像为原始图像或根据原始图像提取得到的原始特征图,所述原始图像包括目标对象,所述目标对象包括所述M个部位,所述M个姿态关键点与所述M个部位一一对应,所述M个部位与所述M个部位特征图一一对应;所述姿态关键点用于确定所述姿态关键点对应部件的位置,M为正整数;
第一属性识别单元602,用于将所述M个部位特征图输入第一属性识别模型,得到所述目标对象的第一属性识别结果。
在本申请的一种实现中,所述部位特征提取单元601,具体用于:
将所述第一图像输入到部位定位模型,得到M个姿态关键点分别对应部位的定位参数,第一姿态关键点对应部位的定位参数用于在所述第一图像中确定所述第一姿态关键点对应的部位所在的区域;其中,所述第一姿态关键点为所述M个姿态关键点中任意一个姿态关键点;
根据所述M个姿态关键点分别对应部位的定位参数,通过插值采样从所述第一图像中提取出所述M个部位分别对应的M个部位特征图。
在本申请的一种实现中,姿态关键点k对应部位k的定位参数为第一位置坐标与第二位 置坐标之间的仿射变换参数,所述第一位置坐标为所述部位k在所述第一图像中的位置坐标,所述第二位置坐标为所述部位k对应的部位特征图中的位置坐标,所述部位k对应的部位特征图的通过下述公式计算:
其中,k是部位的索引,k为正整数且k≤M;F为所述第一图像;V
k为所述部位k对应的部位特征图;i为所述部位k对应的部位特征图中坐标位置的索引;H为所述第一图像的高;W为所述第一图像的宽;
为所述部位k对应的部位特征图中的坐标位置i经过仿射变换后在所述第一图像中的归一化坐标;(m,n)为所述第一图像中的坐标位置。
在本申请的一种实现中,所述第一属性识别模型包括M个深度特征提取模型以及基于区域特征学习模型,其中,所述M个深度特征提取模型与所述部位一一对应,所述第一属性识别单元602具体用于:
将所述M个部位特征图分别输入到M个深度特征提取模型,得到M个深度部位特征图,其中,所述M个深度部位特征图与所述M个部位一一对应,第一部位对应的深度特征提取模型用于从所述第一部位对应的部位特征图中提取所述第一部位对应的深度部位特征图,所述第一部位为所述M个部位中任意一个部位;
将提取得到的所述M个深度部位特征图进行拼接;
将拼接后的深度部位特征图输入到所述基于区域特征学习模型,得到所述目标对象的第一属性识别结果。
如图7所示的属性识别装置,在本申请的一种实现中,该属性识别装置70除包括上述部位特征提取单元601和第一属性识别单元602,还可以包括:第二属性识别单元603,用于:将所述第一图像输入到第二属性识别模型,识别出所述目标对象的第二属性识别结果;
属性融合单元604,用于根据所述第一识别结果和所述第二识别结果,计算所述目标对象的第三识别结果,其中,所述第三识别结果计算方法为:Y3=αY1+βY2;α、β为大于0的常数,Y1为所述第一属性识别结果,Y2为所述第二属性识别结果。
需要说明的是,各个单元的实现还可以对应参照方法实施例的相应描述,本申请实施例不再赘述。
请参阅图8,图8是本申请实施例提供的又一种计算设备的结构示意图,该计算设备可以包括但不限于处理器801和存储器802,处理器通过总线803连接到存储器802。
存储器802可以是只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)或其他存储器。本申请实施例中,存储器802用于存储数据,例如本申请实施例中原始图像、原始特征图、部位特征图或深度特征图等数据,以及各种软件程序,例如本申请中对象属性识别程序等。
可选地,计算设备80还可以包括至少一个通信接口804,该通信接口804用于实现计算设备80与终端、服务器或其他计算设备等之间的数据交换。
处理器801可以是中央处理单元(Central Processing Unit,CPU),该处理器801还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
处理器801用于调用存储器存储的数据和程序代码执行:
根据M个姿态关键点在第一图像中提取M个部位的特征,得到M个部位特征图,其中,所述第一图像为原始图像或根据原始图像提取得到的原始特征图,所述原始图像包括目标对象,所述目标对象包括所述M个部位,所述M个姿态关键点与所述M个部位一一对应,所述M个部位与所述M个部位特征图一一对应;所述姿态关键点用于确定所述姿态关键点对应部件的位置,M为正整数;
将所述M个部位特征图输入第一属性识别模型,得到所述目标对象的第一属性识别结果。
在本申请的一种实现中,所述处理器801执行根据M个姿态关键点在第一图像中提取M个部位的特征,得到M个部位特征图,具体包括执行:
将所述第一图像输入到部位定位模型,得到M个姿态关键点分别对应部位的定位参数,第一姿态关键点对应部位的定位参数用于在所述第一图像中确定所述第一姿态关键点对应的部位所在的区域;其中,所述第一姿态关键点为所述M个姿态关键点中任意一个姿态关键点;
根据所述M个姿态关键点分别对应部位的定位参数,通过插值采样从所述第一图像中提取出所述M个部位分别对应的M个部位特征图。
在本申请的一种实现中,
姿态关键点k对应部位k的定位参数为第一位置坐标与第二位置坐标之间的仿射变换参数,所述第一位置坐标为所述部位k在所述第一图像中的位置坐标,所述第二位置坐标为所述部位k对应的部位特征图中的位置坐标,所述部位k对应的部位特征图的通过下述公式计算:
其中,k是部位的索引,k为正整数且k≤M;F为所述第一图像;V
k为所述部位k对应的部位特征图;i为所述部位k对应的部位特征图中坐标位置的索引;H为所述第一图像的高;W为所述第一图像的宽;
为所述部位k对应的部位特征图中的坐标位置i经过仿射变换后在所述第一图像中的归一化坐标;(m,n)为所述第一图像中的坐标位置。
在本申请的一种实现中,所述第一属性识别模型包括M个深度特征提取模型以及基于区域特征学习模型,其中,所述M个深度特征提取模型与所述部位一一对应,所述处理器801 执行所述将所述M个部位特征图输入到第一属性识别模型,得到所述目标对象的第一属性识别结果,包括执行:
将所述M个部位特征图分别输入到M个深度特征提取模型,得到M个深度部位特征图,其中,所述M个深度部位特征图与所述M个部位一一对应,第一部位对应的深度特征提取模型用于从所述第一部位对应的部位特征图中提取所述第一部位对应的深度部位特征图,所述第一部位为所述M个部位中任意一个部位;
将提取得到的所述M个深度部位特征图进行拼接;
将拼接后的深度部位特征图输入到所述基于区域特征学习模型,得到所述目标对象的第一属性识别结果。
在本申请的一种实现中,所述处理器801还用于执行:
将所述第一图像输入到第二属性识别模型,识别出所述目标对象的第二属性识别结果;
根据所述第一识别结果和所述第二识别结果,计算所述目标对象的的第三识别结果,其中,所述第三识别结果计算方法为:Y3=αY1+βY2;α、β为大于0的常数,Y1为所述第一属性识别结果,Y2为所述第二属性识别结果。
需要说明的是,各个器件的实现还可以对应参照上述方法实施例中的相应描述,本申请实施例不再赘述。
下面介绍本申请实施例提供的一种芯片硬件结构。
图9为本发明实施例提供的一种芯片硬件结构,该芯片包括神经网络处理器90。该芯片可以被设置在如图6、图7所示的属性识别装置中,用以属性识别装置中各个单元的计算工作。该芯片也可以被设置在如图8所示的计算设备80中,用以完成计算设备的对象属性识别并输出第一属性识别结果和第二属性识别结果。如图1所示的卷积神经网络中各层的算法均可在如图9所示的芯片中得以实现。
神经网络处理器90可以是NPU,TPU,或者GPU等一切适合用于大规模异或运算处理的处理器。以NPU为例:NPU可以作为协处理器挂载到主CPU(Host CPU)上,由主CPU为其分配任务。NPU的核心部分为运算电路903,通过控制器904控制运算电路903提取存储器(901和902)中的矩阵数据并进行乘加运算。
在一些实现中,运算电路903内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路903是二维脉动阵列。运算电路903还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路903是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路903从权重存储器902中取矩阵B的权重数据,并缓存在运算电路903中的每一个PE上。运算电路903从输入存储器901中取矩阵A的输入数据,根据矩阵A的输入数据与矩阵B的权重数据进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)908中。
统一存储器906用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(DMAC,Direct Memory Access Controller)905,被搬运到权重存储器902中。输入数据也通过DMAC被搬运到统一存储器906中。
总线接口单元(BIU,Bus Interface Unit)910,本申请中也称数据接口,用于DMAC和取指存储器(Instruction Fetch Buffer)909的交互;总线接口单元910还用于取指存储器909 从外部存储器获取指令;总线接口单元910还用于存储单元访问控制器905从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器906中,或将权重数据搬运到权重存储器902中,或将输入数据搬运到输入存储器901中。
向量计算单元907多个运算处理单元,在需要的情况下,对运算电路903的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。向量计算单元907主要用于神经网络中非卷积层,或全连接层(FC,fully connected layers)的计算,具体可以处理:Pooling(池化),Normalization(归一化)等的计算。例如,向量计算单元907可以将非线性函数应用到运算电路903的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元907生成归一化的值、合并值,或二者均有。
在一些实现中,向量计算单元907将经处理的向量存储到统一存储器906。在一些实现中,经向量计算单元907处理过的向量能够用作运算电路903的激活输入,例如用于神经网络中后续层中的使用,如图2所示,若当前处理层是隐含层1(131),则经向量计算单元907处理过的向量还可以被用到隐含层2(132)中的计算。
控制器904连接的取指存储器(instruction fetch buffer)909,用于存储控制器904使用的指令;
统一存储器906,输入存储器901,权重存储器902以及取指存储器909均为On-Chip存储器。外部存储器独立于该NPU硬件架构。
其中,图1所示的卷积神经网络中各层的运算可以由运算电路903或向量计算单元907执行。
本申请实施例还提供了一种计算设备,该计算设备包括上述图8或图9所示的属性识别装置。
本申请实施例还提供了一种计算机存储介质,所述计算机存储介质用于计算机软件指令,所述计算机软件指令当被计算机执行时使所述计算机执行如图2或图5所提供的对象属性识别方法。
本申请实施例还提供了一种计算机程序,所述计算机程序包括计算机软件指令,所述计算机软件指令当被计算机执行时使所述计算机执行如图2或图5所提供的对象属性识别方法。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,该流程可以由计算机程序来指令相关的硬件完成,该程序可存储于计算机可读取存储介质中,该程序在执行时,可包括如上述各方法实施例的流程。而前述的存储介质包括:ROM或随机存储记忆体RAM、磁碟或者光盘等各种可存储程序代码的介质。
Claims (14)
- 一种对象属性识别方法,其特征在于,所述方法包括:根据M个姿态关键点在第一图像中提取M个部位的特征,得到M个部位特征图,其中,所述第一图像为原始图像或根据原始图像提取得到的原始特征图,所述原始图像包括目标对象,所述目标对象包括所述M个部位,所述M个姿态关键点与所述M个部位一一对应,所述M个部位与所述M个部位特征图一一对应;所述姿态关键点用于确定所述姿态关键点对应部件的位置,M为正整数;将所述M个部位特征图输入第一属性识别模型,得到所述目标对象的第一属性识别结果。
- 如权利要求1所述的对象属性识别方法,其特征在于,所述根据M个姿态关键点在第一图像中提取M个部位的特征,得到M个部位特征图,包括:将所述第一图像输入到部位定位模型,得到M个姿态关键点分别对应部位的定位参数,第一姿态关键点对应部位的定位参数用于在所述第一图像中确定所述第一姿态关键点对应的部位所在的区域;其中,所述第一姿态关键点为所述M个姿态关键点中任意一个姿态关键点;根据所述M个姿态关键点分别对应部位的定位参数,通过插值采样从所述第一图像中提取出所述M个部位分别对应的M个部位特征图。
- 如权利要求2所述的对象属性识别方法,其特征在于,姿态关键点k对应部位k的定位参数为第一位置坐标与第二位置坐标之间的仿射变换参数,所述第一位置坐标为所述部位k在所述第一图像中的位置坐标,所述第二位置坐标为所述部位k对应的部位特征图中的位置坐标,所述部位k对应的部位特征图的通过下述公式计算:
- 如权利要求2-4任意一项权利要求所述的对象属性识别方法,其特征在于,所述第一属性识别模型包括M个深度特征提取模型以及基于区域特征学习模型,其中,所述M个深度特征提取模型与所述部位一一对应,所述将所述M个部位特征图输入到第一属性识别模型, 得到所述目标对象的第一属性识别结果,包括:将所述M个部位特征图分别输入到M个深度特征提取模型,得到M个深度部位特征图,其中,所述M个深度部位特征图与所述M个部位一一对应,第一部位对应的深度特征提取模型用于从所述第一部位对应的部位特征图中提取所述第一部位对应的深度部位特征图,所述第一部位为所述M个部位中任意一个部位;将提取得到的所述M个深度部位特征图进行拼接;将拼接后的深度部位特征图输入到所述基于区域特征学习模型,得到所述目标对象的第一属性识别结果。
- 如权利要求1-5任意一项权利要求所述的对象属性识别方法,其特征在于,所述方法还包括:将所述第一图像输入到第二属性识别模型,识别出所述目标对象的第二属性识别结果;根据所述第一识别结果和所述第二识别结果,计算所述目标对象的的第三识别结果,其中,所述第三识别结果计算方法为:Y3=αY1+βY2;α、β为大于0的常数,Y1为所述第一属性识别结果,Y2为所述第二属性识别结果。
- 一种属性识别装置,其特征在于,所述属性识别装置包括:部位特征提取单元,用于根据M个姿态关键点在第一图像中提取M个部位的特征,得到M个部位特征图,其中,所述第一图像为原始图像或根据原始图像提取得到的原始特征图,所述原始图像包括目标对象,所述目标对象包括所述M个部位,所述M个姿态关键点与所述M个部位一一对应,所述M个部位与所述M个部位特征图一一对应;所述姿态关键点用于确定所述姿态关键点对应部件的位置,M为正整数;第一属性识别单元,用于将所述M个部位特征图输入第一属性识别模型,得到所述目标对象的第一属性识别结果。
- 如权利要求7所述的属性识别装置,其特征在于,所述部位特征提取单元具体用于:将所述第一图像输入到部位定位模型,得到M个姿态关键点分别对应部位的定位参数,第一姿态关键点对应部位的定位参数用于在所述第一图像中确定所述第一姿态关键点对应的部位所在的区域;其中,所述第一姿态关键点为所述M个姿态关键点中任意一个姿态关键点;根据所述M个姿态关键点分别对应部位的定位参数,通过插值采样从所述第一图像中提取出所述M个部位分别对应的M个部位特征图。
- 如权利要求8所述的属性识别装置,其特征在于,姿态关键点k对应部位k的定位参数为第一位置坐标与第二位置坐标之间的仿射变换参数,所述第一位置坐标为所述部位k在所述第一图像中的位置坐标,所述第二位置坐标为所述部位k对应的部位特征图中的位置坐标,所述部位k对应的部位特征图的通过下述公式计算:
- 如权利要求8-10任意一项权利要求所述的属性识别装置,其特征在于,所述第一属性识别模型包括M个深度特征提取模型以及基于区域特征学习模型,其中,所述M个深度特征提取模型与所述部位一一对应,所述第一属性识别单元具体用于:将所述M个部位特征图分别输入到M个深度特征提取模型,得到M个深度部位特征图,其中,所述M个深度部位特征图与所述M个部位一一对应,第一部位对应的深度特征提取模型用于从所述第一部位对应的部位特征图中提取所述第一部位对应的深度部位特征图,所述第一部位为所述M个部位中任意一个部位;将提取得到的所述M个深度部位特征图进行拼接;将拼接后的深度部位特征图输入到所述基于区域特征学习模型,得到所述目标对象的第一属性识别结果。
- 权利要求7-11任意一项权利要求所述的属性识别装置,其特征在于,所述属性识别装置还包括:第二属性识别单元,用于:将所述第一图像输入到第二属性识别模型,识别出所述目标对象的第二属性识别结果;属性融合单元,用于根据所述第一识别结果和所述第二识别结果,计算所述目标对象的第三识别结果,其中,所述第三识别结果计算方法为:Y3=αY1+βY2;α、β为大于0的常数,Y1为所述第一属性识别结果,Y2为所述第二属性识别结果。
- 一种计算设备,其特征在于,所述计算设备包括处理器和耦合所述处理器的存储器,所述存储器用于数据和程序代码,所述处理器用于调用所述存储器存储的程序代码执行如权利要求1-6任意一项权利要求所述的对象属性识别方法。
- 一种计算机存储介质,其特征在于,所述计算机存储介质用于计算机软件指令,所述计算机软件指令当被计算机执行时使所述计算机执行如权利要求1-5中任一权利要求所述的对象属性识别方法。
Applications Claiming Priority (2)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810810453.9A (patent CN109902548B) | 2018-07-20 | 2018-07-20 | 一种对象属性识别方法、装置、计算设备及系统 |
| CN201810810453.9 | 2018-07-20 | | |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| WO2020015752A1 (zh) | 2020-01-23 |
Family
ID=66943070
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2019/096873 (WO2020015752A1) | 一种对象属性识别方法、装置、计算设备及系统 | 2018-07-20 | 2019-07-19 |

Country Status (2)

| Country | Link |
|---|---|
| CN (1) | CN109902548B (zh) |
| WO (1) | WO2020015752A1 (zh) |
Also Published As

| Publication number | Publication date |
|---|---|
| CN109902548B (zh) | 2022-05-31 |
| CN109902548A (zh) | 2019-06-18 |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19837033; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 19837033; Country of ref document: EP; Kind code of ref document: A1 |