WO2023246921A1 - Target attribute recognition method and apparatus, and model training method and apparatus - Google Patents

Target attribute recognition method and apparatus, and model training method and apparatus Download PDF

Info

Publication number
WO2023246921A1
WO2023246921A1 (PCT/CN2023/101952, priority CN2023101952W)
Authority
WO
WIPO (PCT)
Prior art keywords
target
image
mask
attribute recognition
attributes
Prior art date
Application number
PCT/CN2023/101952
Other languages
French (fr)
Chinese (zh)
Inventor
Liu Xianbin (刘宪彬)
An Zhanfu (安占福)
Original Assignee
BOE Technology Group Co., Ltd. (京东方科技集团股份有限公司)
Priority date
Filing date
Publication date
Application filed by BOE Technology Group Co., Ltd. (京东方科技集团股份有限公司)
Publication of WO2023246921A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image

Definitions

  • the present application relates to the field of computer vision, and in particular to a target attribute recognition method, model training method and device.
  • This application provides a target attribute recognition method, training method and device.
  • a target attribute identification method which specifically includes:
  • according to the target mask, the target attribute recognition model is used to perform a mask operation on the image to be recognized and obtain a target mask image;
  • the target attribute recognition model is used to perform target attribute recognition, and the attributes of the target in the image to be recognized are output, where the attributes include multi-label attributes of the target.
  • a preset target attribute recognition model to perform target recognition on the received image to be recognized, and outputting the target mask further includes:
  • a second feature map is obtained by pixel space alignment based on a segmentation algorithm
  • the target attribute recognition model is used to perform area detection on the second feature map and output a target mask.
  • the target attribute recognition model includes a feature extraction network, a first feature map pyramid network and a region generation network;
  • the use of the target attribute recognition model to extract features from the image to be recognized and output the first feature map further includes:
  • the method of using the target attribute recognition model to perform region detection on the first feature map and outputting a plurality of region filtering frames further includes: using the region generation network to perform region detection on the first feature map according to preset anchor frames and outputting a plurality of region filtering frames.
  • the target attribute recognition model includes a mask prediction branch, a regression prediction branch and a classification prediction branch;
  • Using the target attribute recognition model to perform region detection on the second feature map and outputting a target mask further includes:
  • the step of using the target attribute recognition model to perform a masking operation on the image to be recognized and obtaining the target mask image according to the target mask further includes: multiplying the target mask with the image to be recognized to obtain the target mask image;
  • the step of using the target attribute recognition model to perform target attribute recognition according to the target mask image and outputting the attributes of the target in the image to be recognized further includes: using the corresponding attribute recognition model in the target attribute recognition model, according to the target classification, to perform target attribute recognition on the target mask image and output the attributes of the target in the image to be recognized.
  • the attribute recognition model is a multi-task multi-label classification model.
  • using the target attribute recognition model to perform a masking operation on the image to be recognized and obtaining the target mask image further includes: multiplying the output target frame with the image to be recognized to obtain a target frame mask image, and then multiplying the target mask with the target frame mask image to obtain the target mask image;
  • the step of using the target attribute recognition model to perform target attribute recognition according to the target mask image and outputting the attributes of the target in the image to be recognized further includes: using the corresponding attribute recognition model in the target attribute recognition model, according to the target classification, to perform target attribute recognition on the target mask image and output the attributes of the target in the image to be recognized.
  • the attribute recognition model is a multi-task multi-label classification model.
  • the feature extraction network is one of a VGG network, a GoogLeNet network, a ResNet network, and a ResNeXt network.
  • according to the pedestrian mask, the pedestrian attribute recognition model is used to perform a mask operation on the image to be recognized and obtain a pedestrian mask image;
  • the pedestrian attribute recognition model is used to perform pedestrian attribute recognition, and the attributes of the pedestrian in the image to be recognized are output, where the attributes include multi-label attributes of the pedestrian.
  • the multi-label attributes include at least three of gender attributes, headgear attributes, hairstyle attributes, clothing attributes, clothing color attributes, accessories attributes, occlusion attributes, truncation attributes and orientation attributes.
  • a model training method including:
  • the target attribute recognition model includes a mask prediction branch, a regression prediction branch and a classification prediction branch, as well as a multi-label classification loss function,
  • losses of the mask prediction branch, the regression prediction branch and the classification prediction branch are calculated through the preset loss functions to adjust the model parameters;
  • the model parameters of the target attribute recognition model are adjusted through the multi-label classification loss function.
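As an illustrative sketch (not taken from the patent), the two-stage training objective described above, per-branch losses plus a multi-label classification loss, might be combined as follows; the function names, smooth-L1 regression loss, and the unweighted sum are all assumptions:

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability/label pair."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def smooth_l1(pred, target):
    """Smooth-L1 loss, commonly used for box regression."""
    d = abs(pred - target)
    return 0.5 * d * d if d < 1.0 else d - 0.5

def total_loss(cls_prob, cls_label, box_pred, box_target,
               mask_probs, mask_labels, attr_probs, attr_labels):
    # Branch losses: classification, regression, per-pixel mask BCE.
    l_cls = bce(cls_prob, cls_label)
    l_reg = sum(smooth_l1(p, t) for p, t in zip(box_pred, box_target))
    l_mask = sum(bce(p, y) for p, y in zip(mask_probs, mask_labels)) / len(mask_probs)
    # Multi-label attribute loss: independent BCE per attribute label.
    l_attr = sum(bce(p, y) for p, y in zip(attr_probs, attr_labels)) / len(attr_probs)
    return l_cls + l_reg + l_mask + l_attr
```

In practice each term would carry a tuned weight; the plain sum here only illustrates the structure of the objective.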
  • a target attribute identification device including:
  • a target mask acquisition unit used to perform target recognition on the received image to be recognized, and output a target mask, where the target mask is obtained by pixel space alignment based on a segmentation algorithm;
  • a target mask image acquisition unit configured to perform a masking operation on the image to be recognized according to the target mask and acquire the target mask image
  • a target attribute recognition unit is configured to perform target attribute recognition on the target mask image, and output attributes of the target in the image to be recognized, where the attributes include multi-label attributes of the target.
  • a pedestrian attribute recognition device including:
  • a pedestrian mask acquisition unit is used to perform pedestrian recognition on the received image to be recognized, and output a pedestrian mask, which is obtained by pixel space alignment based on a segmentation algorithm;
  • a pedestrian mask image acquisition unit configured to perform a masking operation on the image to be recognized according to the pedestrian mask and obtain a pedestrian mask image
  • a pedestrian attribute recognition unit is configured to perform pedestrian attribute recognition on the pedestrian mask image and output attributes of the pedestrian in the image to be recognized, where the attributes include multi-label attributes of the pedestrian.
  • a model training device including:
  • the labeling unit is used to obtain multiple sample recognition images and label the targets of each sample recognition image according to the pixel space alignment;
  • the training unit is used to perform target recognition training on the target attribute recognition model using multiple labeled sample recognition images.
  • a computer-readable storage medium is provided with a computer program stored thereon
  • the program when executed by the processor, implements a method as described in one aspect
  • the program when executed by the processor implements a method as described in another aspect
  • the program when executed by the processor, implements a method as described in yet another aspect.
  • a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method described in any of the above aspects.
  • Figure 1 shows a flow chart of a target attribute identification method according to an embodiment of the present application
  • Figure 2 shows a block diagram of a target attribute identification method according to another embodiment of the present application
  • Figure 3 shows a schematic diagram of an anchor frame according to an embodiment of the present application
  • Figure 4 shows a schematic diagram of the target mask and the target mask image according to an embodiment of the present application
  • Figure 5 shows a schematic diagram of an image to be recognized and target attributes according to an embodiment of the present application
  • Figure 6 shows a structural diagram of a target attribute identification device according to another embodiment of the present application.
  • Figure 7 shows a structural diagram of a pedestrian attribute recognition device according to another embodiment of the present application.
  • Figure 8 shows a structural diagram of a model training device according to another embodiment of the present application.
  • Figure 9 shows a schematic structural diagram of a computer device according to another embodiment of the present application.
  • For example, the YOLACT algorithm is used to filter out pedestrian attribute background information and splice feature maps of different sizes for multi-task network prediction, and a gradient-weighted loss function is used to train the model;
  • another example uses human posture key points to obtain the human body area: the extracted detail key points are combined with shallow features, the extracted human body area is combined with deep features, the combined data and the deep features are respectively input into a regional guidance module to obtain multiple prediction vectors, and the multiple prediction vectors are fused to obtain the final prediction result.
  • the above methods all require additional key point detection, a step that demands high computing power from the device and increases the corresponding processing time.
  • one embodiment of the present application provides a target attribute identification method, which is implemented based on a segmentation algorithm.
  • the method includes:
  • according to the target mask, the target attribute recognition model is used to perform a mask operation on the image to be recognized and obtain a target mask image;
  • the target attribute recognition model is used to perform target attribute recognition, and the attributes of the target in the image to be recognized are output, where the attributes include multi-label attributes of the target.
  • Compared with identification methods that use additional key points, the segmentation-based target attribute identification method of the embodiment of the present application bypasses the key-point processing step, reducing the performance requirements on the hardware and shortening the identification time. It can also filter out non-target areas to the greatest extent and perform attribute recognition through the target mask image, which avoids environmental interference with attribute recognition, significantly improves recognition speed and accuracy, and enables rapid filtering and assisted search, greatly improving work efficiency, with broad application prospects.
  • the image to be recognized 100 is read, target recognition 200 is performed, and a target mask 300 is output.
  • A feature map (Feature Map) is the result of convolving the input image with a neural network; its resolution depends on the stride of the preceding convolutions.
  • Region detection 230, that is, using the Region Proposal Network (RPN) to extract candidate frames for "region selection" and output multiple region filtering frames 240; regional feature matching 250, which outputs the second feature map 260; and then performing region detection again.
  • RPN Region Proposal Network
  • an image 100 to be recognized is input into a preset backbone convolutional neural network (Backbone Convolutional Neural Networks, Backbone CNN) that has completed training.
  • the backbone convolutional neural network is mainly used to extract feature maps of the image 100 to be recognized for use by subsequent networks.
  • the feature extraction network is a VGG network, a GoogLeNet network, a ResNet network, or a ResNeXt network.
  • Feature extraction is performed on the image to be identified through one of the above feature extraction networks.
  • VGG Visual Geometry Group
  • a deep convolutional neural network constructed from a series of small 3x3 convolution kernels and pooling layers, which has the characteristics of a simple structure and strong applicability.
  • the convolution block is called the Inception block.
  • the Inception block is equivalent to a sub-network with 4 paths.
  • Information is extracted in parallel through convolution layers and max-pooling layers with different window shapes, and 1 × 1 convolutional layers are used to reduce the channel dimension at the per-pixel level, thereby reducing model complexity.
  • the ResNeXt network adopts both the stacking idea of the VGG network and the split-transform-merge idea of the inception block, which has stronger scalability and basically does not change or reduce the complexity of the model while increasing the accuracy.
  • the ResNet network is a residual learning structure proposed to address the problem that deeper neural networks are difficult to train. It increases the depth of the network while reducing the number of parameters, and is widely used in detection, segmentation, recognition and other fields.
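The residual learning idea can be sketched in a few lines: a block learns a residual mapping and adds its input back, so an identity mapping is trivially available even in very deep stacks. This is a hypothetical minimal illustration, not the patent's network:

```python
def residual_block(x, transform):
    """Residual learning: output transform(x) + x, so the block only
    needs to learn the residual between its input and the target mapping."""
    fx = transform(x)  # the learned residual branch
    return [a + b for a, b in zip(fx, x)]  # skip connection adds the input back
```

When the residual branch outputs zeros, the block degenerates to the identity, which is why adding such blocks does not make a deep network harder to train.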
  • the feature extraction network adopts ResNet50 network.
  • the ResNet50 network outputs multiple feature maps.
  • This embodiment of the present application uses a feature map pyramid network (Feature Pyramid Network, FPN) to fuse the feature maps output by the last three layers and output the feature map 220.
  • FPN Feature Pyramid Network
  • The Feature Pyramid Network is a top-down feature fusion method and a multi-scale target detection algorithm that uses more than one feature prediction layer. Feature maps from multiple stages are fused together so that not only the semantic features of the high-level feature maps but also the low-level contour features are extracted.
  • the ResNet50 network is used to extract feature maps of the image 100 to be recognized, and the FPN network is further used to perform feature fusion and form the first feature map 220.
  • fusing the feature maps extracted at the various stages of the ResNet50 network captures not only the semantic features of the high-level feature maps but also the low-level contour features, which solves the problem that smaller objects cannot be detected.
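A minimal sketch of one top-down FPN fusion step: the coarser (higher-level) map is upsampled and added to the laterally connected finer map. This pure-Python illustration uses nearest-neighbour upsampling on 2-D lists; a real FPN uses learned 1 × 1 lateral convolutions before the addition:

```python
def upsample2x(fm):
    """Nearest-neighbour 2x upsampling of a 2-D feature map (list of rows)."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out

def fpn_merge(top, lateral):
    """One top-down FPN step: upsample the coarser map, add the lateral map."""
    up = upsample2x(top)
    return [[up[i][j] + lateral[i][j] for j in range(len(lateral[0]))]
            for i in range(len(lateral))]
```

Chaining `fpn_merge` from the deepest stage down to the shallowest yields the fused pyramid levels described above.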
  • the first feature map 220 is then input into the RPN network to perform region detection 230, thereby extracting the region filtering frames 240.
  • the first feature map 220 is subjected to a 3 × 3 convolution operation to obtain a feature map with 256 channels, whose size is the same as that of the first feature map 220.
  • the feature map with 256 channels is regarded as H × W vectors, each of 256 dimensions. Two fully connected operations are performed on each vector, yielding 2 scores and 4 coordinates respectively, which is equivalent to performing two 1 × 1 convolutions on the 256-channel feature map, resulting in feature maps of size 2 × H × W and 4 × H × W.
  • the 2 × H × W feature map, that is, 2 confidence levels, represents the scores of the foreground and the background. Because the RPN network is only responsible for extracting the region filtering frames 240 and does not need to judge the category of items in the image 100 to be recognized, the foreground and background confidences are used to determine whether a region contains an item;
  • the 4 × H × W feature map, that is, 4 coordinates, represents the offset coordinates (x, y, w, h).
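The equivalence between per-vector fully connected operations and 1 × 1 convolutions can be sketched as a per-pixel linear map over channels. This is an illustrative pure-Python version with tiny channel counts (the actual RPN head maps 256 input channels to 2 score channels and 4 coordinate channels):

```python
def conv1x1(fm, weights):
    """1x1 convolution: a per-pixel linear map from C_in to C_out channels.
    fm is laid out as [C_in][H][W]; weights as [C_out][C_in]."""
    c_in, h, w = len(fm), len(fm[0]), len(fm[0][0])
    return [[[sum(weights[o][c] * fm[c][i][j] for c in range(c_in))
              for j in range(w)] for i in range(h)]
            for o in range(len(weights))]
```

Applying one such map with 2 output channels and another with 4 output channels to the same H × W feature map produces exactly the 2 × H × W and 4 × H × W outputs described above.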
  • the offset coordinates are coordinates on the image 100 to be recognized. Since the image 100 to be recognized differs from the first feature map 220 in width and height, anchor points (Anchors) are introduced to obtain coordinates on the image 100 to be recognized. Specifically:
  • if the scaling ratio between the image 100 to be recognized and the first feature map 220 is 8:1, the mapped box is 8 × 8; the upper left corner or the center point of this box is set as the anchor point, and several anchor boxes (Anchor Boxes) are generated from this anchor point according to pre-configured rules.
  • the number of anchor frames is K, that is, each anchor point generates K frames.
  • the first feature map 220 includes H × W points, each corresponding to K frames on the image 100 to be recognized, so there are H × W × K candidate frames in total. Through the RPN, it is judged whether these frames contain objects and what their offset coordinates on the image 100 to be recognized are, yielding the region filtering frames 240.
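The H × W × K anchor enumeration can be sketched as follows; the scale and aspect-ratio defaults are illustrative, not values specified by the patent:

```python
def make_anchors(h, w, stride, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate K = len(scales) * len(ratios) anchor boxes (cx, cy, bw, bh)
    at every point of an H x W feature map, mapped back to image
    coordinates by the feature-map stride."""
    anchors = []
    for i in range(h):
        for j in range(w):
            # anchor point: center of the stride x stride cell on the image
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in scales:
                for r in ratios:
                    bw = s * stride * (r ** 0.5)   # width scaled by sqrt(ratio)
                    bh = s * stride / (r ** 0.5)   # height scaled inversely
                    anchors.append((cx, cy, bw, bh))
    return anchors
```

With 3 scales and 3 ratios, K = 9 and the total count is H × W × 9, matching the H × W × K candidate frames described above.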
  • ROI Pooling region of interest pooling
  • because ROI Pooling rounds down, it easily introduces errors and cannot guarantee that the feature layer corresponds exactly to the pixels of the input layer, which fails to meet the requirements of the semantic segmentation task. The embodiment of the present application therefore adopts the ROI Align method, which cancels the rounding operation and instead uses bilinear interpolation to obtain the pixel values at four fixed point coordinates, making the discontinuous operations continuous, effectively reducing error, realizing pixel space alignment, and completing the regional feature matching 250.
  • the embodiment of the present application uses the ROI Align method to perform regional feature matching 250 on the region filtering frames and output the second feature map 260; that is, pixel space alignment is performed based on the segmentation algorithm to obtain the second feature map 260, which identifies the precise coordinate pixel values of the object to be detected (the foreground object) in the image 100 to be recognized.
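The bilinear interpolation at the heart of ROI Align can be sketched as a single-point sampler; ROI Align samples several such real-valued points per output bin and averages them, with no rounding anywhere:

```python
def bilinear(img, y, x):
    """Bilinearly interpolate a 2-D map at real-valued coordinates (y, x),
    blending the four surrounding integer-grid pixels."""
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(img) - 1)
    x1 = min(x0 + 1, len(img[0]) - 1)
    dy, dx = y - y0, x - x0
    top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
    bot = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
    return top * (1 - dy) + bot * dy
```

Because the sampled coordinates are continuous, the mapping from feature map to ROI bin stays pixel-aligned, which is exactly the property ROI Pooling's rounding destroys.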
  • the embodiments of this application take into account the performance and accuracy requirements of the target attribute recognition model.
  • On the one hand, introducing the RPN network for region detection can significantly improve the detection speed and makes it easier to combine with other neural networks; on the other hand, using the ROI Align method achieves pixel space alignment, which can effectively reduce errors.
  • the region detection 270 is performed again and the target mask 300 is obtained. Specifically, it includes inputting the second feature map into three prediction branches respectively.
  • the second feature map 260 is introduced into the classification prediction branch to perform classification prediction and output target classification.
  • a softmax layer is connected after the classification prediction branch. The softmax layer receives an N-dimensional vector as input and converts the value of each dimension into a real number between (0, 1), mapping the output of the fully connected layer into a probability distribution; in the embodiment of the present application it is specifically used to implement foreground and background classification.
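The softmax mapping described above, converting an N-dimensional vector into a probability distribution over (0, 1), can be written directly:

```python
import math

def softmax(v):
    """Map an N-dimensional vector to a probability distribution in (0, 1)."""
    m = max(v)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]
```

For the two-class foreground/background case, the layer simply turns the two raw scores into complementary probabilities.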
  • the second feature map 260 is introduced into the regression prediction branch to perform regression prediction and output the target box. A bounding box regression layer (Bounding Box Regression, bbox reg) is connected, and the regression prediction obtains more accurate coordinate pixel values, namely the precise coordinates of the object to be detected (the foreground object) identified in the image 100 to be recognized.
  • the second feature map 260 is introduced into the mask prediction branch to perform mask prediction and output the target mask. A head layer (Head) is connected, which expands the output dimension of the second feature map 260 to increase mask prediction accuracy, and a fully convolutional network (FCN) operation is then performed in each ROI to generate the target mask 300 as shown in Figure 4.
  • FCN Fully Convolutional Network
  • the embodiment of this application obtains target classification, target box and target mask through three branch operations respectively.
  • the embodiment of the present application operates the three branches sequentially. For example, in the prediction stage, the classification prediction and regression prediction operations are performed first, and the obtained results are passed into the mask prediction branch to obtain the target mask quickly and accurately.
  • a mask operation 400 is performed using the target mask 300 and the image to be recognized 100, and a target mask image 500 is output.
  • the target mask 300 includes two elements, 0 and 1, where 0 represents black and 1 represents transparent.
  • the mask operation 400 generates a slice picture according to the target mask 300; that is, a multiplication operation is performed between the image to be recognized 100 and the target mask 300. A 0 in the target mask 300 sets the RGB value of the original picture to 0, while a 1 in the target mask 300 leaves the RGB value of the image 100 to be recognized unchanged.
  • the target mask image 500 is generated by segmenting the target to be measured from the image.
  • the target mask image 500 does not contain the background of the environment and can effectively reduce the noise caused by the environment.
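The mask operation, multiplying the 0/1 target mask into the RGB image so that background pixels become black while target pixels are kept, can be sketched as:

```python
def apply_mask(image, mask):
    """Multiply a binary mask (0 = black out, 1 = keep) into an RGB image.
    image is laid out as [H][W][3]; mask as [H][W] of 0/1 values."""
    return [[[c * mask[i][j] for c in image[i][j]]
             for j in range(len(mask[0]))]
            for i in range(len(mask))]
```

The result contains no environmental background, which is what lets the subsequent attribute classifier ignore environmental noise.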
  • the target mask image 500 is used to perform target attribute recognition 600, and the target attribute 700 is output.
  • a convolution operation is performed on the target mask image 500, and multi-task multi-label classification is performed through multi-layer convolution operations.
  • the recognition results are shown in Figure 5.
  • the embodiment of the present application uses the target attribute recognition method based on the segmentation algorithm to complete the target attribute recognition of the image 100 to be recognized, and output the target attribute 700.
  • the attribute recognition model performs target attribute recognition on the target mask image and outputs the attributes of the target in the image to be recognized.
  • the attribute recognition model is a multi-task multi-label classification model.
  • the target mask obtained by the mask prediction branch is multiplied with the image to be recognized to obtain the target mask image, and the attribute recognition model of the corresponding multi-task multi-label classification model is selected according to the target classification obtained by the classification prediction branch. Perform attribute recognition on the target mask image and output the attributes of the target.
  • for example, if the target is a vehicle, the attribute recognition model of the corresponding multi-task multi-label vehicle classification model is selected to perform attribute recognition on the target mask image and output the vehicle's attributes; if the target is a dog, the attribute recognition model of the corresponding dog classification model is selected to perform attribute recognition on the target mask image and output the dog's attributes; if the target is a pedestrian, the attribute recognition model of the corresponding multi-task multi-label pedestrian classification model is selected to perform attribute recognition on the target mask image and output the pedestrian's attributes.
  • the target attribute recognition model is used to perform a masking operation on the image to be recognized and obtain a target mask image.
  • the output target frame is first multiplied with the image to be recognized to obtain the target frame mask image, and the target mask is then multiplied with the target frame mask image to obtain the target mask image. The attribute recognition model is then used to perform target attribute recognition and output the attributes of the target in the image to be recognized; specifically, the corresponding attribute recognition model in the target attribute recognition model is used, according to the target classification, to perform target attribute recognition on the target mask image and output the attributes of the target in the image to be recognized. The attribute recognition model is a multi-task multi-label classification model.
  • the target frame obtained through the regression prediction branch is multiplied with the image to be recognized to obtain the target frame mask image, and the target mask obtained by the mask prediction branch is then multiplied with the target frame mask image to obtain the target mask image. The attribute recognition model of the corresponding multi-task multi-label classification model is selected, according to the target classification obtained by the classification prediction branch, to identify the attributes of the target mask image and output the attributes of the target, which can further improve the accuracy of obtaining the target mask image.
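A sketch of this two-step masking: a binary frame mask is built from the target box, then multiplied element-wise with the instance mask. The (x0, y0, x1, y1) box convention and list-based layout are illustrative choices, not from the patent:

```python
def box_mask(h, w, box):
    """Binary H x W mask that is 1 inside the (x0, y0, x1, y1) target frame."""
    x0, y0, x1, y1 = box
    return [[1 if x0 <= j < x1 and y0 <= i < y1 else 0 for j in range(w)]
            for i in range(h)]

def combine_masks(frame_mask, instance_mask):
    """Element-wise product of the frame mask and the instance mask, so a
    pixel survives only if it is inside the box AND on the segmented target."""
    return [[frame_mask[i][j] * instance_mask[i][j]
             for j in range(len(frame_mask[0]))]
            for i in range(len(frame_mask))]
```

Multiplying the combined mask into the image then yields the refined target mask image.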
  • the embodiment of the present application selects the ResNet50 network to extract feature maps of multiple stages of the image to be recognized 100, and further uses the FPN network to fuse the features of at least one stage together to form the first feature map 220. The features extracted at each stage of the ResNet50 network thus capture not only the semantic features of the high-level feature maps but also the low-level contour features, solving the problem that smaller objects cannot be detected. Meanwhile, the embodiment introduces the RPN network for region detection; since the RPN does not need to search all candidate frames, the detection speed is significantly improved and the network is easier to combine with other neural networks. The embodiment further uses ROI Align to achieve pixel space alignment, which can effectively reduce errors. The classification prediction and regression prediction operations are then performed, and the obtained results are passed into the mask prediction branch to obtain the target mask 300 quickly and accurately. The target mask image 500 does not contain the environmental background, which effectively reduces environmental noise, and a convolution operation is finally performed for attribute recognition.
  • the target attribute identification method of the embodiment of the present application can be further extended into a pedestrian attribute identification method based on a segmentation algorithm. The parts that are the same as or common to the first embodiment of this application will not be described again, and only the parts specific to pedestrian recognition are explained.
  • In the security field, as the number of scenes that need to be monitored grows, the density of people flow increases, and monitoring generally runs 7 × 24 hours, the amount of monitoring data surges. In this situation, relying solely on manpower for investigation is time-consuming and labor-intensive, and accuracy cannot be guaranteed, so there is an urgent need to use computer vision algorithms to complete automated monitoring and achieve rapid identification and accurate search.
  • pedestrian attributes are the most critical factor in the pedestrian recognition process.
  • Using computer vision, deep learning algorithms, and the flexibility and speed of convolutional neural networks, the image to be recognized is segmented to retain only the pedestrian area of interest, and pedestrian features are extracted from that area to complete the identification of pedestrian attributes, which can greatly improve work efficiency.
  • the second embodiment of the present application provides a method for identifying pedestrian attributes, which is implemented based on a segmentation algorithm.
  • the method includes:
  • according to the pedestrian mask, the pedestrian attribute recognition model is used to perform a mask operation on the image to be recognized and obtain a pedestrian mask image;
  • the pedestrian attribute recognition model is used to perform pedestrian attribute recognition, and the attributes of the pedestrian in the image to be recognized are output, where the attributes include multi-label attributes of the pedestrian.
  • the pedestrian attribute recognition method can use a preset pedestrian attribute recognition model to perform pedestrian recognition on the received image to be recognized and output a pedestrian mask, use the pedestrian mask to segment the pedestrian mask image from the image to be recognized, identify the attributes of the pedestrian from the pedestrian mask image, and finally output the multi-label attributes of the pedestrian, with a high degree of recognition and accuracy.
  • the multi-label attributes include at least three of gender attributes, headgear attributes, hairstyle attributes, clothing attributes, clothing color attributes, accessory attributes, occlusion attributes, truncation attributes and orientation attributes.
  • an image 100 to be recognized is obtained.
  • the source of the image to be recognized includes, but is not limited to, a frame of a video file or a frame of a surveillance video stream.
  • the image to be recognized 100 is input into a preset backbone convolutional neural network that has completed training. Taking into account both recognition speed and recognition accuracy, the ResNet50 network of the ResNet series is selected to extract feature maps of multiple stages, and the Feature Pyramid Network (FPN) is then introduced to fuse the feature maps of at least one stage together and output the first feature map.
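The core of FPN-style fusion is a top-down pathway: the coarse, semantically strong feature map from a deeper stage is upsampled and merged with a higher-resolution map from a shallower stage. The sketch below shows only that merge step in numpy; a real FPN (including the one implied here) also applies learned 1×1 lateral convolutions and 3×3 output convolutions, so this is a simplified illustration, not the embodiment's network:

```python
import numpy as np

def upsample2x(feat: np.ndarray) -> np.ndarray:
    """Nearest-neighbor 2x upsampling of an H x W feature map."""
    return feat.repeat(2, axis=0).repeat(2, axis=1)

def fpn_fuse(coarse: np.ndarray, fine: np.ndarray) -> np.ndarray:
    """Top-down fusion: upsample the deep (coarse) map and add the shallow (fine) one."""
    return upsample2x(coarse) + fine

coarse = np.array([[1.0, 2.0],
                   [3.0, 4.0]])   # deep stage, low resolution
fine = np.ones((4, 4))            # shallower stage, twice the resolution
fused = fpn_fuse(coarse, fine)
```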
  • the pedestrian mask image 500 does not contain the background of the environment and can effectively reduce the noise caused by the environment.
  • the pedestrian mask image 500 is subjected to multi-task multi-label classification and pedestrian attributes are output.
  • the recognition results are shown in Figure 5.
  • the pedestrian attributes include but are not limited to at least three of the gender attribute, headgear attribute, hairstyle attribute, clothing attribute, clothing color attribute, accessory attribute, occlusion attribute, truncation attribute and orientation attribute.
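Multi-task multi-label classification differs from ordinary single-label classification in that each attribute is predicted independently, typically with a per-attribute sigmoid rather than one softmax over all labels. A minimal sketch of the decision step follows; the attribute names, logit values, and the 0.5 threshold are illustrative assumptions:

```python
import math

ATTRIBUTES = ["male", "wearing_hat", "long_hair", "backpack", "occluded"]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def predict_attributes(logits: list[float], threshold: float = 0.5) -> list[str]:
    """Turn raw per-attribute logits into the list of attributes judged present."""
    return [name for name, z in zip(ATTRIBUTES, logits)
            if sigmoid(z) > threshold]

labels = predict_attributes([2.1, -1.3, 0.4, -0.2, 3.0])
```

Because every label gets its own independent probability, a pedestrian can simultaneously be, say, male, long-haired, and occluded, which is exactly the multi-label output described above.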
  • the third embodiment of this application provides a model training method, including:
  • the target attribute recognition model is trained through labeled sample recognition images. For example, labeled sample recognition images of pedestrians, vehicles, and dogs are input into the target attribute recognition model.
  • the target attribute recognition model performs feature extraction on the sample recognition image and outputs a first feature map; region detection is performed on the first feature map through the region detection model of the target attribute recognition model, which outputs a plurality of region filtering frames; regional feature matching is performed on the region filtering frames through the regional feature matching model of the target attribute recognition model, which outputs a second feature map; region detection is then performed on the second feature map through the region detection model of the target attribute recognition model, which outputs the target mask; and the target mask image is then obtained through the mask operation.
  • the target attribute recognition model includes a mask prediction branch, a regression prediction branch and a classification prediction branch, as well as a multi-label classification loss function.
  • training the target attribute recognition model for target recognition using multiple labeled sample recognition images further includes:
  • the mask prediction branch, regression prediction branch and classification prediction branch each perform calculation through a preset loss function and adjust the model parameters;
  • the model parameters of the attribute recognition model are adjusted through the multi-label classification loss function.
  • each prediction branch calculates its loss function to obtain a total loss value, and the loss value is judged against a preset first accuracy threshold until the first accuracy threshold is met; similarly, the loss value calculated by the multi-label classification loss function is judged against a preset second accuracy threshold until the second accuracy threshold is met.
  • the open source MS COCO data set is selected as the training set
  • the sample material with fine annotations of pedestrians in the Cityscapes data set is selected as the first test set
  • the sample materials manually organized and labeled from the backup video data of a security system that has been running for one year are used as the second test set.
  • the MS COCO data set is annotated with 80 different object categories, which makes it very suitable for training pedestrian detection models and for effectively distinguishing pedestrians from other related objects in the sample materials, such as cars, cats, dogs, trees and signs.
  • the model trained by the convolutional neural network using the training set is sensitive to the target resolution; especially for pedestrian detection problems, a low target recall rate occurs when a model trained on one data set is used to test pedestrians in another data set, and the image resolution of the MS COCO data set is inconsistent.
  • the input image is uniformly processed to a resolution of 1024×1024; on the premise of preserving the original aspect ratio of the sample image, the remaining parts are filled with zeros.
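This aspect-ratio-preserving resize with zero padding ("letterboxing") can be sketched as follows. The nearest-neighbor resize via index mapping is a simplification; a production pipeline would use bilinear interpolation from an imaging library, and the function name is an assumption:

```python
import numpy as np

def letterbox(image: np.ndarray, size: int = 1024) -> np.ndarray:
    """Scale so the longer side equals `size` (keeping aspect ratio),
    then zero-pad the rest to produce a size x size image."""
    h, w = image.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbor resize via index mapping.
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = image[rows][:, cols]
    canvas = np.zeros((size, size) + image.shape[2:], dtype=image.dtype)
    canvas[:new_h, :new_w] = resized  # zero padding fills the remainder
    return canvas

# A 512x256 image scales to 1024x512; the right half stays zero.
out = letterbox(np.ones((512, 256, 3), dtype=np.uint8), size=1024)
```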
  • the Cityscapes data set contains 5,000 sample images with fine annotations, and not all of the images contain pedestrians; therefore, in the embodiment of this application, 2,900 images containing pedestrians are screened out as the first test set for pedestrian detection testing.
  • the second test set was obtained from the backup video data of the real security system and was uniformly processed to a resolution of 1024×1024, with the relevant information manually annotated by relevant technical personnel; a total of 500 images were used as the second test set for testing.
  • the first test set and the second test set may also be mixed into a third test set, which will not be described again here.
  • the target attribute recognition model includes a mask prediction branch, a regression prediction branch and a classification prediction branch, as well as a multi-label classification loss function.
  • performing target recognition training on the target attribute recognition model using multiple labeled sample recognition images further includes:
  • the mask prediction branch selects the cross-entropy loss function
  • the regression prediction branch selects the smooth L1 loss function
  • the classification prediction branch selects the cross-entropy loss function
  • and the first accuracy threshold is set to calculate and adjust the model parameters.
  • the first accuracy threshold is set to 90%.
  • for the multi-label classification loss function, the cross-entropy loss function is selected; in the embodiment of this application, the second accuracy threshold is set to 90%, and model parameter adjustment is performed on the attribute recognition model accordingly.
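The loss functions named above, cross-entropy for the mask and classification branches and smooth L1 for the regression branch, can be written in scalar form as below. The unweighted sum and the sample numbers are illustrative assumptions; real implementations average over pixels/anchors and may weight the terms:

```python
import math

def binary_cross_entropy(p: float, y: float, eps: float = 1e-7) -> float:
    """Cross-entropy between a predicted probability p and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def smooth_l1(pred: float, target: float, beta: float = 1.0) -> float:
    """Smooth L1: quadratic near zero, linear for large errors."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

# Total loss as an (assumed) unweighted sum of the three branch losses.
mask_loss = binary_cross_entropy(0.9, 1.0)   # mask prediction branch
cls_loss = binary_cross_entropy(0.8, 1.0)    # classification prediction branch
box_loss = smooth_l1(10.5, 10.0)             # regression prediction branch
total = mask_loss + cls_loss + box_loss
```

Smooth L1 is the usual choice for box regression because it is less sensitive to outlier coordinates than a pure L2 loss while still being smooth near zero.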
  • the embodiments of this application make targeted designs in the selection of data sets, neural network architecture, and loss function selection.
  • an efficient and stable model can be trained, thereby realizing target attribute recognition or pedestrian attribute recognition based on the segmentation algorithm.
  • this application does not specifically limit other details of model training, such as the selection of initial parameters or GPU hardware; those skilled in the art may make selections based on actual application requirements, and these details will not be described again here.
  • this application also provides a target attribute identification device 700, which includes:
  • the target mask acquisition unit 701 is used to perform target recognition on the received image to be recognized, and output a target mask, which is obtained by pixel space alignment based on a segmentation algorithm;
  • the target mask image acquisition unit 702 is configured to perform a mask operation on the image to be recognized according to the target mask and acquire the target mask image;
  • the target attribute recognition unit 703 is configured to perform target attribute recognition on the target mask image, and output attributes of the target in the image to be recognized, where the attributes include multi-label attributes of the target.
  • this application also provides a pedestrian attribute recognition device 800, which includes:
  • the pedestrian mask acquisition unit 801 is used to perform pedestrian recognition on the received image to be recognized, and output a pedestrian mask, which is obtained by pixel space alignment based on a segmentation algorithm;
  • a pedestrian mask image acquisition unit 802 configured to perform a mask operation on the image to be recognized according to the pedestrian mask and obtain a pedestrian mask image
  • Pedestrian attribute recognition unit 803 is used to perform pedestrian attribute recognition on the pedestrian mask image and output the attributes of the pedestrian in the image to be recognized, where the attributes include multi-label attributes of the pedestrian.
  • this application also provides a model training device 900, which includes:
  • Annotation unit 901 is used to obtain multiple sample identification images and annotate the targets of each sample identification image according to the pixel space alignment;
  • the training unit 902 is used to perform target recognition training on the target attribute recognition model using multiple labeled sample recognition images.
  • Another embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the target attribute recognition method based on the segmentation algorithm, or the pedestrian attribute recognition method based on the segmentation algorithm, or the model training method.
  • the computer-readable storage medium may be any combination of one or more computer-readable media.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • the computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more conductors, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable storage medium may be any tangible medium containing or storing a program, which may be used by or in combination with an instruction execution system, apparatus or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus or device.
  • Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for performing the operations of the present application may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • FIG. 9 shows a schematic structural diagram of a computer device according to another embodiment of the present application.
  • the computer device 12 shown in FIG. 9 is only an example and should not bring any limitations to the functions and scope of use of the embodiments of the present application.
  • computer device 12 is embodied in the form of a general purpose computing device.
  • the components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, system memory 28, and a bus 18 connecting various system components, including system memory 28 and processing unit 16.
  • Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.
  • these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MAC) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
  • Computer device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by computer device 12, including volatile and nonvolatile media, removable and non-removable media.
  • System memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32.
  • Computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media.
  • storage system 34 may be used to read and write to non-removable, non-volatile magnetic media (not shown in Figure 9, commonly referred to as a "hard drive").
  • a disk drive for reading and writing to removable non-volatile magnetic disks (such as "floppy disks") may be provided, as well as an optical disk drive for reading and writing to removable non-volatile optical disks (such as CD-ROM, DVD-ROM or other optical media).
  • each drive may be connected to bus 18 through one or more data media interfaces.
  • the memory 28 may include at least one program product having a set (eg, at least one) of program modules configured to perform the functions of various embodiments of the present application.
  • a program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in memory 28; each of these examples or some combination thereof may include an implementation of a network environment.
  • Program modules 42 generally perform functions and/or methods in the embodiments described herein.
  • Computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any device (e.g., network card, modem, etc.) that enables computer device 12 to communicate with one or more other computing devices. Such communication may occur through input/output (I/O) interface 22.
  • computer device 12 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through network adapter 20. As shown in FIG. 9, network adapter 20 communicates with other modules of computer device 12 via bus 18.
  • the processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example implementing the target attribute recognition method based on the segmentation algorithm, or the pedestrian attribute recognition method based on the segmentation algorithm, or the model training method.

Abstract

The present application discloses a target attribute recognition method and apparatus, and a model training method and apparatus. The target attribute recognition method comprises: using a preset target attribute recognition model to perform target recognition on a received image to be recognized, and outputting a target mask; using the target mask to segment a target mask image from the image to be recognized; and performing target attribute recognition on the target mask image, and finally outputting a multi-label attribute of a target. According to the present application, a region unrelated to the target is filtered by means of a segmentation algorithm, and attribute recognition is performed by means of a pedestrian mask image, so that interference caused by the environment can be avoided, and the recognition speed and accuracy are remarkably improved. In particular, in the security field, according to a pedestrian attribute recognition method in an embodiment of the present application, automated monitoring can be implemented, rapid filtering and assisted search are effectively realized, the working efficiency is improved, and the present application thus has a wide application prospect.

Description

Target attribute recognition method, model training method and apparatus
This application claims priority to the Chinese patent application No. 202210714705.4, entitled "A target attribute recognition method, training method and apparatus based on a segmentation algorithm", filed on June 23, 2022, the entire content of which is incorporated herein by reference.
Technical Field
The present application relates to the field of computer vision, and in particular to a target attribute recognition method, a model training method and an apparatus.
Background
With the increase in video surveillance scenarios, a large amount of video data is generated. How to quickly and accurately identify targets in such large amounts of video data is a problem that urgently needs to be solved.
Summary
The present application provides a target attribute recognition method, a training method and an apparatus.
In one aspect, a target attribute recognition method is provided, specifically including:
using a preset target attribute recognition model to perform target recognition on a received image to be recognized, and outputting a target mask, the target mask being obtained by pixel space alignment based on a segmentation algorithm;
according to the target mask, using the target attribute recognition model to perform a mask operation on the image to be recognized and obtain a target mask image;
according to the target mask image, using the target attribute recognition model to perform target attribute recognition, and outputting attributes of the target in the image to be recognized, the attributes including multi-label attributes of the target.
Further, using the preset target attribute recognition model to perform target recognition on the received image to be recognized and outputting the target mask further includes:
using the target attribute recognition model to perform feature extraction on the image to be recognized and output a first feature map;
using the target attribute recognition model to perform region detection on the first feature map and output a plurality of region filtering frames;
using the target attribute recognition model to perform regional feature matching on the region filtering frames and output a second feature map, the second feature map being obtained by pixel space alignment based on the segmentation algorithm;
using the target attribute recognition model to perform region detection on the second feature map and output the target mask.
Further, the target attribute recognition model includes a feature extraction network, a first feature map pyramid network and a region generation network;
using the target attribute recognition model to perform feature extraction on the image to be recognized and output the first feature map further includes:
using the feature extraction network to perform feature extraction on the image to be recognized and output a multi-layer original feature map;
using the first feature map pyramid network to output the first feature map according to at least one layer of the original feature map;
using the target attribute recognition model to perform region detection on the first feature map and output a plurality of region filtering frames further includes: according to preset anchor frames, using the region generation network to perform region detection on the first feature map and output a plurality of region filtering frames.
Further, the target attribute recognition model includes a mask prediction branch, a regression prediction branch and a classification prediction branch;
using the target attribute recognition model to perform region detection on the second feature map and output the target mask further includes:
using the mask prediction branch to perform region prediction on the second feature map and output the target mask;
using the regression prediction branch to perform region prediction on the second feature map and output a target frame;
using the classification prediction branch to perform classification prediction on the second feature map and output a target classification.
Further, according to the target mask, using the target attribute recognition model to perform the mask operation on the image to be recognized and obtain the target mask image further includes: performing a multiplication operation on the target mask and the image to be recognized to obtain the target mask image;
according to the target mask image, using the target attribute recognition model to perform target attribute recognition and outputting the attributes of the target in the image to be recognized further includes: according to the target classification, using a corresponding attribute recognition model in the target attribute recognition model to perform target attribute recognition on the target mask image, and outputting the attributes of the target in the image to be recognized, the attribute recognition model being a multi-task multi-label classification model.
Further, according to the target mask, using the target attribute recognition model to perform the mask operation on the image to be recognized and obtain the target mask image further includes: performing a multiplication operation on the output target frame and the image to be recognized to obtain a target frame mask image, and performing a multiplication operation on the target mask and the target frame mask image to obtain the target mask image;
according to the target mask image, using the target attribute recognition model to perform target attribute recognition and outputting the attributes of the target in the image to be recognized further includes: according to the target classification, using the corresponding attribute recognition model in the target attribute recognition model to perform target attribute recognition on the target mask image, and outputting the attributes of the target in the image to be recognized, the attribute recognition model being a multi-task multi-label classification model.
Further, the feature extraction network is one of a VGG network, a GoogLeNet network, a ResNet network, and a ResNeXt network.
In another aspect, a pedestrian attribute recognition method is provided:
using a preset pedestrian attribute recognition model to perform pedestrian recognition on a received image to be recognized, and outputting a pedestrian mask, the pedestrian mask being obtained by pixel space alignment based on a segmentation algorithm;
according to the pedestrian mask, using the pedestrian attribute recognition model to perform a mask operation on the image to be recognized and obtain a pedestrian mask image;
according to the pedestrian mask image, using the pedestrian attribute recognition model to perform pedestrian attribute recognition, and outputting attributes of the pedestrian in the image to be recognized, the attributes including multi-label attributes of the pedestrian.
Further, the multi-label attributes include at least three of a gender attribute, a headgear attribute, a hairstyle attribute, a clothing attribute, a clothing color attribute, an accessory attribute, an occlusion attribute, a truncation attribute and an orientation attribute.
In yet another aspect, a model training method is provided, including:
obtaining a plurality of sample recognition images, and annotating the target of each sample recognition image according to pixel space alignment;
performing target recognition training on a target attribute recognition model using the plurality of annotated sample recognition images.
Further, the target attribute recognition model includes a mask prediction branch, a regression prediction branch and a classification prediction branch, as well as a multi-label classification loss function;
performing target recognition training on the target attribute recognition model using the plurality of annotated sample recognition images further includes:
according to a preset first accuracy threshold, the mask prediction branch, the regression prediction branch and the classification prediction branch each performing calculation through a preset loss function and adjusting model parameters;
according to a preset second accuracy threshold, adjusting model parameters of the target attribute recognition model through the multi-label classification loss function.
In a further aspect, a target attribute recognition apparatus is provided, including:
a target mask acquisition unit, configured to perform target recognition on a received image to be recognized and output a target mask, the target mask being obtained by pixel space alignment based on a segmentation algorithm;
a target mask image acquisition unit, configured to perform a mask operation on the image to be recognized according to the target mask and obtain a target mask image;
a target attribute recognition unit, configured to perform target attribute recognition on the target mask image and output attributes of the target in the image to be recognized, the attributes including multi-label attributes of the target.
In still another aspect, a pedestrian attribute recognition apparatus is provided, including:
a pedestrian mask acquisition unit, configured to perform pedestrian recognition on a received image to be recognized and output a pedestrian mask, the pedestrian mask being obtained by pixel space alignment based on a segmentation algorithm;
a pedestrian mask image acquisition unit, configured to perform a mask operation on the image to be recognized according to the pedestrian mask and obtain a pedestrian mask image;
a pedestrian attribute recognition unit, configured to perform pedestrian attribute recognition on the pedestrian mask image and output attributes of the pedestrian in the image to be recognized, the attributes including multi-label attributes of the pedestrian.
In still another aspect, a model training apparatus is provided, including:
an annotation unit, configured to obtain a plurality of sample recognition images and annotate the target of each sample recognition image according to pixel space alignment;
a training unit, configured to perform target recognition training on a target attribute recognition model using the plurality of annotated sample recognition images.
In still another aspect, a computer-readable storage medium is provided, on which a computer program is stored,
the program, when executed by a processor, implementing the method described in the one aspect;
or
the program, when executed by a processor, implementing the method described in the another aspect;
or
the program, when executed by a processor, implementing the method described in the yet another aspect.
In still another aspect, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor,
the processor, when executing the program, implementing the method described in the one aspect;
or
the processor, when executing the program, implementing the method described in the another aspect;
or
the processor, when executing the program, implementing the method described in the yet another aspect.
Brief Description of the Drawings
Figure 1 shows a flow chart of the target attribute recognition method according to an embodiment of the present application;
Figure 2 shows a block diagram of the target attribute recognition method according to another embodiment of the present application;
Figure 3 shows a schematic diagram of anchor frames according to an embodiment of the present application;
Figure 4 shows a schematic diagram of a target mask and a target mask image according to an embodiment of the present application;
Figure 5 shows a schematic diagram of an image to be recognized and target attributes according to an embodiment of the present application;
Figure 6 shows a structural diagram of a target attribute recognition apparatus according to another embodiment of the present application;
Figure 7 shows a structural diagram of a pedestrian attribute recognition apparatus according to another embodiment of the present application;
Figure 8 shows a structural diagram of a model training apparatus according to another embodiment of the present application;
Figure 9 shows a schematic structural diagram of a computer device according to another embodiment of the present application.
具体实施方式Detailed Description of Embodiments
为了更清楚地说明本申请方案,下面结合实施例和附图对本申请方案做进一步的说明。附图中相似的部件以相同的附图标记进行表示。本领域技术人员应当理解,下面所具体描述的内容是说明性的而非限制性的,不应以此限制本申请的保护范围。In order to explain the solution of the present application more clearly, the solution of the present application will be further described below in conjunction with the embodiments and drawings. Similar parts are designated with the same reference numerals in the drawings. Those skilled in the art should understand that the content specifically described below is illustrative rather than restrictive, and should not be used to limit the scope of protection of the present application.
在安防场景中,通常采用计算机视觉算法实现自动化监控。具体的,通过深度学习算法,首先对待处理图像进行分割,只保留感兴趣的区域,其次针对行人区域进行行人特征的提取,完成属性的识别。伴随监控场景的增多,产生大量的视频数据,在海量数据量的前提下,如何能够快速、准确的过滤行人属性,准确查找到目标行人成为亟待解决的问题。In security scenarios, computer vision algorithms are usually used to achieve automated monitoring. Specifically, through the deep learning algorithm, the image to be processed is first segmented and only the area of interest is retained. Secondly, pedestrian features are extracted for the pedestrian area to complete attribute recognition. With the increase of surveillance scenes, a large amount of video data is generated. Under the premise of massive data volume, how to quickly and accurately filter pedestrian attributes and accurately find target pedestrians has become an urgent problem to be solved.
在相关技术中,进行行人属性识别时,广泛使用计算机视觉技术,通过深度学习算法,例如针对行人区域进行行人特征的提取,完成属性的识别。相对于以前的传统图像处理,目前的特征提取主流方式都是使用卷积神经网络,使用深度学习的方法来解决此类问题,例如,使用YOLACT算法对行人属性背景信息进行过滤,拼接不同大小的特征图进行多任务网络预测,提出梯度权重损失函数进行模型的训练;再例如使用人体姿态关键点获取人体区域,将提取的细节关键点和浅层特征进行结合,将提取的人体区域和深层特征进行结合,将结合后的数据和深层特征分别输入到区域引导模块得到多个预测向量,将多个预测向量进行融合,得到最终的预测结果。然而,上述方法均需要进行额外的关键点检测,该步骤对设备的计算能力要求较高,并且需要增加相应的处理时间,考虑到在实际的应用中,存在数据量巨大的待识别图像时、以及实时判别能力要求时,对识别速度和准确率提出了较高要求,因此如何快速、准确地识别目标属性成为亟待解决的技术问题。In the related art, pedestrian attribute recognition makes wide use of computer vision and deep learning algorithms, for example extracting pedestrian features from the pedestrian region to complete attribute recognition. Compared with traditional image processing, the mainstream approach to feature extraction now uses convolutional neural networks and deep learning methods. For example, one method uses the YOLACT algorithm to filter out background information for pedestrian attributes, concatenates feature maps of different sizes for multi-task network prediction, and proposes a gradient-weighted loss function to train the model. Another method obtains the human body region from human pose keypoints, combines the extracted detail keypoints with shallow features and the extracted body region with deep features, feeds the combined data and the deep features into a region guidance module to obtain multiple prediction vectors, and fuses these vectors into the final prediction. However, all of the above methods require additional keypoint detection, a step that demands considerable computing power from the device and adds processing time. Considering that practical applications involve huge volumes of images to be recognized and require real-time discrimination, high demands are placed on recognition speed and accuracy. Therefore, how to quickly and accurately identify target attributes has become an urgent technical problem to be solved.
针对上述情况,如图1所示,本申请的一个实施例提供了一种目标属性识别方法,该目标属性识别方法基于分割算法实现。该方法包括:In response to the above situation, as shown in Figure 1, one embodiment of the present application provides a target attribute identification method, which is implemented based on a segmentation algorithm. The method includes:
使用预设置的目标属性识别模型对接收的待识别图像进行目标识别,并输出目标掩码,所述目标掩码为基于分割算法进行像素空间对齐获得的;Use a preset target attribute recognition model to perform target recognition on the received image to be recognized, and output a target mask, which is obtained by pixel space alignment based on a segmentation algorithm;
根据所述目标掩码,使用所述目标属性识别模型对所述待识别图像进行掩码操作并获取目标掩码图像;According to the target mask, use the target attribute recognition model to perform a mask operation on the image to be recognized and obtain a target mask image;
根据所述目标掩码图像,使用所述目标属性识别模型进行目标属性识别,并输出待识别图像的目标的属性,所述属性包括所述目标的多标签属性。According to the target mask image, the target attribute recognition model is used to perform target attribute recognition, and attributes of the target of the image to be recognized are output, where the attributes include multi-label attributes of the target.
本申请实施例通过所述基于分割算法的目标属性识别方法,相对于所述使用额外的关键点的识别方法,绕开了处理关键点的步骤,降低了对硬件的性能要求,缩短了识别时间,并且能够最大程度的过滤掉非目标区域,通过目标掩码图像进行属性识别,能够避免环境对属性识别造成的干扰,显著提高识别速度和准确率,能够实现快速的过滤和协助查找,极大的提高工作效率,具有广泛的应用前景。Compared with recognition methods that rely on additional keypoints, the segmentation-based target attribute recognition method of this embodiment bypasses the keypoint-processing step, lowers the hardware performance requirements, and shortens the recognition time. It filters out non-target regions to the greatest extent, and performing attribute recognition on the target mask image avoids interference from the environment, significantly improving recognition speed and accuracy. It enables fast filtering and assisted search, greatly improves work efficiency, and has broad application prospects.
在一个具体的示例中,如图2所示,所述属性识别分为三个步骤:In a specific example, as shown in Figure 2, the attribute identification is divided into three steps:
首先,读取所述待识别图像100,进行目标识别200,输出目标掩码300。First, the image to be recognized 100 is read, target recognition 200 is performed, and a target mask 300 is output.
在本申请实施例中,具体包括如下步骤:In the embodiment of this application, the following steps are specifically included:
对所述待识别图像100进行特征提取210,输出第一特征图220,即Feature Map,为输入图像经过神经网络卷积获取的结果,其分辨率大小取决于先前卷积核的步长。Feature extraction 210 is performed on the image 100 to be recognized, and a first feature map 220 (Feature Map) is output, i.e. the result obtained by convolving the input image through the neural network; its resolution is determined by the strides of the preceding convolution kernels.
区域检测230,即使用提取候选框的网络区域生成网络(Region Proposal Network,RPN)进行“区域选取”并输出多个区域筛选框240,区域特征匹配250并输出第二特征图260,再次进行区域检测270并输出目标掩码300。Region detection 230: a Region Proposal Network (RPN), the network that extracts candidate boxes, performs "region selection" and outputs multiple region proposal boxes 240; region feature matching 250 is performed and the second feature map 260 is output; region detection 270 is then performed again and the target mask 300 is output.
具体的,将一张待识别图像100输入到一个预置的已完成训练的主干卷积神经网络中(Backbone Convolutional Neural Networks,Backbone CNN),所述主干卷积神经网络主要用于提取所述待识别图像100的特征图以供后续网络使用。Specifically, an image 100 to be recognized is input into a preset, already-trained backbone convolutional neural network (Backbone CNN); the backbone network is mainly used to extract feature maps of the image 100 to be recognized for use by subsequent networks.
在一个可选的实施例中,所述特征提取网络为vgg网络、googlenet网络、resnet网络、或者resnext网络。In an optional embodiment, the feature extraction network is a vgg network, a googlenet network, a resnet network, or a resnext network.
通过上述特征提取网络中的一个对待识别图像进行特征提取。Feature extraction is performed on the image to be identified through one of the above feature extraction networks.
具体的,所述VGG(视觉几何组网络,Visual Geometry Group)网络中,通过使用一系列大小为3x3的小尺寸卷积核和池化层构造深度卷积神经网络,具有结构简单、应用性强的特点。Specifically, the VGG (Visual Geometry Group) network constructs a deep convolutional neural network from a series of small 3x3 convolution kernels and pooling layers, and is characterized by a simple structure and strong applicability.
在所述GoogLeNet网络中,卷积块被称为Inception块,Inception块相当于一个有4条路径的子网络,通过不同窗口形状的卷积层和最大汇聚层来并行抽取信息,并使用1×1卷积层减少每像素级别上的通道维数从而降低模型复杂度。In the GoogLeNet network, the convolution block is called an Inception block. An Inception block is equivalent to a sub-network with four paths: it extracts information in parallel through convolution layers and max-pooling layers with different window shapes, and uses 1×1 convolution layers to reduce the per-pixel channel dimension and thereby the model complexity.
ResNeXt网络同时采用了VGG网络的堆叠思想和inception块的split-transform-merge思想,具有更强的可扩展性,在增加准确率的同时基本不会改变或降低模型的复杂度。The ResNeXt network adopts both the stacking idea of the VGG network and the split-transform-merge idea of the Inception block; it is more extensible and improves accuracy while essentially leaving the model complexity unchanged.
ResNet网络是针对更深层次的神经网络难以训练的问题、提出的一种残差学习的结构,在增加了网络深度的同时减少参数的数量,在检测、分割、识别等领域获得广泛应用。The ResNet network is a residual learning structure proposed to address the problem that deeper neural networks are difficult to train. It increases the depth of the network while reducing the number of parameters, and is widely used in detection, segmentation, recognition and other fields.
在本申请实施例中,所述特征提取网络采用ResNet50网络。所述ResNet50网络输出多个特征图,本申请实施例利用特征图金字塔网络(Feature Pyramid Network,FPN)将最后三层输出的特征图进行融合并输出特征图220。In this embodiment of the present application, the feature extraction network adopts ResNet50 network. The ResNet50 network outputs multiple feature maps. This embodiment of the present application uses a feature map pyramid network (Feature Pyramid Network, FPN) to fuse the feature maps output by the last three layers and output the feature map 220.
其中,特征图金字塔网络(Feature Pyramid Network,FPN)是一种自顶向下的特征融合方法,并且是一种多尺度的目标检测算法,即使用大于1个的特征预测层,将多个阶段的特征图融合在一起,既提取高层特征图的语义特征,又提取低层的轮廓特征。The Feature Pyramid Network (FPN) is a top-down feature fusion method and a multi-scale object detection algorithm: it uses more than one feature prediction layer and fuses feature maps from multiple stages, extracting both the semantic features of the high-level feature maps and the low-level contour features.
值得说明的是,本申请对FPN网络进行特征图融合的数量不作具体限定,本领域技术人员应当根据实际应用需求,例如网络的处理速度和特征图的性能选择适当数量的特征图进行融合,在此不再赘述。It is worth noting that this application does not specifically limit the number of feature maps for FPN network fusion. Those skilled in the art should select an appropriate number of feature maps for fusion based on actual application requirements, such as the processing speed of the network and the performance of the feature maps. This will not be described again.
本申请实施例通过采用ResNet50网络提取所述待识别图像100的特征图,并进一步使用FPN网络进行特征融合并形成所述第一特征图220,能够通过ResNet50网络在各个阶段提取的特征图,既能够提取高层特征图的语义特征,又能够提取低层的轮廓特征,从而解决较小物体无法检测的问题。In this embodiment, the ResNet50 network is used to extract feature maps of the image 100 to be recognized, and the FPN network further fuses these features to form the first feature map 220. Using the feature maps that ResNet50 extracts at its various stages, both the semantic features of the high-level maps and the low-level contour features can be captured, solving the problem that smaller objects cannot be detected.
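The top-down fusion FPN performs on the backbone stages can be sketched in a few lines. This is a minimal illustration under simplifying assumptions — nearest-neighbour upsampling and identity lateral connections in place of the 1×1 lateral and smoothing convolutions a real FPN uses — not the patent's exact network:

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def fpn_fuse(c3, c4, c5):
    """Top-down fusion of the last three backbone stages.

    c3, c4, c5: (C, H, W) maps at strides 8/16/32, so c4 is half the
    spatial size of c3 and c5 half the size of c4.  Lateral 1x1
    convolutions are omitted (identity) to keep the sketch minimal.
    """
    p5 = c5
    p4 = c4 + upsample2x(p5)   # inject high-level semantics into c4
    p3 = c3 + upsample2x(p4)   # highest-resolution fused map
    return p3, p4, p5

# Toy maps: 8 channels; resolutions 32x32 / 16x16 / 8x8.
c3 = np.ones((8, 32, 32)); c4 = np.ones((8, 16, 16)); c5 = np.ones((8, 8, 8))
p3, p4, p5 = fpn_fuse(c3, c4, c5)
print(p3.shape)   # (8, 32, 32) — same resolution as c3
```

The fused `p3` keeps the fine spatial resolution of the low-level map while carrying the coarse, semantically rich signal propagated down from `c5`, which is what lets smaller objects be detected.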
基于所述第一特征图220,输入RPN网络进行区域检测230,从而提取所述区域筛选框240。The first feature map 220 is fed into the RPN network for region detection 230, thereby extracting the region proposal boxes 240.
具体的,将所述第一特征图220进行3×3的卷积操作,得到一个通道(channel)数256的特征图,其尺寸和所述第一特征图220相同。例如,所述第一特征图220的长为H,宽为W,则所述通道数为256的特征图,视为具有H×W个向量,每个向量是256维,继续对此向量做两次全连接操作,分别得到2个分数和4个坐标,等同于对所述通道数为256的特征图做两次1×1的卷积,得到一个2×H×W和一个4×H×W大小的特征图。Specifically, a 3×3 convolution is applied to the first feature map 220, yielding a feature map with 256 channels of the same size as the first feature map 220. For example, if the first feature map 220 has height H and width W, the 256-channel feature map can be viewed as H×W vectors of 256 dimensions each. Applying two fully connected operations to each vector yields 2 scores and 4 coordinates respectively, which is equivalent to performing two 1×1 convolutions on the 256-channel feature map, producing a 2×H×W and a 4×H×W feature map.
具体的,2×H×W的特征图,即2个置信度,表示前景和背景的分数,因为所述RPN网络只负责提取所述区域筛选框240,不需要判断所述待识别图像100中物品的类别,因此利用前景和背景的置信度判断是否为物品;4×H×W大小的特征图,即4个坐标,表示在所述待识别图像100中的偏移坐标(x,y,w,h)。Specifically, the 2×H×W feature map holds 2 confidence scores representing the foreground and background. Since the RPN network is only responsible for extracting the region proposal boxes 240 and does not need to determine the category of the objects in the image 100 to be recognized, the foreground/background confidences are used to judge whether something is an object. The 4×H×W feature map holds 4 coordinates, representing the offset coordinates (x, y, w, h) in the image 100 to be recognized.
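The equivalence noted above — two fully connected operations per 256-dimensional pixel vector being the same as two 1×1 convolutions — can be checked numerically. The weights below are random and purely illustrative; only the output shapes (2×H×W scores, 4×H×W offsets) follow the description:

```python
import numpy as np

# A 1x1 convolution over a (C, H, W) map is exactly a fully connected
# layer applied independently at every pixel: here 256 channels are
# mapped to 2 objectness scores and 4 box offsets, as in the RPN heads.
rng = np.random.default_rng(0)
feat = rng.standard_normal((256, 5, 7))   # toy 256-channel map, H=5, W=7

w_cls = rng.standard_normal((2, 256))     # 1x1 conv kernels == FC weights
w_reg = rng.standard_normal((4, 256))

scores = np.einsum('oc,chw->ohw', w_cls, feat)  # (2, H, W) fg/bg scores
coords = np.einsum('oc,chw->ohw', w_reg, feat)  # (4, H, W) offsets (x, y, w, h)

print(scores.shape, coords.shape)   # (2, 5, 7) (4, 5, 7)
```

At any single pixel, the 1×1 convolution output equals the fully connected product `w_cls @ feat[:, i, j]`, which is why the two descriptions are interchangeable.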
值得注意的是,所述偏移坐标是所述待识别图像100的坐标,因所述待识别图像100与所述第一特征图220的宽和高不同,为了获取所述待识别图像100中的图片坐标,引入锚点(Anchor)。具体包括:It is worth noting that the offset coordinates are coordinates in the image 100 to be recognized. Since the width and height of the image 100 to be recognized differ from those of the first feature map 220, anchor points (Anchors) are introduced to obtain the picture coordinates in the image 100 to be recognized. Specifically:
在所述第一特征图220中随机选取一个点,该点能够映射到所述待识别图像100的一个框,例如所述待识别图像100与所述第一特征图220的缩放比例为8:1,则所述映射的框为8×8,设置此框的左上角或者中心点为所述锚点,基于此锚点按照预先配置的规则生成若干锚框(Anchor Box),每个锚框的大小由缩放比(scale)和宽高比(ratio)两个参数来确定,例如预先设置scale=[128],ratio=[0.5,1,1.5],则每个像素点可以产生3个不同大小的框。如图3所示,三个框面积相同,通过ratio的值来改变其长宽比,从而产生不同形状的框。值得注意的是,本申请对所述锚框的个数、缩放比例、以及宽高比不作具体限定,本领域技术人员应当根据实际应用需求,例如网络的处理速度和性能进行适当的选择,在此不再赘述。A point is randomly selected in the first feature map 220; this point maps to a box in the image 100 to be recognized. For example, if the scaling ratio between the image 100 to be recognized and the first feature map 220 is 8:1, the mapped box is 8×8. The upper-left corner or centre of this box is set as the anchor point, and several anchor boxes are generated from it according to pre-configured rules. The size of each anchor box is determined by two parameters, scale and aspect ratio (ratio); for example, with scale=[128] and ratio=[0.5,1,1.5] preset, each pixel point can produce 3 boxes of different shapes. As shown in Figure 3, the three boxes have the same area, and the ratio value changes their aspect ratio, producing boxes of different shapes. It is worth noting that this application does not specifically limit the number, scale, or aspect ratio of the anchor boxes; those skilled in the art should make an appropriate choice according to actual application requirements, such as the processing speed and performance of the network, which will not be repeated here.
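The scale/ratio anchor generation described above can be sketched as follows. The w/h parametrisation (w = s·√r, h = s/√r) is one common convention that keeps the area equal to scale² for every ratio, matching the equal-area boxes of Figure 3; the patent does not fix the exact formula:

```python
import numpy as np

def make_anchors(cx, cy, scales=(128,), ratios=(0.5, 1, 1.5)):
    """Generate anchor boxes (x1, y1, x2, y2) centred on an anchor point.

    All boxes of one scale share the same area (scale**2); the ratio
    only reshapes them.  This w/h parametrisation is a common
    convention, assumed here for illustration.
    """
    boxes = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)   # area w*h == s*s for every ratio
            h = s / np.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

anchors = make_anchors(64, 64)   # scale=[128], ratio=[0.5, 1, 1.5]
areas = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
print(anchors.shape)                   # (3, 4): one box per ratio
print(np.allclose(areas, 128 * 128))   # True — equal areas, different shapes
```

Repeating this at every feature-map point gives the H×W×K candidate set that the RPN then scores.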
在本申请实施例中,例如所述锚框个数为K,即每个锚点产生K个框,所述第一特征图220,包含H×W个点,每个点对应所述待识别图像100有K个框,则总共有H×W×K个所述区域筛选框240,通过所述RPN判断这些框是否是物体以及其在所述待识别图像100上的偏移坐标,即得到所述区域筛选框240。In this embodiment, suppose the number of anchor boxes is K, i.e. each anchor point produces K boxes. The first feature map 220 contains H×W points, and each point corresponds to K boxes in the image 100 to be recognized, so there are H×W×K region proposal boxes 240 in total. The RPN judges whether each of these boxes is an object and determines its offset coordinates on the image 100 to be recognized, yielding the region proposal boxes 240.
进一步地,考虑到相关技术中采用的感兴趣区域池化(region of interest pooling,ROI Pooling)来处理候选区域尺寸不同的问题,由于ROI Pooling采用向下取整的方式容易导致产生误差且无法保证所述特征层和所述输入层像素精确对应,无法达到语义分割任务的要求。因此本申请实施例采用ROI对齐(ROI Align)的方式,取消取整操作,改用双线性插值得到固定四个点坐标的像素值,从而使得不连续的操作变得连续起来,能够有效降低误差,实现所述像素空间对齐,完成所述区域特征匹配250。换句话说,本申请实施例使用ROI Align的方式实现对所述区域筛选框进行区域特征匹配250并输出第二特征图260,即基于分割算法进行像素空间对齐获得第二特征图260;实现在所述待识别图像100中识别出待测物品(前景物体)的精确的坐标像素值。Furthermore, the related art handles candidate regions of different sizes with region of interest pooling (ROI Pooling). Because ROI Pooling rounds coordinates down, it easily introduces errors and cannot guarantee exact pixel correspondence between the feature layer and the input layer, so it cannot meet the requirements of a semantic segmentation task. This embodiment therefore adopts ROI Align, which cancels the rounding operation and instead uses bilinear interpolation to obtain the pixel values at four fixed sampling coordinates, making the formerly discontinuous operation continuous. This effectively reduces the error, achieves the pixel space alignment, and completes the region feature matching 250. In other words, this embodiment uses ROI Align to perform region feature matching 250 on the region proposal boxes and output the second feature map 260, i.e. the second feature map 260 is obtained through pixel space alignment based on the segmentation algorithm, so that the precise coordinate pixel values of the object to be detected (the foreground object) are identified in the image 100 to be recognized.
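The bilinear interpolation that ROI Align substitutes for rounding can be illustrated on a tiny feature map. This is a generic sketch of the operation only, not the patent's implementation (which samples four such points per output bin and then pools them):

```python
import numpy as np

def bilinear(fmap, x, y):
    """Bilinearly interpolate a (H, W) feature map at a fractional
    point (x, y) — the operation ROI Align uses instead of rounding.
    Assumes the four neighbouring pixels lie inside the map."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    dx, dy = x - x0, y - y0
    return (fmap[y0, x0] * (1 - dx) * (1 - dy) +
            fmap[y0, x1] * dx * (1 - dy) +
            fmap[y1, x0] * (1 - dx) * dy +
            fmap[y1, x1] * dx * dy)

f = np.array([[0.0, 1.0],
              [2.0, 3.0]])
print(bilinear(f, 0.5, 0.5))   # 1.5 — the exact sub-pixel value,
                               # where rounding would snap to a corner pixel
```

Because the result varies continuously with (x, y), the quantisation error that ROI Pooling introduces by rounding disappears, which is what "making the discontinuous operation continuous" refers to.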
本申请实施例考虑到目标属性识别模型的性能和准确度的要求,一方面通过引入RPN网络进行区域检测能够显著提高检测速度,并且更容易与其他神经网络结合;另一方面通过采用ROI Align的方式实现所述像素空间对齐,能够有效降低误差。Considering the performance and accuracy requirements of the target attribute recognition model, this embodiment, on the one hand, introduces the RPN network for region detection, which significantly increases detection speed and combines more easily with other neural networks; on the other hand, it achieves the pixel space alignment through ROI Align, which effectively reduces errors.
基于所述第二特征图260再次进行区域检测270并获取所述目标掩码300。具体包括分别将所述第二特征图输入三个预测分支。Based on the second feature map 260, the region detection 270 is performed again and the target mask 300 is obtained. Specifically, it includes inputting the second feature map into three prediction branches respectively.
具体的,将所述第二特征图260引入所述分类预测分支进行分类预测并输出目标分类,在一个全连接层后接入一个softmax层,softmax层接收一个N维向量作为输入,把每一维的值转换成(0,1)之间的一个实数,实现将所述全连接层的输出映射成一个概率的分布,在本申请实施例中具体用于实现前景和背景分类。Specifically, the second feature map 260 is fed into the classification prediction branch for classification prediction and the target classification is output. A softmax layer follows a fully connected layer: the softmax layer receives an N-dimensional vector as input and converts the value of each dimension into a real number in (0, 1), thereby mapping the output of the fully connected layer to a probability distribution; in this embodiment it is specifically used for foreground/background classification.
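A minimal sketch of the softmax mapping just described, turning fully connected outputs into a probability distribution over foreground and background (the logit values are invented for illustration):

```python
import numpy as np

def softmax(v):
    """Map an N-dim vector to values in (0, 1) that sum to 1."""
    e = np.exp(v - v.max())   # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, -1.0])   # FC outputs: (foreground, background)
probs = softmax(logits)
print(probs.sum())               # 1.0 — a valid probability distribution
print(probs[0] > probs[1])       # True — classified as foreground
```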
具体的,将所述第二特征图260引入所述回归预测分支进行回归预测并输出目标框,在一个全连接层后接入一个边框回归层(Bounding Box Regression,bbox reg),通过回归预测得到更加精确的坐标像素值,所述坐标像素值为所述待识别图像100中识别出待测物品(前景物体)的精确坐标。Specifically, the second feature map 260 is fed into the regression prediction branch for regression prediction and the target box is output. A bounding box regression layer (bbox reg) follows a fully connected layer; the regression prediction yields more accurate coordinate pixel values, namely the precise coordinates of the object to be detected (the foreground object) identified in the image 100 to be recognized.
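As an illustration of how a bounding-box regression output refines a proposal, the sketch below applies predicted offsets (dx, dy, dw, dh) using the common R-CNN parametrisation; the patent does not spell out its exact formula, so this parametrisation is an assumption:

```python
import numpy as np

def apply_deltas(box, deltas):
    """Refine an (x, y, w, h) proposal with predicted offsets
    (dx, dy, dw, dh).  Centre shifts are proportional to box size,
    width/height are scaled multiplicatively — the usual R-CNN
    convention, assumed here for illustration."""
    x, y, w, h = box
    dx, dy, dw, dh = deltas
    return (x + dx * w,
            y + dy * h,
            w * np.exp(dw),
            h * np.exp(dh))

refined = apply_deltas((100.0, 50.0, 64.0, 128.0), (0.1, -0.05, 0.0, 0.2))
print(refined[0])   # 106.4 — centre nudged right by dx * w
```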
具体的,将所述第二特征图260引入所述掩码预测分支进行预测并输出目标掩码,在一个头部层(Head)后接入一个全连接层,所述头部层将所述第二特征图260的输出维度扩大,增加掩码预测精确度,然后在每一个ROI里面进行全连接网络(FCN)操作,生成如图4所示的所述目标掩码300。Specifically, the second feature map 260 is fed into the mask prediction branch and the target mask is output. A fully connected layer follows a head layer (Head): the head layer expands the output dimension of the second feature map 260 to increase mask prediction accuracy, and a fully connected network (FCN) operation is then performed within each ROI to generate the target mask 300 shown in Figure 4.
本申请实施例通过三个分支分别操作获得目标分类、目标框和目标掩码。The embodiment of this application obtains target classification, target box and target mask through three branch operations respectively.
在一个可选的实施例中,本申请实施例通过三个分支依次操作,例如在预测阶段先进行所述分类预测和回归预测操作,将所得结果传入所述掩码预测分支,快速、准确的得到所述目标掩码。In an optional embodiment, the three branches operate sequentially: for example, in the prediction stage the classification prediction and regression prediction are performed first, and their results are passed into the mask prediction branch, so that the target mask is obtained quickly and accurately.
其次,使用所述目标掩码300和所述待识别图像100进行掩码操作400,并输出目标掩码图像500。在本申请实施例中,如图4所述,所述目标掩码300包含两种元素0和1,0代表黑色,1代表透明。所述掩码操作400为根据所述目标掩码300生成切片图片,即所述待识别图像100和所述目标掩码300之间进行相乘操作,所述目标掩码300的0将原图片RGB数值置为0,所述目标掩码300的1不改变所述待识别图像100的RGB数值。如图4所示,生成的所述目标掩码图像500是将待测目标从图像中分割出来。所述目标掩码图像500不包含环境的背景,能够有效降低环境带来的噪声。Secondly, a mask operation 400 is performed with the target mask 300 and the image 100 to be recognized, and the target mask image 500 is output. In this embodiment, as shown in Figure 4, the target mask 300 contains two element values, 0 and 1, where 0 represents black and 1 represents transparent. The mask operation 400 generates a slice picture from the target mask 300: the image 100 to be recognized and the target mask 300 are multiplied element-wise, so a 0 in the target mask 300 sets the corresponding RGB values of the original picture to 0, while a 1 leaves the RGB values of the image 100 unchanged. As shown in Figure 4, the generated target mask image 500 segments the target to be detected out of the image. The target mask image 500 contains no environmental background and can effectively reduce the noise introduced by the environment.
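The mask operation 400 — element-wise multiplication of the 0/1 target mask with the image — can be shown on a toy 2×2 RGB image:

```python
import numpy as np

# Toy 2x2 RGB image and a binary mask (1 = keep pixel, 0 = black out),
# mirroring the multiplication between image 100 and mask 300.
image = np.array([[[10, 20, 30], [40, 50, 60]],
                  [[70, 80, 90], [11, 12, 13]]], dtype=np.uint8)
mask = np.array([[1, 0],
                 [0, 1]], dtype=np.uint8)

masked = image * mask[:, :, None]   # broadcast mask over the RGB channels

print(masked[0, 1].tolist())   # [0, 0, 0]   — background pixel zeroed
print(masked[0, 0].tolist())   # [10, 20, 30] — target pixel unchanged
```

Pixels where the mask is 0 become black, while target pixels pass through untouched, which is exactly how the background is stripped from the target mask image 500.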
最后,使用所述目标掩码图像500,进行目标属性识别600,并输出目标属性700。Finally, the target mask image 500 is used to perform target attribute recognition 600, and the target attribute 700 is output.
在本申请实施例中,对所述目标掩码图像500进行卷积操作,通过多层卷积操作,进行多任务多标签分类,识别结果如图5所示。In this embodiment of the present application, a convolution operation is performed on the target mask image 500, and multi-task multi-label classification is performed through multi-layer convolution operations. The recognition results are shown in Figure 5.
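Multi-task multi-label classification differs from the foreground/background softmax in that each attribute is an independent binary label; a common convention (assumed here, not stated in the patent) is a per-label sigmoid with a 0.5 threshold. The attribute names and logit values below are invented for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical per-attribute logits from the final convolutional layers;
# each attribute is scored independently, so a sigmoid is applied per
# output rather than a softmax across all labels.
labels = ["male", "hat", "long hair", "backpack"]
logits = np.array([1.8, -2.3, 0.4, 2.6])

scores = sigmoid(logits)
predicted = [l for l, s in zip(labels, scores) if s > 0.5]  # 0.5 threshold
print(predicted)   # ['male', 'long hair', 'backpack']
```

Because the labels are independent, any subset of attributes can fire at once, which is what makes the output "multi-label" rather than a single-class decision.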
至此,本申请实施例使用所述基于分割算法的目标属性识别方法,完成对所述待识别图像100的目标属性识别,并输出目标属性700。At this point, the embodiment of the present application uses the target attribute recognition method based on the segmentation algorithm to complete the target attribute recognition of the image 100 to be recognized, and output the target attribute 700.
在一个可选的实施例中,先使用所述目标属性识别模型的掩码预测分支对所述第二特征图进行区域预测并输出目标掩码,将所述目标掩码与所述待识别图像进行乘法操作并获取目标掩码图像;再使用所述目标属性识别模型进行目标属性识别并输出待识别图像的目标的属性,具体的,根据所述目标分类使用所述目标属性识别模型中对应的属性识别模型对所述目标掩码图像进行目标属性识别,并输出待识别图像的目标的属性,所述属性识别模型为多任务多标签分类模型。In an optional embodiment, the mask prediction branch of the target attribute recognition model first performs region prediction on the second feature map and outputs a target mask; the target mask is multiplied with the image to be recognized to obtain the target mask image. The target attribute recognition model is then used for target attribute recognition and outputs the attributes of the target in the image to be recognized. Specifically, according to the target classification, the corresponding attribute recognition model within the target attribute recognition model performs target attribute recognition on the target mask image and outputs the attributes of the target; the attribute recognition model is a multi-task multi-label classification model.
本申请实施例通过掩码预测分支获取的目标掩码与待识别图像进行乘法操作以获取目标掩码图像,并根据分类预测分支获取的目标分类选择对应的多任务多标签分类模型的属性识别模型对目标掩码图像进行属性识别并输出目标的属性。In this embodiment, the target mask obtained by the mask prediction branch is multiplied with the image to be recognized to obtain the target mask image; according to the target classification obtained by the classification prediction branch, the attribute recognition model of the corresponding multi-task multi-label classification model is selected to perform attribute recognition on the target mask image and output the attributes of the target.
具体的,例如目标为车辆,则选择对应的多任务多标签车辆分类模型的属性识别模型对目标掩码图像进行属性识别并输出车辆的属性;例如目标为狗,则选择对应的多任务多标签狗分类模型的属性识别模型对目标掩码图像进行属性识别并输出狗的属性;例如目标为行人,则选择对应的多任务多标签行人分类模型的属性识别模型对目标掩码图像进行属性识别并输出行人的属性。Specifically, if the target is a vehicle, the attribute recognition model of the corresponding multi-task multi-label vehicle classification model is selected to perform attribute recognition on the target mask image and output the vehicle's attributes; if the target is a dog, the attribute recognition model of the corresponding multi-task multi-label dog classification model is selected to output the dog's attributes; if the target is a pedestrian, the attribute recognition model of the corresponding multi-task multi-label pedestrian classification model is selected to output the pedestrian's attributes.
在另一个可选的实施例中,使用所述目标属性识别模型对所述待识别图像进行掩码操作并获取目标掩码图像,具体的,先将所述输出目标框与所述待识别图像进行乘法操作以获取目标框掩码图像,再将所述目标掩码与目标框掩码图像进行乘法操作并获取目标掩码图像;再使用所述属性识别模型进行目标属性识别并输出待识别图像的目标的属性,具体的,根据所述目标分类使用所述目标属性识别模型中对应的属性识别模型对所述目标掩码图像进行目标属性识别,并输出待识别图像的目标的属性,所述属性识别模型为多任务多标签分类模型。In another optional embodiment, the target attribute recognition model performs the mask operation on the image to be recognized to obtain the target mask image. Specifically, the output target box is first multiplied with the image to be recognized to obtain a target-box mask image, and the target mask is then multiplied with the target-box mask image to obtain the target mask image. The attribute recognition model is then used for target attribute recognition and outputs the attributes of the target in the image to be recognized: according to the target classification, the corresponding attribute recognition model within the target attribute recognition model performs target attribute recognition on the target mask image and outputs the attributes of the target; the attribute recognition model is a multi-task multi-label classification model.
本申请实施例通过回归预测分支获取的目标框与待识别图像进行乘法操作以获取目标框掩码图像,然后通过掩码预测分支获取的目标掩码与目标框掩码图像进行乘法操作以获取目标掩码图像,并根据分类预测分支获取的目标分类选择对应的多任务多标签分类模型的属性识别模型对目标掩码图像进行属性识别并输出目标的属性,能够进一步提高目标掩码图像的获取准确率。In this embodiment, the target box obtained by the regression prediction branch is multiplied with the image to be recognized to obtain the target-box mask image, and the target mask obtained by the mask prediction branch is then multiplied with the target-box mask image to obtain the target mask image; according to the target classification obtained by the classification prediction branch, the attribute recognition model of the corresponding multi-task multi-label classification model is selected to perform attribute recognition on the target mask image and output the attributes of the target. This can further improve the accuracy with which the target mask image is obtained.
本申请实施例选用ResNet50网络提取所述待识别图像100的多个阶段的特征图,并进一步使用FPN网络,将至少一个阶段的特征融合在一起,形成所述第一特征图220,从而利用ResNet50网络各个阶段提取到的特征,既提取高层特征图的语义特征,又提取低层的轮廓特征,解决较小物体无法检测的问题;同时,本申请实施例引入RPN网络进行区域检测,所述RPN网络不需要查找所有的区域筛选框,能够显著提高检测速度,并且更容易与其他神经网络结合;另一方面,本申请实施例采用ROI Align的方式实现所述像素空间对齐,能够有效降低误差;然后,进行所述分类预测和回归预测操作,将所得结果传入所述掩码预测分支,快速、准确的得到所述目标掩码300;所述目标掩码图像500不包含环境的背景能够有效降低环境带来的噪声;对所述目标掩码图像500进行卷积操作,通过多层卷积操作,对提取的特征进行多任务多标签分类,从而提取出所述目标属性700;本申请实施例能够实现快速的过滤和协助查询,极大的提高工作效率,具有广泛的应用前景。This embodiment selects the ResNet50 network to extract feature maps of the image 100 to be recognized at multiple stages, and further uses the FPN network to fuse the features of at least one stage into the first feature map 220; using the features extracted at the various stages of ResNet50, both the semantic features of the high-level maps and the low-level contour features are captured, solving the problem that smaller objects cannot be detected. Meanwhile, this embodiment introduces the RPN network for region detection; the RPN network does not need to search all possible region proposal boxes, which significantly increases detection speed and makes it easier to combine with other neural networks. On the other hand, this embodiment achieves the pixel space alignment through ROI Align, which effectively reduces errors. The classification prediction and regression prediction are then performed and their results are passed into the mask prediction branch, obtaining the target mask 300 quickly and accurately. The target mask image 500 contains no environmental background and can effectively reduce the noise introduced by the environment. A convolution operation is performed on the target mask image 500, and multi-task multi-label classification is applied to the extracted features through multi-layer convolution operations, thereby extracting the target attributes 700. This embodiment enables fast filtering and assisted queries, greatly improves work efficiency, and has broad application prospects.
基于本申请实施例的目标属性识别方法,在实际应用中,例如在商超、街道等监控场景中,本申请实施例所述目标属性识别方法进一步能够扩展成一种基于分割算法的行人属性识别方法,其中与本申请所述第一个实施例相同和共性的部分不再赘述,仅针对行人识别特殊的部分做出具体说明。在安防领域伴随需监控的场景增多,人流密集程度增加,监控时长普遍需要7×24小时,导致监控数据量激增,在此情况下,如单纯依靠人力进行排查,耗时耗力,且准确性无法保证,因此迫切需要利用计算机视觉算法来完成自动化的监控,实现快速识别和精确查找。Based on the target attribute recognition method of the embodiments above, in practical applications such as surveillance of shopping malls and streets, the method can be further extended into a pedestrian attribute recognition method based on a segmentation algorithm; parts that are the same as in the first embodiment of this application are not repeated, and only the parts specific to pedestrian recognition are explained. In the security field, as the number of scenes to be monitored and the density of pedestrian traffic increase, and monitoring generally runs 7×24 hours, the volume of monitoring data surges. In this situation, relying solely on manual screening is time-consuming and labor-intensive, and accuracy cannot be guaranteed; there is therefore an urgent need for computer vision algorithms that complete automated monitoring and achieve fast recognition and precise search.
在行人识别领域中,行人属性是行人识别过程中最为关键的因素,利用计算机视觉的方式,通过深度学习算法,利用卷积神经网络灵活和快速的优势,对待识别图像进行分割,只保留感兴趣的行人区域,并针对行人区域进行行人特征提取,完成行人属性的识别,能够极大的提高工作效率。In the field of pedestrian recognition, pedestrian attributes are the most critical factor in the recognition process. Using computer vision and deep learning algorithms, and exploiting the flexibility and speed of convolutional neural networks, the image to be recognized is segmented so that only the pedestrian region of interest is retained, pedestrian features are extracted from that region, and pedestrian attribute recognition is completed, which can greatly improve work efficiency.
本申请的第二个实施例提供了一种行人属性识别方法,该行人属性识别方法基于分割算法实现。该方法包括:The second embodiment of the present application provides a method for identifying pedestrian attributes, which is implemented based on a segmentation algorithm. The method includes:
使用预设置的行人属性识别模型对接收的待识别图像进行行人识别,并输出行人掩码,所述行人掩码为基于分割算法进行像素空间对齐获得的;Use a preset pedestrian attribute recognition model to perform pedestrian recognition on the received image to be recognized, and output a pedestrian mask, which is obtained by pixel space alignment based on a segmentation algorithm;
根据所述行人掩码,使用所述行人属性识别模型对所述待识别图像进行掩码操作并获取行人掩码图像;According to the pedestrian mask, use the pedestrian attribute recognition model to perform a mask operation on the image to be recognized and obtain a pedestrian mask image;
根据所述行人掩码图像,使用所述行人属性识别模型进行行人属性识别,并输出待识别图像的行人的属性,所述属性包括所述行人的多标签属性。According to the pedestrian mask image, the pedestrian attribute recognition model is used to perform pedestrian attribute recognition, and the attributes of the pedestrian in the image to be recognized are output, where the attributes include multi-label attributes of the pedestrian.
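The three steps above ultimately hinge on a masking step: a binary pedestrian mask is multiplied element-wise with the image so that only pedestrian pixels survive for attribute recognition. A minimal single-channel sketch of that mask operation (the function name and toy data are illustrative assumptions, not part of the claimed method):

```python
def apply_mask(image, mask):
    """Element-wise multiply a binary mask with a single-channel image.

    Pixels where the mask is 0 (background) are zeroed out, so only the
    pedestrian region of interest is retained for attribute recognition.
    """
    return [
        [pixel * m for pixel, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]

# Toy 4x4 grayscale "image" and a binary pedestrian mask (1 = pedestrian pixel).
image = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 11, 12, 13],
    [14, 15, 16, 17],
]
mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
masked = apply_mask(image, mask)
```

In a real implementation the same multiplication is applied per color channel over the whole frame.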
在本申请实施例中，行人属性识别方法能够使用预设置的行人属性识别模型对接收的待识别图像进行行人识别，并输出行人掩码，利用所述行人掩码从所述待识别图像中分割出所述行人掩码图像，并针对所述行人掩码图像进行所述行人属性的识别，最终输出所述行人的多标签属性，具有较高的识别度和准确率。In the embodiment of the present application, the pedestrian attribute recognition method can use a preset pedestrian attribute recognition model to perform pedestrian recognition on the received image to be recognized and output a pedestrian mask, use the pedestrian mask to segment the pedestrian mask image from the image to be recognized, and perform pedestrian attribute recognition on the pedestrian mask image, finally outputting the multi-label attributes of the pedestrian with a high degree of recognizability and accuracy.
在一个可选的实施例中,所述多标签属性包括性别属性、头饰属性、发型属性、服饰属性、服饰颜色属性、配饰属性、遮挡属性、截断属性和朝向属性中的至少三个。In an optional embodiment, the multi-label attributes include at least three of gender attributes, headgear attributes, hairstyle attributes, clothing attributes, clothing color attributes, accessory attributes, occlusion attributes, truncation attributes and orientation attributes.
以一个具体的示例进行说明:Let’s illustrate with a specific example:
首先获取一张所述待识别图像100，所述待识别图像的来源，包括但不限定于视频文件中的某一帧或者监控视频流中的某一帧，将所述待识别图像100输入到一个预置的已完成训练的主干卷积神经网络中，综合考虑到识别速度和识别精度的要求，选用ResNet系列的ResNet50网络提取多个阶段的特征图，再引入特征图金字塔网络(Feature Pyramid Network,FPN)，将至少一个阶段的特征图融合在一起并输出第一特征图，既提取高层特征图的语义特征，又提取低层的轮廓特征。基于所述第一特征图220输入RPN网络进行区域检测230以提取所述区域筛选框240，再采用ROI对齐(ROI Align)的方式实现所述像素空间对齐以完成区域特征匹配250并输出第二特征图260，再次进行区域检测270以提取所述行人掩码300，然后将所述行人掩码300和所述待识别图像100进行所述掩码操作400并输出行人掩码图像500。所述行人掩码图像500不包含环境的背景，能够有效降低环境带来的噪声。First, an image to be recognized 100 is obtained; its source includes, but is not limited to, a frame of a video file or a frame of a surveillance video stream. The image to be recognized 100 is input into a preset, trained backbone convolutional neural network. Taking both recognition speed and recognition accuracy requirements into account, the ResNet50 network of the ResNet series is selected to extract feature maps at multiple stages, and a Feature Pyramid Network (FPN) is then introduced to fuse the feature maps of at least one stage together and output a first feature map, extracting both the semantic features of the high-level feature maps and the low-level contour features. Based on the first feature map 220, an RPN network performs region detection 230 to extract region filtering frames 240; ROI Align is then used to achieve the pixel space alignment, completing regional feature matching 250 and outputting a second feature map 260. Region detection 270 is performed again to extract the pedestrian mask 300, and the mask operation 400 is then performed on the pedestrian mask 300 and the image to be recognized 100 to output a pedestrian mask image 500. The pedestrian mask image 500 does not contain the environmental background, which can effectively reduce the noise introduced by the environment.
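The pixel space alignment provided by ROI Align comes from sampling feature values at fractional coordinates with bilinear interpolation, instead of rounding them to the nearest cell as ROI Pooling does. A simplified single-channel sketch of that sampling step (the real operator additionally averages several sampled points per output cell; this is an illustration, not the patented implementation):

```python
def bilinear_sample(feature_map, x, y):
    """Sample feature_map at fractional coordinates (x, y) by bilinear
    interpolation, as ROI Align does, avoiding quantization error."""
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, len(feature_map[0]) - 1)
    y1 = min(y0 + 1, len(feature_map) - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the top and bottom rows, then along y.
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

# Tiny 2x2 feature map; (0.5, 0.5) falls exactly between all four cells.
fmap = [
    [0.0, 1.0],
    [2.0, 3.0],
]
center = bilinear_sample(fmap, 0.5, 0.5)
```

Because no coordinates are rounded, the extracted region features stay aligned with the mask prediction at pixel granularity.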
最后，对所述行人掩码图像500进行多任务多标签分类并输出行人属性，识别结果如图5所示，所述行人属性包括但不限定于性别属性、头饰属性、发型属性、服饰属性、服饰颜色属性、配饰属性、遮挡属性、截断属性和朝向属性中至少三个属性。Finally, multi-task multi-label classification is performed on the pedestrian mask image 500 and the pedestrian attributes are output; the recognition results are shown in Figure 5. The pedestrian attributes include, but are not limited to, at least three of the gender attribute, headgear attribute, hairstyle attribute, clothing attribute, clothing color attribute, accessory attribute, occlusion attribute, truncation attribute and orientation attribute.
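Multi-label classification differs from single-label softmax classification in that each attribute receives an independent sigmoid score, so several attributes can be predicted for one pedestrian at once. A simplified sketch of the decision step (the attribute names and the 0.5 threshold are illustrative assumptions):

```python
import math

def predict_multilabel(logits, labels, threshold=0.5):
    """Turn per-attribute logits into a set of predicted attribute labels.

    Each attribute is scored independently with a sigmoid, so a pedestrian
    can carry several labels simultaneously (unlike softmax classification,
    which forces exactly one winning class).
    """
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [label for label, p in zip(labels, probs) if p > threshold]

# Hypothetical logits from the classification head for three attributes.
attrs = predict_multilabel([2.0, -1.0, 0.5], ["female", "hat", "backpack"])
```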
本申请的第三个实施例提供了一种模型训练方法,包括:The third embodiment of this application provides a model training method, including:
获取多个样本识别图像,并对各样本识别图像的目标按照像素空间对齐方式进行标注;Obtain multiple sample recognition images, and label the targets of each sample recognition image according to the pixel space alignment;
使用已标注的多个样本识别图像对目标属性识别模型进行目标识别训练。Use the labeled multiple sample recognition images to train the target attribute recognition model for target recognition.
在本申请实施例中，通过已标注的样本识别图像对目标属性识别模型进行训练，例如将已标注的行人、车辆、狗的样本识别图像输入到目标属性识别模型中，通过目标属性识别模型的目标识别模型对所述样本识别图像进行特征提取并输出第一特征图，通过目标属性识别模型的区域检测模型对第一特征图进行区域检测并输出多个区域筛选框，通过目标属性识别模型的区域特征匹配模型对区域筛选框进行区域特征匹配并输出第二特征图，通过目标属性识别模型的区域检测模型对第二特征图进行区域检测并输出目标掩码；再通过掩码操作获取目标掩码图像，并利用目标属性识别模型的属性识别模型进行目标属性识别，根据获取的目标属性判断目标属性识别模型的准确率，如果未达到预设目标，则进一步调整参数并继续训练，直到满足预设目标为止。In the embodiment of the present application, the target attribute recognition model is trained with labeled sample recognition images. For example, labeled sample recognition images of pedestrians, vehicles and dogs are input into the target attribute recognition model. The target recognition model of the target attribute recognition model performs feature extraction on the sample recognition images and outputs a first feature map; the region detection model of the target attribute recognition model performs region detection on the first feature map and outputs a plurality of region filtering frames; the regional feature matching model performs regional feature matching on the region filtering frames and outputs a second feature map; the region detection model then performs region detection on the second feature map and outputs a target mask. A target mask image is then obtained through the mask operation, and the attribute recognition model of the target attribute recognition model performs target attribute recognition. The accuracy of the target attribute recognition model is judged according to the obtained target attributes; if the preset goal is not reached, the parameters are further adjusted and training continues until the preset goal is met.
进一步地，所述目标属性识别模型包括掩码预测分支、回归预测分支和分类预测分支，以及多标签分类损失函数，所述使用已标注的多个样本识别图像对目标属性识别模型进行目标识别训练进一步包括：Further, the target attribute recognition model includes a mask prediction branch, a regression prediction branch and a classification prediction branch, as well as a multi-label classification loss function. Using the labeled sample recognition images to perform target recognition training on the target attribute recognition model further includes:
根据预设置的第一准确率阈值,所述掩码预测分支、回归预测分支和分类预测分支分别通过预设置的损失函数进行计算并调整模型参数;According to the preset first accuracy threshold, the mask prediction branch, regression prediction branch and classification prediction branch are respectively calculated through the preset loss function and the model parameters are adjusted;
根据预设置的第二准确率阈值,通过所述多标签分类损失函数对所述属性识别模型进行模型参数调整。According to the preset second accuracy threshold, the model parameters of the attribute recognition model are adjusted through the multi-label classification loss function.
在本申请实施例中，根据预设置的第一准确率阈值，例如90%，各预测分支进行损失函数的计算并获得总的损失值，使用第一准确率阈值判断所述损失值，直到满足第一准确率阈值为止；同理，根据预设置的第二准确率阈值，对多标签分类损失函数计算的损失值进行判断，直到满足第二准确率阈值为止。In the embodiment of the present application, according to a preset first accuracy threshold, for example 90%, each prediction branch calculates its loss function and a total loss value is obtained; the loss value is evaluated against the first accuracy threshold until the first accuracy threshold is met. Similarly, according to a preset second accuracy threshold, the loss value calculated by the multi-label classification loss function is evaluated until the second accuracy threshold is met.
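The threshold-driven adjustment described here amounts to a loop that keeps updating parameters until the measured accuracy reaches the preset threshold. A toy sketch with simulated stand-ins for training and evaluation (the `evaluate`/`adjust` callables and the 0.05 increment are purely illustrative, not the actual optimization procedure):

```python
def train_until_threshold(evaluate, adjust, threshold=0.90, max_steps=1000):
    """Repeat parameter adjustment until accuracy meets the preset threshold,
    with a step cap as a safety valve against non-converging training."""
    steps = 0
    while evaluate() < threshold and steps < max_steps:
        adjust()
        steps += 1
    return evaluate(), steps

# Simulated stand-ins: accuracy starts at 0.60 and rises 0.05 per adjustment.
state = {"acc": 0.60}
accuracy, steps = train_until_threshold(
    evaluate=lambda: state["acc"],
    adjust=lambda: state.update(acc=state["acc"] + 0.05),
)
```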
在本申请实施例中，选用开源的MS COCO数据集做为训练集，选用Cityscapes数据集中带有行人的精细标注的样本材料作为第一测试集，选用从一个已运行一年的安防系统备份视频资料中手工整理和标注的样本素材做为第二测试集。In the embodiment of the present application, the open-source MS COCO data set is selected as the training set, the finely annotated sample material containing pedestrians in the Cityscapes data set is selected as the first test set, and sample material manually organized and labeled from the backup video footage of a security system that has been running for one year is selected as the second test set.
所述MS COCO数据集，预置了80种不同的物体，非常适合训练行人检测模型的情况，能够有效的区分样本素材中行人和其他相关物体，如车、猫、狗、树木和标识牌等。另外，由于卷积神经网络用训练集训练的模型对目标分辨率是敏感的，特别是针对行人检测问题，当使用一个数据集训练的模型去测试另一个数据集的行人时，会出现目标召回率低的问题，并且恰好所述MS COCO数据集的图像分辨率并不一致，为提高模型的准确度，把输入图像统一处理为1024×1024分辨率，在保证样本图像原始纵横比的前提下，对于其他部分进行补0处理。The MS COCO data set is preset with 80 different object categories, which makes it well suited to training pedestrian detection models and enables effective discrimination between pedestrians and other related objects in the sample material, such as cars, cats, dogs, trees and signs. In addition, a model trained by a convolutional neural network on a training set is sensitive to target resolution; in particular, for pedestrian detection, a low target recall rate occurs when a model trained on one data set is used to test pedestrians in another data set. Moreover, the image resolutions in the MS COCO data set are inconsistent. To improve the accuracy of the model, the input images are therefore uniformly processed to a 1024×1024 resolution; on the premise of preserving the original aspect ratio of each sample image, the remaining area is zero-padded.
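The 1024×1024 preprocessing described above — scale so the longer side becomes 1024, keep the aspect ratio, and zero-pad the rest — can be sketched as a small geometry calculation (the function name and the right/bottom padding convention are assumptions for illustration):

```python
def letterbox_size(width, height, target=1024):
    """Compute the aspect-ratio-preserving resize and the zero padding
    needed to fit an image into a target x target square."""
    scale = target / max(width, height)          # longer side becomes `target`
    new_w, new_h = round(width * scale), round(height * scale)
    pad_right, pad_bottom = target - new_w, target - new_h  # filled with zeros
    return (new_w, new_h), (pad_right, pad_bottom)

# A 2048x1024 surveillance frame is halved to 1024x512, then padded below.
resized, padding = letterbox_size(2048, 1024)
```

The actual pixel resampling and padding would then be done by any image library; only the size arithmetic is shown here.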
所述Cityscapes数据集包含5000张带有精细标注的样本图片，其中并非所有图像中都包含行人，因此本申请实施例中，针对行人检测筛选出带有行人的2900张图像做为所述第一测试集进行测试。The Cityscapes data set contains 5,000 finely annotated sample images, not all of which contain pedestrians. Therefore, in the embodiment of the present application, 2,900 images containing pedestrians are screened out for pedestrian detection and used as the first test set for testing.
所述第二测试集是从真实的安防系统的备份视频资料中获取，并且统一处理为1024×1024分辨率，由相关技术人员手工标注相关信息，总共500张做为第二测试集进行测试。在一个可选的实施例中，也可以将所述第一测试集和所述第二测试集混合成第三测试集，在此不再赘述。The second test set is obtained from the backup video footage of a real security system and uniformly processed to a 1024×1024 resolution, with the relevant information manually annotated by technical personnel; a total of 500 images are used as the second test set for testing. In an optional embodiment, the first test set and the second test set may also be mixed into a third test set, which will not be described again here.
在本申请实施例中，所述目标属性识别模型包括掩码预测分支、回归预测分支和分类预测分支，以及多标签分类损失函数，所述使用已标注的多个样本识别图像对目标属性识别模型进行目标识别训练进一步包括：In the embodiment of the present application, the target attribute recognition model includes a mask prediction branch, a regression prediction branch and a classification prediction branch, as well as a multi-label classification loss function. Using the labeled sample recognition images to perform target recognition training on the target attribute recognition model further includes:
根据预设置的第一准确率阈值，所述掩码预测分支选用交叉熵损失函数、所述回归预测分支选用smooth L1损失函数，所述分类预测分支选用交叉熵损失函数，并设置第一准确率阈值，以便计算和调整模型参数。在本申请实施例中设定第一准确率阈值为90%；另一方面，在所述多标签分类中，选用交叉熵损失函数，在本申请实施例中设定第二准确率阈值为90%，据此对所述属性识别模型进行模型参数调整。According to the preset first accuracy threshold, the mask prediction branch uses a cross-entropy loss function, the regression prediction branch uses a smooth L1 loss function, and the classification prediction branch uses a cross-entropy loss function; the first accuracy threshold is set so that the model parameters can be calculated and adjusted. In the embodiment of the present application, the first accuracy threshold is set to 90%. On the other hand, for the multi-label classification, a cross-entropy loss function is selected; in the embodiment of the present application, the second accuracy threshold is set to 90%, according to which the model parameters of the attribute recognition model are adjusted.
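The loss functions named above — cross-entropy for the mask and classification branches, smooth L1 for box regression, and cross-entropy over independent labels for the multi-label attribute head — can be sketched in scalar form as follows. The toy input values and the unweighted summing of branch losses are assumptions; the text does not specify how the branches are weighted:

```python
import math

def cross_entropy(p_true_class):
    """Cross-entropy for the probability assigned to the true class."""
    return -math.log(p_true_class)

def smooth_l1(diff, beta=1.0):
    """Smooth L1: quadratic near zero, linear for large errors,
    making box regression robust to outliers."""
    return 0.5 * diff * diff / beta if abs(diff) < beta else abs(diff) - 0.5 * beta

def multilabel_bce(probs, targets):
    """Binary cross-entropy summed over independent attribute labels."""
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for p, t in zip(probs, targets)
    )

cls_loss = cross_entropy(0.8)                # classification / mask branch term
box_loss = smooth_l1(0.5) + smooth_l1(2.0)   # two regression residuals
attr_loss = multilabel_bce([0.9, 0.2], [1, 0])
```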
本申请实施例根据行人属性识别的具体需求，在数据集的选取、神经网络架构和损失函数的选取等方面做出针对性的设计，运用所述模型训练方法，能够训练出高效且稳定的模型，从而实现基于分割算法的目标属性识别、或者行人属性识别。需要说明的是，本申请对模型训练的其他细节信息，例如初始参数的选择、GPU硬件的选择等不作具体限定，本领域技术人员应当根据实际应用需求进行选择即可，在此不再赘述。According to the specific requirements of pedestrian attribute recognition, the embodiments of the present application make targeted designs in the selection of data sets, neural network architecture and loss functions. Using the described model training method, an efficient and stable model can be trained, thereby realizing target attribute recognition, or pedestrian attribute recognition, based on a segmentation algorithm. It should be noted that the present application does not specifically limit other details of model training, such as the selection of initial parameters or GPU hardware; those skilled in the art may make such selections according to actual application requirements, and they will not be described again here.
相应地,本申请还提供一种目标属性识别装置700,如图6所示包括:Correspondingly, this application also provides a target attribute identification device 700, which includes:
目标掩码获取单元701,用于对接收的待识别图像进行目标识别,并输出目标掩码,所述目标掩码为基于分割算法进行像素空间对齐获得的;The target mask acquisition unit 701 is used to perform target recognition on the received image to be recognized, and output a target mask, which is obtained by pixel space alignment based on a segmentation algorithm;
目标掩码图像获取单元702,用于根据所述目标掩码对所述待识别图像进行掩码操作并获取目标掩码图像;The target mask image acquisition unit 702 is configured to perform a mask operation on the image to be recognized according to the target mask and acquire the target mask image;
目标属性识别单元703,用于对所述目标掩码图像进行目标属性识别,并输出待识别图像的目标的属性,所述属性包括所述目标的多标签属性。The target attribute recognition unit 703 is configured to perform target attribute recognition on the target mask image, and output attributes of the target in the image to be recognized, where the attributes include multi-label attributes of the target.
前述实施方式也适用于本申请实施例提供的目标属性识别装置,在本申请实施例中不再详细描述。前述实施例和随之带来的有益效果同样适用于本申请实施例,因此,相同的部分不再赘述。The foregoing embodiments are also applicable to the target attribute identification device provided in the embodiments of the present application, and will not be described in detail in the embodiments of the present application. The aforementioned embodiments and the accompanying beneficial effects are also applicable to the embodiments of the present application, and therefore the same parts will not be described again.
相应地,本申请还提供一种行人属性识别装置800,如图7所示包括:Correspondingly, this application also provides a pedestrian attribute recognition device 800, which includes:
行人掩码获取单元801,用于对接收的待识别图像进行行人识别,并输出行人掩码,所述行人掩码为基于分割算法进行像素空间对齐获得的;The pedestrian mask acquisition unit 801 is used to perform pedestrian recognition on the received image to be recognized, and output a pedestrian mask, which is obtained by pixel space alignment based on a segmentation algorithm;
行人掩码图像获取单元802,用于根据所述行人掩码对所述待识别图像进行掩码操作并获取行人掩码图像;A pedestrian mask image acquisition unit 802, configured to perform a mask operation on the image to be recognized according to the pedestrian mask and obtain a pedestrian mask image;
行人属性识别单元803，用于对所述行人掩码图像进行行人属性识别，并输出待识别图像的行人的属性，所述属性包括所述行人的多标签属性。The pedestrian attribute recognition unit 803 is configured to perform pedestrian attribute recognition on the pedestrian mask image and output the attributes of the pedestrian in the image to be recognized, where the attributes include multi-label attributes of the pedestrian.
前述实施方式也适用于本申请实施例提供的行人属性识别装置,在本申请实施例中不再详细描述。前述实施例和随之带来的有益效果同样适用于本申请实施例,因此,相同的部分不再赘述。The foregoing embodiments are also applicable to the pedestrian attribute recognition device provided in the embodiments of the present application, and will not be described in detail in the embodiments of the present application. The aforementioned embodiments and the accompanying beneficial effects are also applicable to the embodiments of the present application, and therefore the same parts will not be described again.
相应地,本申请还提供一种模型训练装置900,如图8所示包括:Correspondingly, this application also provides a model training device 900, which includes:
标注单元901,用于获取多个样本识别图像,并对各样本识别图像的目标按照像素空间对齐方式进行标注;Annotation unit 901 is used to obtain multiple sample identification images and annotate the targets of each sample identification image according to the pixel space alignment;
训练单元902,用于使用已标注的多个样本识别图像对目标属性识别模型进行目标识别训练。The training unit 902 is used to perform target recognition training on the target attribute recognition model using multiple labeled sample recognition images.
前述实施方式也适用于本申请实施例提供的模型训练装置,在本申请实施例中不再详细描述。前述实施例和随之带来的有益效果同样适用于本申请实施例,因此,相同的部分不再赘述。The foregoing embodiments are also applicable to the model training device provided in the embodiments of the present application, and will not be described in detail in the embodiments of the present application. The aforementioned embodiments and the accompanying beneficial effects are also applicable to the embodiments of the present application, and therefore the same parts will not be described again.
本申请的另一个实施例提供了一种计算机可读存储介质，其上存储有计算机程序，该程序被处理器执行时实现：所述基于分割算法的目标属性识别方法，或者基于分割算法的行人属性识别方法，或者模型训练方法。Another embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the target attribute recognition method based on a segmentation algorithm, or the pedestrian attribute recognition method based on a segmentation algorithm, or the model training method.
在实际应用中，所述计算机可读存储介质可以采用一个或多个计算机可读的介质的任意组合。计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质。计算机可读存储介质例如可以是但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件，或者任意以上的组合。计算机可读存储介质的更具体的例子（非穷举的列表）包括：具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本申请实施例中，计算机可读存储介质可以是任何包含或存储程序的有形介质，该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。In practical applications, the computer-readable storage medium may be any combination of one or more computer-readable media. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more conductors, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the embodiments of the present application, the computer-readable storage medium may be any tangible medium containing or storing a program, which may be used by or in combination with an instruction execution system, apparatus or device.
计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号，其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式，包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质，该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device.
计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于无线、电线、光缆、RF等等,或者上述的任意合适的组合。Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the foregoing.
可以以一种或多种程序设计语言或其组合来编写用于执行本申请操作的计算机程序代码，所述程序设计语言包括面向对象的程序设计语言—诸如Java、Smalltalk、C++，还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中，远程计算机可以通过任意种类的网络——包括局域网(LAN)或广域网(WAN)—连接到用户计算机，或者，可以连接到外部计算机（例如利用因特网服务提供商来通过因特网连接）。Computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
如图9所示,本申请的另一个实施例提供的一种计算机设备的结构示意图。图9显示的计算机设备12仅仅是一个示例,不应对本申请实施例的功能和使用范围带来任何限制。As shown in Figure 9, another embodiment of the present application provides a schematic structural diagram of a computer device. The computer device 12 shown in FIG. 9 is only an example and should not bring any limitations to the functions and scope of use of the embodiments of the present application.
如图9所示,计算机设备12以通用计算设备的形式表现。计算机设备12的组件可以包括但不限于:一个或者多个处理器或者处理单元16,系统存储器28,连接不同系统组件(包括系统存储器28和处理单元16)的总线18。As shown in Figure 9, computer device 12 is embodied in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, system memory 28, and a bus 18 connecting various system components, including system memory 28 and processing unit 16.
总线18表示几类总线结构中的一种或多种，包括存储器总线或者存储器控制器，外围总线，图形加速端口，处理器或者使用多种总线结构中的任意总线结构的局域总线。举例来说，这些体系结构包括但不限于工业标准体系结构(ISA)总线，微通道体系结构(MAC)总线，增强型ISA总线、视频电子标准协会(VESA)局域总线以及外围组件互连(PCI)总线。The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a graphics accelerated port, a processor, or a local bus using any of a variety of bus structures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MAC) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus and the Peripheral Component Interconnect (PCI) bus.
计算机设备12典型地包括多种计算机系统可读介质。这些介质可以是任何能够被计算机设备12访问的可用介质,包括易失性和非易失性介质,可移动的和不可移动的介质。Computer device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by computer device 12, including volatile and nonvolatile media, removable and non-removable media.
系统存储器28可以包括易失性存储器形式的计算机系统可读介质，例如随机存取存储器(RAM)30和/或高速缓存存储器32。计算机设备12可以进一步包括其它可移动/不可移动的、易失性/非易失性计算机系统存储介质。仅作为举例，存储系统34可以用于读写不可移动的、非易失性磁介质（图9未显示，通常称为“硬盘驱动器”）。尽管图9中未示出，可以提供用于对可移动非易失性磁盘（例如“软盘”）读写的磁盘驱动器，以及对可移动非易失性光盘（例如CD-ROM，DVD-ROM或者其它光介质）读写的光盘驱动器。在这些情况下，每个驱动器可以通过一个或者多个数据介质接口与总线18相连。存储器28可以包括至少一个程序产品，该程序产品具有一组（例如至少一个）程序模块，这些程序模块被配置以执行本申请各实施例的功能。The system memory 28 may include a computer system readable medium in the form of volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 may be used to read and write non-removable, non-volatile magnetic media (not shown in Figure 9, commonly referred to as a "hard drive"). Although not shown in Figure 9, a disk drive for reading and writing a removable non-volatile magnetic disk (such as a "floppy disk"), as well as an optical disk drive for reading and writing a removable non-volatile optical disk (such as a CD-ROM, DVD-ROM or other optical media), may be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set (for example, at least one) of program modules configured to perform the functions of the embodiments of the present application.
具有一组（至少一个）程序模块42的程序/实用工具40，可以存储在例如存储器28中，这样的程序模块42包括但不限于操作系统、一个或者多个应用程序、其它程序模块以及程序数据，这些示例中的每一个或某种组合中可能包括网络环境的实现。程序模块42通常执行本申请所描述的实施例中的功能和/或方法。A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory 28; such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules and program data, and each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods in the embodiments described in the present application.
计算机设备12也可以与一个或多个外部设备14（例如键盘、指向设备、显示器24等）通信，还可与一个或者多个使得用户能与该计算机设备12交互的设备通信，和/或与使得该计算机设备12能与一个或多个其它计算设备进行通信的任何设备（例如网卡，调制解调器等等）通信。这种通信可以通过输入/输出(I/O)接口22进行。并且，计算机设备12还可以通过网络适配器20与一个或者多个网络（例如局域网(LAN)，广域网(WAN)和/或公共网络，例如因特网）通信。如图9所示，网络适配器20通过总线18与计算机设备12的其它模块通信。应当明白，尽管图9中未示出，可以结合计算机设备12使用其它硬件和/或软件模块，包括但不限于：微代码、设备驱动器、冗余处理单元、外部磁盘驱动阵列、RAID系统、磁带驱动器以及数据备份存储系统等。The computer device 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (such as a network card, a modem, etc.) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Furthermore, the computer device 12 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 20. As shown in Figure 9, the network adapter 20 communicates with the other modules of the computer device 12 via the bus 18. It should be understood that, although not shown in Figure 9, other hardware and/or software modules may be used in conjunction with the computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and so on.
处理单元16通过运行存储在系统存储器28中的程序，从而执行各种功能应用以及数据处理，例如一种基于分割算法的目标属性识别方法，或者一种基于分割算法的行人属性识别方法，或者一种模型训练方法。The processing unit 16 executes the programs stored in the system memory 28, thereby performing various functional applications and data processing, for example, a target attribute recognition method based on a segmentation algorithm, or a pedestrian attribute recognition method based on a segmentation algorithm, or a model training method.
显然，本申请的上述实施例仅仅是为清楚地说明本申请所作的举例，而并非是对本申请的实施方式的限定，对于所属领域的普通技术人员来说，在上述说明的基础上还可以做出其它不同形式的变化或变动，这里无法对所有的实施方式予以穷举，凡是属于本申请的技术方案所引伸出的显而易见的变化或变动仍处于本申请的保护范围之列。Obviously, the above embodiments of the present application are merely examples given to clearly illustrate the present application, and are not intended to limit the implementations of the present application. For those of ordinary skill in the art, other changes or modifications in different forms may be made on the basis of the above description; it is impossible to exhaustively list all implementations here, and any obvious change or modification derived from the technical solutions of the present application remains within the protection scope of the present application.

Claims (13)

  1. 一种目标属性识别方法,所述方法包括:A target attribute identification method, the method includes:
    使用预设置的目标属性识别模型对接收的待识别图像进行目标识别,并输出目标掩码,所述目标掩码为基于分割算法进行像素空间对齐获得的;Use a preset target attribute recognition model to perform target recognition on the received image to be recognized, and output a target mask, which is obtained by pixel space alignment based on a segmentation algorithm;
    根据所述目标掩码,使用所述目标属性识别模型对所述待识别图像进行掩码操作并获取目标掩码图像;According to the target mask, use the target attribute recognition model to perform a mask operation on the image to be recognized and obtain a target mask image;
    根据所述目标掩码图像,使用所述目标属性识别模型进行目标属性识别,并输出所述待识别图像中的目标的属性,所述属性包括所述目标的多标签属性。According to the target mask image, the target attribute recognition model is used to perform target attribute recognition, and attributes of the target in the image to be recognized are output, where the attributes include multi-label attributes of the target.
  2. 根据权利要求1所述的目标属性识别方法,所述使用预设置的目标属性识别模型对接收的待识别图像进行目标识别,并输出目标掩码进一步包括:The target attribute recognition method according to claim 1, wherein using a preset target attribute recognition model to perform target recognition on the received image to be recognized and outputting the target mask further includes:
    使用所述目标属性识别模型对所述待识别图像进行特征提取并输出第一特征图;Use the target attribute recognition model to perform feature extraction on the image to be recognized and output a first feature map;
    使用所述目标属性识别模型对所述第一特征图进行区域检测并输出多个区域筛选框;Use the target attribute recognition model to perform region detection on the first feature map and output multiple region filtering frames;
    使用所述目标属性识别模型对所述区域筛选框进行区域特征匹配并输出第二特征图,所述第二特征图为基于分割算法进行像素空间对齐获得的;Use the target attribute recognition model to perform regional feature matching on the regional filtering frame and output a second feature map, where the second feature map is obtained by pixel space alignment based on a segmentation algorithm;
    使用所述目标属性识别模型对所述第二特征图进行区域检测并输出目标掩码。The target attribute recognition model is used to perform area detection on the second feature map and output a target mask.
  3. 根据权利要求2所述的目标属性识别方法,所述目标属性识别模型包括特征提取网络、第一特征图金字塔网络和区域生成网络;The target attribute identification method according to claim 2, the target attribute identification model includes a feature extraction network, a first feature map pyramid network and a region generation network;
    所述使用所述目标属性识别模型对所述待识别图像进行特征提取并输出第一特征图进一步包括:The use of the target attribute recognition model to extract features from the image to be recognized and output the first feature map further includes:
    使用所述特征提取网络对所述待识别图像进行特征提取并输出多层特征原图;Use the feature extraction network to perform feature extraction on the image to be recognized and output a multi-layer feature original image;
    使用所述第一特征图金字塔网络,根据至少一层所述特征原图输出所述第一特征图; Use the first feature map pyramid network to output the first feature map according to at least one layer of the original feature map;
    所述使用所述目标属性识别模型对所述第一特征图进行区域检测并输出多个区域筛选框进一步包括：根据预设置的锚框，使用所述区域生成网络对所述第一特征图进行区域检测并输出多个区域筛选框。Using the target attribute recognition model to perform region detection on the first feature map and output a plurality of region filtering frames further includes: according to a preset anchor frame, using the region generation network to perform region detection on the first feature map and output a plurality of region filtering frames.
  4. 根据权利要求2所述的目标属性识别方法,所述目标属性识别模型包括掩码预测分支、回归预测分支和分类预测分支;The target attribute identification method according to claim 2, the target attribute identification model includes a mask prediction branch, a regression prediction branch and a classification prediction branch;
    使用所述目标属性识别模型对所述第二特征图进行区域检测并输出目标掩码进一步包括:Using the target attribute recognition model to perform region detection on the second feature map and outputting a target mask further includes:
    使用所述掩码预测分支对所述第二特征图进行区域预测并输出目标掩码;Use the mask prediction branch to perform region prediction on the second feature map and output a target mask;
    使用所述回归预测分支对所述第二特征图进行区域预测并输出目标框;Use the regression prediction branch to perform regional prediction on the second feature map and output a target frame;
    使用所述分类预测分支对所述第二特征图进行分类预测并输出目标分类。Use the classification prediction branch to perform classification prediction on the second feature map and output a target classification.
  5. 根据权利要求4所述的目标属性识别方法,The target attribute identification method according to claim 4,
    所述根据所述目标掩码，使用所述目标属性识别模型对所述待识别图像进行掩码操作并获取目标掩码图像进一步包括：将所述目标掩码与所述待识别图像进行乘法操作并获取目标掩码图像；According to the target mask, using the target attribute recognition model to perform a mask operation on the image to be recognized and obtain a target mask image further includes: performing a multiplication operation on the target mask and the image to be recognized to obtain the target mask image;
    所述根据所述目标掩码图像,使用所述目标属性识别模型进行目标属性识别,并输出待识别图像的目标的属性进一步包括:根据所述目标分类使用所述目标属性识别模型中对应的属性识别模型对所述目标掩码图像进行目标属性识别,并输出待识别图像的目标的属性,所述属性识别模型为多任务多标签分类模型。The step of using the target attribute recognition model to perform target attribute recognition according to the target mask image, and outputting the attributes of the target in the image to be recognized further includes: using the corresponding attributes in the target attribute recognition model according to the target classification. The recognition model performs target attribute recognition on the target mask image and outputs the attributes of the target in the image to be recognized. The attribute recognition model is a multi-task multi-label classification model.
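As a minimal sketch (not part of the claims), the multiplication-based masking operation of claim 5 is a pixel-wise product that zeroes out background pixels; the image and mask contents below are made up for the example:

```python
import numpy as np

# toy 2-channel image and a binary target mask from the mask branch
image = np.arange(2 * 4 * 4).reshape(2, 4, 4).astype(float)
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0  # pixels belonging to the target

# masking operation: element-wise multiplication, broadcast over channels;
# background pixels become zero, target pixels keep their values
masked = image * mask
```

The resulting target mask image is what the downstream attribute recognition model consumes, so it sees only target pixels.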
  6. The target attribute recognition method according to claim 4, wherein
    the performing, according to the target mask, a masking operation on the image to be recognized by using the target attribute recognition model and obtaining a target mask image further comprises: multiplying the output target box by the image to be recognized to obtain a target box mask image, and multiplying the target mask by the target box mask image to obtain the target mask image;
    the performing, according to the target mask image, target attribute recognition by using the target attribute recognition model and outputting the attributes of the target in the image to be recognized further comprises: performing, according to the target classification, target attribute recognition on the target mask image by using the corresponding attribute recognition model in the target attribute recognition model, and outputting the attributes of the target in the image to be recognized, wherein the attribute recognition model is a multi-task multi-label classification model.
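For illustration only (not part of the claims), claim 6 differs from claim 5 in applying two multiplications in sequence: first the coarse target box, then the pixel-aligned target mask. A toy sketch with made-up box and mask coordinates:

```python
import numpy as np

image = np.ones((4, 4))  # toy single-channel image

# step 1: turn the predicted target box into a rectangular mask and apply it
y1, y2, x1, x2 = 1, 3, 1, 3        # assumed box coordinates
box_mask = np.zeros_like(image)
box_mask[y1:y2, x1:x2] = 1.0
box_masked = image * box_mask      # target box mask image

# step 2: refine with the finer, pixel-aligned segmentation mask
target_mask = np.zeros_like(image)
target_mask[1:3, 2:3] = 1.0        # assumed mask-branch output
final = box_masked * target_mask   # target mask image
```

Only pixels inside both the box and the segmentation mask survive, which is the stated effect of the two multiplications.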
  7. The target attribute recognition method according to claim 3, wherein the feature extraction network is one of a VGG network, a GoogLeNet network, a ResNet network and a ResNeXt network.
  8. A pedestrian attribute recognition method based on a segmentation algorithm, the method comprising:
    performing pedestrian recognition on a received image to be recognized by using a preset pedestrian attribute recognition model, and outputting a pedestrian mask, wherein the pedestrian mask is obtained through pixel-space alignment based on a segmentation algorithm;
    performing, according to the pedestrian mask, a masking operation on the image to be recognized by using the pedestrian attribute recognition model and obtaining a pedestrian mask image;
    performing, according to the pedestrian mask image, pedestrian attribute recognition by using the pedestrian attribute recognition model, and outputting the attributes of the pedestrian in the image to be recognized, wherein the attributes include multi-label attributes of the pedestrian.
  9. The pedestrian attribute recognition method according to claim 8, wherein the multi-label attributes include at least three of a gender attribute, a headwear attribute, a hairstyle attribute, a clothing attribute, a clothing color attribute, an accessory attribute, an occlusion attribute, a truncation attribute and an orientation attribute.
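As an illustrative aside (not part of the claims), "multi-label" in claims 8-9 means each attribute is predicted independently, so one pedestrian can carry several attributes at once. A sketch with hypothetical label names and made-up logits:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical attribute labels for a pedestrian head; not from the application
labels = ["male", "hat", "long_hair", "backpack", "occluded"]
logits = np.array([2.0, -1.5, 0.3, 1.2, -3.0])  # made-up model outputs

# multi-label decision: an independent sigmoid per attribute, thresholded
# at 0.5 — unlike softmax, several labels can be active simultaneously
probs = sigmoid(logits)
predicted = [label for label, p in zip(labels, probs) if p > 0.5]
print(predicted)
```

Here three attributes fire at once, which a single-label (softmax) classifier could not express.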
  10. A model training method, the method comprising:
    obtaining a plurality of sample recognition images, and annotating the target in each sample recognition image in a pixel-space-aligned manner;
    performing target recognition training on a target attribute recognition model by using the annotated plurality of sample recognition images.
  11. The model training method according to claim 10, wherein the target attribute recognition model comprises a mask prediction branch, a regression prediction branch and a classification prediction branch, as well as a multi-label classification loss function, and
    the performing target recognition training on the target attribute recognition model by using the annotated plurality of sample recognition images further comprises:
    according to a preset first accuracy threshold, computing with respective preset loss functions for the mask prediction branch, the regression prediction branch and the classification prediction branch, and adjusting the model parameters;
    according to a preset second accuracy threshold, adjusting the model parameters of the target attribute recognition model through the multi-label classification loss function.
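For illustration only (not part of the claims), a common choice for the multi-label classification loss named in claim 11 is per-label binary cross-entropy; the application does not specify the exact loss, so the formula and values below are assumptions:

```python
import numpy as np

def multilabel_bce(logits, targets):
    """Binary cross-entropy averaged over labels: each attribute is treated
    as an independent binary classification problem."""
    p = 1.0 / (1.0 + np.exp(-logits))       # per-label probabilities
    eps = 1e-9                              # guard against log(0)
    return -np.mean(targets * np.log(p + eps)
                    + (1.0 - targets) * np.log(1.0 - p + eps))

logits = np.array([3.0, -2.0, 0.5])   # made-up predictions for 3 attributes
targets = np.array([1.0, 0.0, 1.0])   # ground-truth multi-label annotation
loss = multilabel_bce(logits, targets)
```

The loss decreases as each label's probability moves toward its 0/1 target, which is the gradient signal used to adjust the model parameters.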
  12. A computer-readable storage medium having a computer program stored thereon, wherein
    when the program is executed by a processor, the method according to any one of claims 1-7 is implemented;
    or
    when the program is executed by a processor, the method according to any one of claims 8-9 is implemented;
    or
    when the program is executed by a processor, the method according to claim 10 is implemented.
  13. A computer device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein
    when the processor executes the program, the method according to any one of claims 1-7 is implemented;
    or
    when the processor executes the program, the method according to any one of claims 8-9 is implemented;
    or
    when the processor executes the program, the method according to claim 10 is implemented.
PCT/CN2023/101952 2022-06-23 2023-06-21 Target attribute recognition method and apparatus, and model training method and apparatus WO2023246921A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210714705.4 2022-06-23
CN202210714705.4A CN115100469A (en) 2022-06-23 2022-06-23 Target attribute identification method, training method and device based on segmentation algorithm

Publications (1)

Publication Number Publication Date
WO2023246921A1 true WO2023246921A1 (en) 2023-12-28

Family

ID=83292086

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101952 WO2023246921A1 (en) 2022-06-23 2023-06-21 Target attribute recognition method and apparatus, and model training method and apparatus

Country Status (2)

Country Link
CN (1) CN115100469A (en)
WO (1) WO2023246921A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100469A (en) * 2022-06-23 2022-09-23 京东方科技集团股份有限公司 Target attribute identification method, training method and device based on segmentation algorithm

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109360633A (en) * 2018-09-04 2019-02-19 北京市商汤科技开发有限公司 Medical imaging processing method and processing device, processing equipment and storage medium
CN111598155A (en) * 2020-05-13 2020-08-28 北京工业大学 Fine-grained image weak supervision target positioning method based on deep learning
CN111950346A (en) * 2020-06-28 2020-11-17 中国电子科技网络信息安全有限公司 Pedestrian detection data expansion method based on generation type countermeasure network
CN114332586A (en) * 2021-12-23 2022-04-12 广州华多网络科技有限公司 Small target detection method and device, equipment, medium and product thereof
CN115100469A (en) * 2022-06-23 2022-09-23 京东方科技集团股份有限公司 Target attribute identification method, training method and device based on segmentation algorithm

Also Published As

Publication number Publication date
CN115100469A (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN109344701B (en) Kinect-based dynamic gesture recognition method
Li et al. Scale-aware fast R-CNN for pedestrian detection
Liao et al. Rotation-sensitive regression for oriented scene text detection
EP3961485A1 (en) Image processing method, apparatus and device, and storage medium
Lu et al. Gated and axis-concentrated localization network for remote sensing object detection
US20220051405A1 (en) Image processing method and apparatus, server, medical image processing device and storage medium
Chen et al. Adversarial occlusion-aware face detection
Wang et al. Small-object detection based on yolo and dense block via image super-resolution
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
WO2021103868A1 (en) Method for structuring pedestrian information, device, apparatus and storage medium
CN111563550B (en) Sperm morphology detection method and device based on image technology
CN110188766B (en) Image main target detection method and device based on convolutional neural network
Wang et al. S 3 d: scalable pedestrian detection via score scale surface discrimination
CN111985367A (en) Pedestrian re-recognition feature extraction method based on multi-scale feature fusion
WO2023246921A1 (en) Target attribute recognition method and apparatus, and model training method and apparatus
CN115797736B (en) Training method, device, equipment and medium for target detection model and target detection method, device, equipment and medium
Li et al. Lcnn: Low-level feature embedded cnn for salient object detection
Cheng et al. A direct regression scene text detector with position-sensitive segmentation
CN114821014A (en) Multi-mode and counterstudy-based multi-task target detection and identification method and device
CN113850136A (en) Yolov5 and BCNN-based vehicle orientation identification method and system
Liu et al. Multi-component fusion network for small object detection in remote sensing images
WO2019100348A1 (en) Image retrieval method and device, and image library generation method and device
Chen et al. Learning to locate for fine-grained image recognition
Yu et al. SignHRNet: Street-level traffic signs recognition with an attentive semi-anchoring guided high-resolution network
CN110020688B (en) Shielded pedestrian detection method based on deep learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23826566

Country of ref document: EP

Kind code of ref document: A1