CN112926427A - Target user dressing attribute identification method and device - Google Patents

Target user dressing attribute identification method and device

Info

Publication number
CN112926427A
CN112926427A
Authority
CN
China
Prior art keywords
network
target user
white
attribute value
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110187498.7A
Other languages
Chinese (zh)
Inventor
廖丹萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Smart Video Security Innovation Center Co Ltd
Original Assignee
Zhejiang Smart Video Security Innovation Center Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Smart Video Security Innovation Center Co Ltd filed Critical Zhejiang Smart Video Security Innovation Center Co Ltd
Priority to CN202110187498.7A priority Critical patent/CN112926427A/en
Publication of CN112926427A publication Critical patent/CN112926427A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for identifying the dressing attributes of a target user, which comprises the following steps: cropping a human body image corresponding to a human body region out of a monitoring image; inputting the human body image into a network model, in which a feature extraction network extracts head features and body features of the human body image, the head features are input into a first identification network and a second identification network, and the body features are input into a third identification network, wherein the first identification network identifies whether the target user wears a mask according to the head features, the second identification network identifies whether the target user wears a white hat according to the head features, and the third identification network identifies whether the target user wears a white gown according to the body features; and acquiring the mask identification result, white hat identification result and white gown identification result output by the network model. The dressing attributes of a target user (such as a delicatessen operator) are monitored intelligently from the real-time surveillance picture, so that supervision can be carried out continuously over long periods and labor costs are saved.

Description

Target user dressing attribute identification method and device
Technical Field
The invention relates to the technical field of image processing, in particular to a method and a device for identifying the dressing attribute of a target user.
Background
The safety of ready-to-eat food such as cooked food and bean products sold in farmers' markets has long been an important concern of governments and the public. Cooked food stores in farmers' markets are required to strictly follow the 'three-white' principle, that is, store operators must wear white gowns, white hats and white masks while working, so as to guarantee the hygiene and safety of the cooked food.
In order to standardize market operation and guarantee food safety, market supervision authorities generally send personnel to patrol and inspect delicatessens and penalize non-compliant stores. This approach is time-consuming and labor-intensive, provides supervision only at the moment of inspection, and cannot keep delicatessens under supervision over long periods.
Disclosure of Invention
To overcome the defects of the prior art, the present invention provides a method and a device for identifying the dressing attributes of a target user. This object is achieved by the following technical solution.
The invention provides a method for identifying the dressing attribute of a target user, which comprises the following steps:
cutting out a human body image corresponding to the human body area from the monitoring image;
inputting the human body image into a trained network model, extracting head characteristics and body characteristics of the human body image by a characteristic extraction network in the network model, inputting the head characteristics into a first identification network and a second identification network in the network model, and inputting the body characteristics into a third identification network in the network model, wherein the first identification network identifies whether a target user wears a mask according to the head characteristics, the second identification network identifies whether the target user wears a white hat according to the head characteristics, and the third identification network identifies whether the target user wears a white gown according to the body characteristics;
and acquiring a mask identification result, a white hat identification result and a white gown identification result which are output by the network model.
A second aspect of the present invention provides an apparatus for identifying a target user dressing attribute, the apparatus comprising:
the cutting module is used for cutting a human body image corresponding to the human body area from the monitoring image;
the recognition module is used for inputting the human body image into a trained network model, extracting head characteristics and body characteristics of the human body image by a characteristic extraction network in the network model, inputting the head characteristics into a first recognition network and a second recognition network in the network model, and inputting the body characteristics into a third recognition network in the network model, wherein the first recognition network recognizes whether a target user wears a mask according to the head characteristics, the second recognition network recognizes whether the target user wears a white hat according to the head characteristics, and the third recognition network recognizes whether the target user wears a white gown according to the body characteristics;
and the result acquisition module is used for acquiring the mask identification result, the white hat identification result and the white gown identification result which are output by the network model.
Based on the method and the device for identifying the target user dressing attribute in the first aspect and the second aspect, the method and the device have the following beneficial effects:
the dressing property of a target user (such as a delicatessen operator) is intelligently monitored by utilizing a real-time monitoring picture, so that the intelligent monitoring system can be used for continuously monitoring for a long time, and a large amount of labor cost is saved. In the process of dressing identification of a human body image cut out from a monitoring picture by using a network model, in order to obtain a more accurate identification result, the head characteristic and the body characteristic of the human body are distinguished and respectively identified, when a white hat and a mask are identified, only the head characteristic is used, when a white jacket is identified, only the body characteristic is used, and the interference of other region characteristics is avoided.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart illustrating an embodiment of a target user dressing attribute identification method in accordance with an illustrative embodiment of the present invention;
FIG. 2 is a schematic diagram of a network model architecture according to an exemplary embodiment of the present invention;
FIG. 3 is a flowchart illustrating an embodiment of a model training method according to an exemplary embodiment of the present invention;
fig. 4 is a schematic structural diagram illustrating a target user dressing attribute identification apparatus according to an exemplary embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
Fig. 1 is a flowchart illustrating an embodiment of a method for identifying the dressing attributes of a target user according to an exemplary embodiment of the present invention. The target user may be a delicatessen operator, or any other person whose dressing needs to be supervised. As shown in fig. 1, the method for identifying the dressing attributes of the target user includes the following steps:
step 101: and cutting out a human body image corresponding to the human body area from the monitoring image.
In some embodiments, a monitoring image of the target scene captured by a camera is acquired, the human body region in the monitoring image is detected, and the human body image corresponding to the human body region is cropped out of the monitoring image to serve as the recognition data source, thereby removing irrelevant background elements.
The target scene is a scene that needs to be supervised. For example, when the target scene is a delicatessen, the camera monitors the operator in the store in real time, so a single-frame monitoring image can be extracted from the surveillance video captured by the camera for human body detection; if a human body region is detected, it is cropped out of the frame to serve as the recognition data source.
It can be understood that any existing human body detection technology, such as a human body detection model or a human body detection algorithm, can be used to detect the human body region; the present invention places no particular limitation on this, as long as the human body detection function is realized.
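As an illustration of step 101, the following sketch shows how human regions could be detected and cropped from a single surveillance frame. The patent does not prescribe a particular detector; the COCO-pretrained Faster R-CNN from torchvision, the 0.7 score threshold and the helper name crop_human_images are assumptions made only for this example.

    import torch
    import torchvision
    from torchvision.transforms.functional import to_tensor
    from PIL import Image

    # Any existing human detector can be used here; for illustration a COCO-pretrained
    # Faster R-CNN from torchvision is assumed, in which category id 1 is "person".
    detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    detector.eval()

    def crop_human_images(frame: Image.Image, score_thresh: float = 0.7):
        """Detect human regions in one surveillance frame and crop them out."""
        with torch.no_grad():
            pred = detector([to_tensor(frame)])[0]
        crops = []
        for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
            if label.item() == 1 and score.item() >= score_thresh:
                x1, y1, x2, y2 = (int(v) for v in box.tolist())
                crops.append(frame.crop((x1, y1, x2, y2)))
        return crops
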
Step 102: inputting a human body image into a trained network model, extracting head characteristics and body characteristics of the human body image by a characteristic extraction network in the network model, inputting the head characteristics into a first identification network and a second identification network in the network model, and inputting the body characteristics into a third identification network in the network model, wherein the first identification network identifies whether a target user wears a mask according to the head characteristics, the second identification network identifies whether the target user wears a white hat according to the head characteristics, and the third identification network identifies whether the target user wears a white gown according to the body characteristics.
Fig. 2 shows the structure of the network model; the recognition process of the network model is described in detail below with reference to fig. 2:
1. Processing flow of the feature extraction network on the input human body image
First, the global body features of the human body image are extracted by a feature extraction module in the feature extraction network; the global body features are then split according to a preset body proportion distribution, and the head features and body features obtained by splitting are input into a global average pooling layer, which performs global average pooling on the head features and the body features respectively.
It should be noted that, because body proportions and postures vary considerably from person to person, separating the head from the body with a single fixed dividing line would make the part segmentation inaccurate. Therefore, this embodiment reserves a small overlap between the head features and the body features so as to preserve the integrity of the features of each relevant body part as much as possible; that is, the head feature and the body feature obtained by splitting share an overlapping region.
For example, assuming the output feature map of the feature extraction module is 7 × 7 × 2048, the upper 2 × 7 × 2048 portion of the output feature map is taken as the head feature and input into the global average pooling layer, which maps the 2 × 7 × 2048 feature to a 1 × 1 × 2048-dimensional head feature; the lower 6 × 7 × 2048 portion is taken as the body feature and input into the global average pooling layer, which maps the 6 × 7 × 2048 feature to a 1 × 1 × 2048-dimensional body feature.
It can be seen that the overlap between the head feature and the body feature is row 2 of the 7 × 7 feature map.
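A minimal sketch of this feature extraction network is given below, assuming PyTorch and a ResNet-50 backbone; the backbone choice, the 224 × 224 input size and the class name HeadBodyFeatureExtractor are illustrative assumptions, since the patent only requires a feature extraction module whose output map can be split with the 2-row/6-row overlap described above.

    import torch
    import torch.nn as nn
    import torchvision

    class HeadBodyFeatureExtractor(nn.Module):
        # Sketch of the feature extraction network: a CNN backbone whose 7x7x2048 output
        # map is split row-wise into an overlapping head part (upper 2 rows) and body part
        # (lower 6 rows, sharing one row with the head), each reduced to a 2048-d vector
        # by global average pooling. A ResNet-50 backbone is assumed purely for illustration.
        def __init__(self):
            super().__init__()
            resnet = torchvision.models.resnet50(weights="DEFAULT")
            # keep everything up to, but excluding, ResNet's own pooling and fc layers
            self.backbone = nn.Sequential(*list(resnet.children())[:-2])
            self.gap = nn.AdaptiveAvgPool2d(1)

        def forward(self, x):                      # x: (N, 3, 224, 224)
            fmap = self.backbone(x)                # (N, 2048, 7, 7)
            head = fmap[:, :, 0:2, :]              # upper 2 rows -> head region
            body = fmap[:, :, 1:7, :]              # lower 6 rows -> body region (row index 1 overlaps)
            head_feat = self.gap(head).flatten(1)  # (N, 2048)
            body_feat = self.gap(body).flatten(1)  # (N, 2048)
            return head_feat, body_feat
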
2. Flow of the first identification network identifying whether a mask is worn according to the head characteristics
The classification layer in the first identification network calculates, according to the head characteristics, a first attribute value for the target user wearing a mask, a second attribute value for the target user not wearing a mask and a third attribute value for the mask being unrecognizable, and outputs them to the softmax layer; the softmax layer converts the first attribute value, the second attribute value and the third attribute value into a probability distribution, which is taken as the mask identification result.
The softmax layer performs a normalization operation on the input attribute values to convert each attribute value into a probability, and the probabilities sum to 1; that is, the probability of wearing a mask, the probability of not wearing a mask and the probability of the mask being unrecognizable add up to 1.
It should be noted that when the target user faces the camera, whether a mask is worn can be determined from the monitoring image collected by the camera, whereas when the target user faces away from the camera this cannot be determined; the category "mask unrecognizable" is therefore added to the first identification network.
3. The second identification network identifies whether the target user wears a white hat according to the head characteristics:
and the classification layer in the second recognition network calculates a fourth attribute value of the target user wearing the white hat and a fifth attribute value of the target user not wearing the white hat according to the head characteristics and outputs the fourth attribute value and the fifth attribute value to the softmax layer in the second recognition network, and the softmax layer converts the fourth attribute value and the fifth attribute value into probability distribution and takes the probability distribution as a white hat recognition result.
Based on the same principle, the softmax layer performs a normalization operation on the input attribute values to convert each attribute value into a probability, and the probabilities sum to 1; that is, the probability of wearing a white hat and the probability of not wearing a white hat add up to 1.
4. The third identification network identifies whether the target user wears a white gown according to the body characteristics
The classification layer in the third identification network calculates a sixth attribute value of the target user wearing the white gown and a seventh attribute value of the target user not wearing the white gown according to the body characteristics and outputs the sixth attribute value and the seventh attribute value to the softmax layer in the third identification network, and the softmax layer converts the sixth attribute value and the seventh attribute value into a probability distribution and takes the probability distribution as the white gown identification result.
The sum of the probability of wearing the white gown and the probability of not wearing the white gown is 1.
Based on the above description, the final output of the network model is a dressing attribute probability vector of length 7, consisting of the probability of wearing a mask, the probability of not wearing a mask, the probability that the mask cannot be recognized, the probability of wearing a white hat, the probability of not wearing a white hat, the probability of wearing a white gown and the probability of not wearing a white gown.
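The sketch below assembles the three recognition heads on top of the feature extraction network from the earlier sketch. The layer sizes follow the description above (a 3-way mask head, a 2-way white-hat head and a 2-way white-gown head, each with its own softmax); the class name DressingAttributeModel and the ordering of the concatenated outputs, which is chosen here to match the label vector of step 301 below, are assumptions made for this example.

    import torch
    import torch.nn as nn

    class DressingAttributeModel(nn.Module):
        # Sketch of the overall network model of Fig. 2: the shared feature extraction
        # network (HeadBodyFeatureExtractor from the sketch above) feeds a 2-way white-hat
        # head and a 3-way mask head from the head feature, and a 2-way white-gown head
        # from the body feature; each head is a classification (fully connected) layer
        # followed by its own softmax, and the three probability groups are concatenated
        # into the length-7 output, assumed here to be ordered [white hat, no white hat,
        # mask, no mask, mask unrecognizable, white gown, no white gown].
        def __init__(self, feat_dim: int = 2048):
            super().__init__()
            self.features = HeadBodyFeatureExtractor()
            self.hat_head = nn.Linear(feat_dim, 2)   # wearing / not wearing a white hat
            self.mask_head = nn.Linear(feat_dim, 3)  # wearing / not wearing / unrecognizable
            self.gown_head = nn.Linear(feat_dim, 2)  # wearing / not wearing a white gown

        def forward(self, x):
            head_feat, body_feat = self.features(x)
            hat_prob = torch.softmax(self.hat_head(head_feat), dim=1)    # sums to 1
            mask_prob = torch.softmax(self.mask_head(head_feat), dim=1)  # sums to 1
            gown_prob = torch.softmax(self.gown_head(body_feat), dim=1)  # sums to 1
            return torch.cat([hat_prob, mask_prob, gown_prob], dim=1)    # (N, 7)
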
Step 103: acquiring a mask identification result, a white hat identification result and a white gown identification result which are output by the network model.
It should be noted that, for the mask-wearing type, the attribute corresponding to the maximum probability in the mask identification result may be taken as the classification attribute; for the white-hat-wearing type, the attribute corresponding to the maximum probability in the white hat identification result may be taken as the classification attribute; and for the white-gown-wearing type, the attribute corresponding to the maximum probability in the white gown identification result may be taken as the classification attribute.
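As a simple illustration of this per-group argmax, the helper below (a hypothetical function name, not part of the patent) decodes the length-7 probability vector produced by the model sketch above.

    def decode_attributes(probs):
        # probs: length-7 probability vector from the model sketch above, assumed to be
        # ordered [white hat, no white hat, mask, no mask, mask unrecognizable,
        # white gown, no white gown]; the highest-probability attribute in each group
        # is taken as the classification result.
        hat = ["wearing white hat", "not wearing white hat"][int(probs[0:2].argmax())]
        mask = ["wearing mask", "not wearing mask", "mask unrecognizable"][int(probs[2:5].argmax())]
        gown = ["wearing white gown", "not wearing white gown"][int(probs[5:7].argmax())]
        return hat, mask, gown
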
This completes the identification process shown in fig. 1. The dressing attributes of the target user (such as a delicatessen operator) are monitored intelligently from the real-time surveillance picture, so that monitoring can continue over long periods and a large amount of labor cost is saved. When the network model performs dressing recognition on a human body image cropped from the surveillance picture, the head features and the body features of the human body are separated and recognized respectively in order to obtain more accurate recognition results: only the head features are used when recognizing the white hat and the mask, and only the body features are used when recognizing the white gown, which avoids interference from the features of other regions.
Fig. 3 is a flowchart of an embodiment of a model training method according to an exemplary embodiment of the present invention, where the training method of the present embodiment is used for training the network model shown in fig. 2, and as shown in fig. 3, the model training method includes the following steps:
step 301: acquiring a plurality of training images containing target users, and establishing label vectors for each frame of training image, wherein the label vectors comprise 7 attribute components of a white hat, a non-white hat, a mask, a non-mask, an unrecognizable mask, a white gown and a non-white gown.
In this embodiment, when creating the label vector for a training image, 0 indicates that the training image does not have the corresponding attribute component and 1 indicates that it does. For example, when the operator in the training image wears a white gown, wears a white hat and does not wear a mask, the corresponding label vector is [1, 0, 0, 1, 0, 1, 0].
That is, within each type of dressing attribute exactly one component is 1 and the rest are 0: the operator can only be in one of the two states of wearing or not wearing a white gown, and the mask and white hat attributes are handled in the same way.
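A small sketch of this label construction follows; the helper name make_label and its string arguments are illustrative only.

    def make_label(hat: str, mask: str, gown: str):
        # Component order: [white hat, no white hat, mask, no mask, mask unrecognizable,
        # white gown, no white gown]; exactly one component per group is set to 1.
        label = [0] * 7
        label[0 + ["white hat", "no white hat"].index(hat)] = 1
        label[2 + ["mask", "no mask", "unrecognizable"].index(mask)] = 1
        label[5 + ["white gown", "no white gown"].index(gown)] = 1
        return label

    # e.g. an operator wearing a white gown and a white hat but no mask:
    # make_label("white hat", "no mask", "white gown")  ->  [1, 0, 0, 1, 0, 1, 0]
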
Step 302: training a pre-constructed network model by using the multi-frame training images, calculating, during training, the difference loss between the mask recognition result, white hat recognition result and white gown recognition result output by the network model and the corresponding label vectors, and optimizing the parameters of the network model according to the difference loss by a gradient descent method.
As described in step 102, the present invention performs the softmax normalization operation on each type of dressing attribute separately. Therefore, during training, the cross-entropy loss can be calculated between the softmax-normalized attribute components of each type of dressing attribute and the ground-truth label vector. The cross-entropy loss is defined as follows:
Loss = -∑_{i=1}^{7} y_i · log(z_i)
where y_i is the label value of the i-th attribute component, z_i is the probability value of the i-th attribute component predicted by the network model, and i ranges from 1 to 7.
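In code, under the same assumptions as the earlier sketches (PyTorch, model outputs already softmax-normalised per group), this loss could look like the following; the epsilon term is an added numerical safeguard, not something stated in the patent.

    import torch

    def dressing_attribute_loss(pred_probs, labels):
        # Cross-entropy between the length-7 softmax outputs z and the label vector y,
        # summed over the 7 attribute components:  Loss = -sum_i y_i * log(z_i)
        # pred_probs: (N, 7) probabilities, already softmax-normalised per attribute group
        # labels:     (N, 7) 0/1 label vectors built as in step 301
        eps = 1e-8                                            # numerical safety for log(0)
        return -(labels * torch.log(pred_probs + eps)).sum(dim=1).mean()
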
Specifically, for the network model structure constructed as shown in fig. 2, the hyper-parameters of network training, including the batch size, the learning rate and the like, may be set according to the size of the data set composed of the training images and whether a pre-trained model is used during training. The output error of the network model is calculated with the loss function so that the network output is as close to the label vector as possible, thereby minimizing the classification error and enabling the network to extract discriminative features from the image.
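Putting the pieces together, a minimal training loop might look as follows; the optimiser (SGD with momentum), learning rate, epoch count and loader interface are example choices, since the patent leaves these hyper-parameters to be set per data set.

    import torch

    # Example hyper-parameters only; the patent leaves batch size, learning rate and the
    # optimiser to be chosen according to the data set and whether a pre-trained model is used.
    model = DressingAttributeModel()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    def train(loader, epochs: int = 20):
        # loader yields (images, labels): images (N, 3, 224, 224) tensors, labels (N, 7)
        # tensors built with make_label from the sketch above.
        model.train()
        for _ in range(epochs):
            for images, labels in loader:
                optimizer.zero_grad()
                probs = model(images)
                loss = dressing_attribute_loss(probs, labels.float())
                loss.backward()          # gradient of the difference loss
                optimizer.step()         # gradient-descent parameter update
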
Thus, the training process shown in fig. 3 is completed; with the network model obtained through this training process, whether the target user in the monitoring image wears a white hat, wears a mask and wears a white gown can be accurately identified.
Corresponding to the embodiment of the target user dressing attribute identification method, the invention also provides an embodiment of a target user dressing attribute identification device.
Fig. 4 is a schematic structural diagram of an embodiment of a target user dressing attribute identification apparatus according to an exemplary embodiment of the present invention. As shown in fig. 4, the target user dressing attribute identification apparatus includes:
a cutting module 410, configured to cut a human body image corresponding to a human body region from the monitoring image;
the recognition module 420 is configured to input the human body image into a trained network model, extract head features and body features of the human body image through a feature extraction network in the network model, input the head features into a first recognition network and a second recognition network in the network model, and input the body features into a third recognition network in the network model, where the first recognition network recognizes whether a target user wears a mask according to the head features, the second recognition network recognizes whether the target user wears a white hat according to the head features, and the third recognition network recognizes whether the target user wears a white gown according to the body features;
and a result obtaining module 430, configured to obtain the mask recognition result, the white hat recognition result and the white gown recognition result output by the network model.
In an optional implementation manner, the cropping module 410 is specifically configured to obtain a monitoring image of a target scene acquired by a camera; detecting a human body region in the monitoring image; and cutting out the human body image corresponding to the human body area from the monitoring image.
In an optional implementation manner, the identification module 420 is specifically configured to, in the process of extracting the head feature and the body feature of the human body image by using a feature extraction module in a feature extraction network, extract the global body feature of the human body image by using the feature extraction module in the feature extraction network, segment the global body feature according to a preset body proportion distribution, and input the head feature and the body feature obtained by segmentation into a global average pooling layer in the feature extraction network; and the global average pooling layer is used for respectively carrying out global average pooling on the head features and the body features.
In an optional implementation manner, the identification module 420 is specifically configured to, in a process that a first identification network identifies whether a target user wears a mask according to head features, calculate, by a classification layer in the first identification network, a first attribute value of the mask worn by the target user, a second attribute value of the mask not worn by the target user, and a third attribute value of the mask which cannot be identified according to the head features, and output the first attribute value, the second attribute value, and the third attribute value to a softmax layer in the first identification network; and the softmax layer converts the first attribute value, the second attribute value and the third attribute value into probability distribution and takes the probability distribution as a mask identification result.
In an optional implementation manner, the identifying module 420 is specifically configured to, in the process that the second identifying network identifies whether the target user wears a white hat according to the head feature, calculate, by the classification layer in the second identifying network, a fourth attribute value of the target user wearing the white hat and a fifth attribute value of the target user not wearing the white hat according to the head feature, and output the fourth attribute value and the fifth attribute value to a softmax layer in the second identifying network; and the softmax layer converts the fourth attribute value and the fifth attribute value into probability distribution and takes the probability distribution as a white hat identification result.
In an optional implementation manner, the identification module 420 is specifically configured to, in a process that a third identification network identifies whether a target user wears a white gown according to body characteristics, calculate, by a classification layer in the third identification network, a sixth attribute value of the target user wearing a white gown and a seventh attribute value of the target user not wearing a white gown according to the body characteristics, and output the sixth attribute value and the seventh attribute value to a softmax layer in the third identification network; and the softmax layer converts the sixth attribute value and the seventh attribute value into probability distribution and takes the probability distribution as a recognition result of the white gown.
In an alternative implementation, the apparatus further comprises (not shown in fig. 4):
the model training module is used for acquiring a plurality of training images containing a target user and establishing a label vector for each training image, wherein the label vector comprises 7 attribute components of a white hat, a non-white hat, a mask, a non-mask, an unrecognizable mask, a white coat and a non-white coat; training a pre-constructed network model by using the multi-frame training image; in the training process, the mask recognition result, the white hat recognition result and the white coat recognition result output by the network model and the corresponding label vectors are used for calculating the difference loss, and the parameters of the network model are optimized according to the difference loss by adopting a gradient descent method.
In an optional implementation manner, the model training module is specifically configured to, in a process of obtaining multiple frames of training images including a target user, extract frames from a surveillance video of a target scene acquired by a camera to obtain multiple frames of surveillance images; and detecting a human body area in each monitoring image, and cutting out a human body image corresponding to the human body area from the monitoring image as a training image containing a target user.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for identifying target user dressing attributes, the method comprising:
cutting out a human body image corresponding to the human body area from the monitoring image;
inputting the human body image into a trained network model, extracting head characteristics and body characteristics of the human body image by a characteristic extraction network in the network model, inputting the head characteristics into a first identification network and a second identification network in the network model, and inputting the body characteristics into a third identification network in the network model, wherein the first identification network identifies whether a target user wears a mask according to the head characteristics, the second identification network identifies whether the target user wears a white hat according to the head characteristics, and the third identification network identifies whether the target user wears a white gown according to the body characteristics;
and acquiring a mask identification result, a white hat identification result and a white gown identification result which are output by the network model.
2. The method according to claim 1, wherein the cropping of the human body image corresponding to the human body region from the monitoring image comprises:
acquiring a monitoring image of a target scene acquired by a camera;
detecting a human body region in the monitoring image;
and cutting out the human body image corresponding to the human body area from the monitoring image.
3. The method of claim 1, wherein the feature extraction network extracts head features and body features of the human body image, and comprises:
extracting global body characteristics of the human body image through a characteristic extraction module in the characteristic extraction network, segmenting the global body characteristics according to preset body proportion distribution, and inputting head characteristics and body characteristics obtained through segmentation into a global average pooling layer in the characteristic extraction network;
and the global average pooling layer is used for respectively carrying out global average pooling on the head features and the body features.
4. The method of claim 1, wherein the first identification network identifies whether the mask is worn by the target user based on the head characteristics, comprising:
the classification layer in the first identification network calculates a first attribute value of a mask worn by a target user, a second attribute value of the mask not worn by the target user and a third attribute value of the mask which cannot be identified according to the head features and outputs the first attribute value, the second attribute value and the third attribute value to a softmax layer in the first identification network;
and the softmax layer converts the first attribute value, the second attribute value and the third attribute value into probability distribution and takes the probability distribution as a mask identification result.
5. The method of claim 1, wherein the second recognition network recognizes whether the target user wears a white hat according to head features, comprising:
the classification layer in the second recognition network calculates a fourth attribute value of the target user wearing a white hat and a fifth attribute value of the target user not wearing the white hat according to the head features and outputs the fourth attribute value and the fifth attribute value to the softmax layer in the second recognition network;
and the softmax layer converts the fourth attribute value and the fifth attribute value into probability distribution and takes the probability distribution as a white hat identification result.
6. The method of claim 1, wherein the third recognition network recognizing whether the target user wears a white gown based on body characteristics comprises:
the classification layer in the third recognition network calculates a sixth attribute value of the target user with the Chinese gown threaded and a seventh attribute value of the Chinese gown not threaded according to the body characteristics and outputs the sixth attribute value and the seventh attribute value to the softmax layer in the third recognition network;
and the softmax layer converts the sixth attribute value and the seventh attribute value into probability distribution and takes the probability distribution as a recognition result of the white gown.
7. The method of claim 1, wherein the training process of the network model comprises:
acquiring a plurality of training images containing a target user, and establishing a label vector for each training image, wherein the label vector comprises 7 attribute components of a white hat, a non-white hat, a mask, a non-mask, an unrecognizable mask, a white gown and a non-white gown;
training a pre-constructed network model by using the multi-frame training image;
in the training process, the mask recognition result, the white hat recognition result and the white gown recognition result output by the network model and the corresponding label vectors are used for calculating the difference loss, and the parameters of the network model are optimized according to the difference loss by adopting a gradient descent method.
8. The method of claim 7, wherein the obtaining a plurality of frames of training images containing a target user comprises:
extracting frames from a monitoring video of a target scene collected by a camera to obtain a plurality of frames of monitoring images;
and detecting a human body area in each monitoring image, and cutting out a human body image corresponding to the human body area from the monitoring image as a training image containing a target user.
9. An apparatus for identifying target user dressing attributes, the apparatus comprising:
the cutting module is used for cutting a human body image corresponding to the human body area from the monitoring image;
the recognition module is used for inputting the human body image into a trained network model, extracting head characteristics and body characteristics of the human body image by a characteristic extraction network in the network model, inputting the head characteristics into a first recognition network and a second recognition network in the network model, and inputting the body characteristics into a third recognition network in the network model, wherein the first recognition network recognizes whether a target user wears a mask according to the head characteristics, the second recognition network recognizes whether the target user wears a white hat according to the head characteristics, and the third recognition network recognizes whether the target user wears a white gown according to the body characteristics;
and the result acquisition module is used for acquiring the mask identification result, the white hat identification result and the white gown identification result which are output by the network model.
10. The apparatus of claim 9, wherein the apparatus comprises:
the model training module is used for acquiring a plurality of training images containing a target user and establishing a label vector for each training image, wherein the label vector comprises 7 attribute components of a white hat, a non-white hat, a mask, a non-mask, an unrecognizable mask, a white coat and a non-white coat; training a pre-constructed network model by using the multi-frame training image; in the training process, the mask recognition result, the white hat recognition result and the white coat recognition result output by the network model and the corresponding label vectors are used for calculating the difference loss, and the parameters of the network model are optimized according to the difference loss by adopting a gradient descent method.
CN202110187498.7A 2021-02-18 2021-02-18 Target user dressing attribute identification method and device Pending CN112926427A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110187498.7A CN112926427A (en) 2021-02-18 2021-02-18 Target user dressing attribute identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110187498.7A CN112926427A (en) 2021-02-18 2021-02-18 Target user dressing attribute identification method and device

Publications (1)

Publication Number Publication Date
CN112926427A true CN112926427A (en) 2021-06-08

Family

ID=76171498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110187498.7A Pending CN112926427A (en) 2021-02-18 2021-02-18 Target user dressing attribute identification method and device

Country Status (1)

Country Link
CN (1) CN112926427A (en)

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250874A (en) * 2016-08-16 2016-12-21 东方网力科技股份有限公司 A kind of dress ornament and the recognition methods of carry-on articles and device
US20200272902A1 (en) * 2017-09-04 2020-08-27 Huawei Technologies Co., Ltd. Pedestrian attribute identification and positioning method and convolutional neural network system
CN109614925A (en) * 2017-12-07 2019-04-12 深圳市商汤科技有限公司 Dress ornament attribute recognition approach and device, electronic equipment, storage medium
WO2020040391A1 (en) * 2018-08-24 2020-02-27 전북대학교산학협력단 Combined deep layer network-based system for pedestrian recognition and attribute extraction
CN109784140A (en) * 2018-11-19 2019-05-21 深圳市华尊科技股份有限公司 Driver attributes' recognition methods and Related product
CN109800665A (en) * 2018-12-28 2019-05-24 广州粤建三和软件股份有限公司 A kind of Human bodys' response method, system and storage medium
CN110188701A (en) * 2019-05-31 2019-08-30 上海媒智科技有限公司 Dress ornament recognition methods, system and terminal based on the prediction of human body key node
CN111062429A (en) * 2019-12-12 2020-04-24 上海点泽智能科技有限公司 Chef cap and mask wearing detection method based on deep learning
CN111414812A (en) * 2020-03-03 2020-07-14 平安科技(深圳)有限公司 Human body attribute identification method, system, computer device and storage medium
CN111639544A (en) * 2020-05-07 2020-09-08 齐齐哈尔大学 Expression recognition method based on multi-branch cross-connection convolutional neural network
CN111753795A (en) * 2020-06-30 2020-10-09 北京爱奇艺科技有限公司 Action recognition method and device, electronic equipment and storage medium
CN111860253A (en) * 2020-07-10 2020-10-30 东莞正扬电子机械有限公司 Multitask attribute identification method, multitask attribute identification device, multitask attribute identification medium and multitask attribute identification equipment for driving scene
CN112084913A (en) * 2020-08-15 2020-12-15 电子科技大学 End-to-end human body detection and attribute identification method
CN112149514A (en) * 2020-08-28 2020-12-29 中国地质大学(武汉) Method and system for detecting safety dressing of construction worker
CN112149512A (en) * 2020-08-28 2020-12-29 成都飞机工业(集团)有限责任公司 Helmet wearing identification method based on two-stage deep learning
CN112052819A (en) * 2020-09-15 2020-12-08 浙江智慧视频安防创新中心有限公司 Pedestrian re-identification method, device, equipment and storage medium
CN112016527A (en) * 2020-10-19 2020-12-01 成都大熊猫繁育研究基地 Panda behavior recognition method, system, terminal and medium based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郎波; 张娜; 段新新: "基于融合机制的多模型神经网络人物群体分类模型" (Multi-model neural network person group classification model based on a fusion mechanism), 计算机系统应用 (Computer Systems & Applications), no. 08, pages 127-134 *

Similar Documents

Publication Publication Date Title
CN108520226B (en) Pedestrian re-identification method based on body decomposition and significance detection
Rahmad et al. Comparison of Viola-Jones Haar Cascade classifier and histogram of oriented gradients (HOG) for face detection
CN111052126B (en) Pedestrian attribute identification and positioning method and convolutional neural network system
CN106778609A (en) A kind of electric power construction field personnel uniform wears recognition methods
Raut et al. Plant disease detection in image processing using MATLAB
CN109522853B (en) Face datection and searching method towards monitor video
US8379920B2 (en) Real-time clothing recognition in surveillance videos
Gowsikhaa et al. Suspicious Human Activity Detection from Surveillance Videos.
CN106128022B (en) A kind of wisdom gold eyeball identification violent action alarm method
CN108053427A (en) A kind of modified multi-object tracking method, system and device based on KCF and Kalman
WO2016190814A1 (en) Method and system for facial recognition
US20110142335A1 (en) Image Comparison System and Method
US8855363B2 (en) Efficient method for tracking people
CN106682578B (en) Weak light face recognition method based on blink detection
CN111814638B (en) Security scene flame detection method based on deep learning
CN109271884A (en) Face character recognition methods, device, terminal device and storage medium
CN108052859A (en) A kind of anomaly detection method, system and device based on cluster Optical-flow Feature
JP2017111660A (en) Video pattern learning device, method and program
CN109558810A (en) Divided based on position and merges target person recognition methods
US20100111375A1 (en) Method for Determining Atributes of Faces in Images
CN114937232B (en) Wearing detection method, system and equipment for medical waste treatment personnel protective appliance
CN110378179A (en) Subway based on infrared thermal imaging is stolen a ride behavioral value method and system
CN107085729B (en) Bayesian inference-based personnel detection result correction method
CN107533547B (en) Product indexing method and system
CN110443179A (en) It leaves the post detection method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination