CN115019349A - Image analysis method, image analysis device, electronic equipment and storage medium - Google Patents

Image analysis method, image analysis device, electronic equipment and storage medium

Info

Publication number
CN115019349A
CN115019349A (application CN202210947162.0A)
Authority
CN
China
Prior art keywords
task
human body
analysis
tasks
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210947162.0A
Other languages
Chinese (zh)
Other versions
CN115019349B (en)
Inventor
王金桥
陈盈盈
周鲁
赵朝阳
陈康扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Objecteye Beijing Technology Co Ltd
Original Assignee
Objecteye Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Objecteye Beijing Technology Co Ltd filed Critical Objecteye Beijing Technology Co Ltd
Priority to CN202210947162.0A
Publication of CN115019349A
Application granted
Publication of CN115019349B
Active legal status
Anticipated expiration legal status

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of computer vision, and provides an image analysis method, an image analysis device, an electronic device and a storage medium. The method comprises: determining a human body image to be analyzed; performing feature extraction on the human body image based on a feature extraction module in a multi-task human body analysis model, to obtain single-task features respectively corresponding to at least two single tasks; performing attention interaction on the single-task features respectively corresponding to the at least two single tasks based on a cross-task attention module in the multi-task human body analysis model, to obtain attention features respectively corresponding to the at least two single tasks; and performing the corresponding single-task analysis on the attention features respectively corresponding to the at least two single tasks based on a task analysis module in the multi-task human body analysis model, to obtain analysis results respectively corresponding to the at least two single tasks. The image analysis method, image analysis device, electronic device and storage medium provided by the embodiments of the invention apply the multi-task model to the human body analysis scenario.

Description

Image analysis method, image analysis device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer vision technologies, and in particular, to an image analysis method and apparatus, an electronic device, and a storage medium.
Background
Multi-task human body analysis is a popular research topic in computer vision and an important direction for vision-based pattern recognition applications. It must produce, for the person in an input image, analysis results under several tasks at once; because inference runs over all tasks simultaneously, it needs fewer parameters and less inference time than a set of single-task models. In conventional systems built from several single-task analysis models, however, the computation and parameters of each task are isolated from those of the others, which makes the overall system complex.
Existing deep-learning-based multi-task models are mostly applied in the field of search and advertisement recommendation; no attempt has been made to apply them to human body analysis scenarios.
Disclosure of Invention
The invention provides an image analysis method, an image analysis device, an electronic device and a storage medium, to remedy the defect in the prior art that deep-learning-based multi-task models have not been applied to human body analysis scenarios.
The invention provides an image analysis method, which comprises the following steps:
determining a human body image to be analyzed;
based on a feature extraction module in a multitask human body analysis model, performing feature extraction on the human body image to obtain single task features corresponding to at least two single tasks respectively;
performing attention interaction on the single task characteristics respectively corresponding to the at least two single tasks based on a cross-task attention module in the multi-task human body analysis model to obtain the attention characteristics respectively corresponding to the at least two single tasks;
based on a task analysis module in the multi-task human body analysis model, performing corresponding single-task analysis on attention features respectively corresponding to the at least two single tasks to obtain analysis results respectively corresponding to the at least two single tasks;
the multitask human body analysis model is obtained by training based on a sample human body image and the at least two single task labels corresponding to the sample human body image.
According to the image analysis method provided by the invention, the performing attention interaction on the single task features respectively corresponding to the at least two single tasks based on the cross-task attention module in the multi-task human body analysis model to obtain the attention features respectively corresponding to the at least two single tasks comprises:
performing feature splicing on the single task features respectively corresponding to the at least two single tasks based on a feature splicing module in the cross-task attention module to obtain splicing features;
and performing self-attention transformation on the spliced features based on a self-attention module in the cross-task attention module, and splitting transformation features obtained through the self-attention transformation to obtain attention features corresponding to the at least two single tasks respectively.
According to the image analysis method provided by the invention, the performing feature extraction on the human body image based on the feature extraction module in the multi-task human body analysis model, to obtain the single-task features respectively corresponding to the at least two single tasks, comprises:
based on a shared feature extraction module in the feature extraction module, carrying out shared feature extraction on the human body image to obtain shared features of the human body image;
and respectively extracting the features of the at least two single tasks from the shared features based on at least two single task feature extraction modules in the feature extraction modules to obtain the single task features respectively corresponding to the at least two single tasks.
According to the image analysis method provided by the invention, the multitask human body analysis model is obtained by training based on the following steps:
inputting the sample human body image into an initial multi-task student model to obtain student prediction results respectively corresponding to the at least two single tasks output by the initial multi-task student model;
respectively inputting the sample human body images into at least two single-task teacher models to obtain teacher prediction results respectively corresponding to the at least two single tasks output by each single-task teacher model;
and performing distillation training on the initial multitask student model based on the student prediction result and the teacher prediction result corresponding to the at least two single tasks respectively and the labels of the at least two single tasks corresponding to the sample human body image to obtain the multitask human body analysis model.
According to the image analysis method provided by the invention, the at least two single task labels corresponding to the sample human body image comprise a real label and a pseudo label, and the pseudo label is determined based on the at least two single task teacher models.
According to the image analysis method provided by the invention, the distillation training is performed on the initial multitask student model based on the student prediction result and the teacher prediction result respectively corresponding to the at least two single tasks and the at least two single task labels corresponding to the sample human body image to obtain the multitask human body analysis model, and the method comprises the following steps:
performing distillation training on the initial multi-task student model based on the student prediction result and the teacher prediction result corresponding to the at least two single tasks respectively and the at least two single task labels corresponding to the sample human body image to obtain a pre-training model for multi-task human body analysis;
inputting the sample human body image into the pre-training model to obtain prediction results respectively corresponding to the at least two single tasks output by the pre-training model;
and fine-tuning the pre-training model based on a loss determined from the prediction results and the real labels among the at least two single-task labels, together with penalty terms respectively corresponding to the at least two single tasks, to obtain the multi-task human body analysis model.
According to the image analysis method provided by the invention, the multi-task human body analysis comprises at least two of: human behavior recognition, clothing recognition, human posture estimation, and human body analysis.
The present invention also provides an image analysis apparatus comprising:
a human body image determining unit for determining a human body image to be analyzed;
the characteristic extraction unit is used for extracting the characteristics of the human body image based on a characteristic extraction module in the multi-task human body analysis model to obtain single task characteristics corresponding to at least two single tasks respectively;
the attention interaction unit is used for carrying out attention interaction on the single task characteristics respectively corresponding to the at least two single tasks based on a cross-task attention module in the multi-task human body analysis model to obtain the attention characteristics respectively corresponding to the at least two single tasks;
the task analysis unit is used for performing corresponding single-task analysis on the attention features respectively corresponding to the at least two single tasks based on a task analysis module in the multi-task human body analysis model to obtain analysis results respectively corresponding to the at least two single tasks;
the multi-task human body analysis model is obtained by training based on a sample human body image and at least two single-task labels corresponding to the sample human body image.
The present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the image analysis method as described in any of the above when executing the program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements an image analysis method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the image analysis method as described in any one of the above.
According to the image analysis method, image analysis device, electronic device and storage medium, a feature extraction module in the multi-task human body analysis model extracts features from the human body image, yielding a single-task feature for each single task; a cross-task attention module performs attention interaction on these single-task features, yielding an attention feature for each single task; and a task analysis module performs task analysis on each attention feature, yielding a human body analysis result for each single task. This applies the multi-task model to the human body analysis scenario; the cross-task attention module compensates for the limitations of single-task features and enhances the semantic information each single task attends to, thereby improving the accuracy, robustness and generalization of each single-task human body analysis.
Drawings
To illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of an image analysis method provided by the present invention;
FIG. 2 is a schematic diagram of a shared feature extraction process provided by the present invention;
FIG. 3 is a schematic diagram of a single task feature extraction process provided by the present invention;
FIG. 4 is a flow diagram of attention interaction provided by the present invention;
FIG. 5 is a schematic flow diagram of task analysis provided by the present invention;
FIG. 6 is a schematic flow chart of a fine tuning phase of the model training method provided by the present invention;
FIG. 7 is a schematic structural diagram of an image analysis apparatus provided in the present invention;
fig. 8 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Existing multi-task learning has not been attempted in human body analysis scenarios; it is mostly applied in search and advertisement recommendation, for example to determine the customer-service agent best suited to a user and the products that agent should recommend, thereby improving recommendation performance.
In addition, behavior recognition in existing human body analysis scenarios is mostly based on 3D features, which jointly exploit spatial and temporal information and suit video-level behavior recognition, but whose parameter count and computation cost exceed those of 2D-feature approaches, making lightweight deployment on edge devices difficult.
In view of this, an embodiment of the present invention provides an image analysis method that can be applied to multi-task human body analysis in surveillance scenarios with high recognition accuracy. Fig. 1 is a schematic flow chart of the image analysis method provided by the present invention; as shown in Fig. 1, the method includes:
step 110, determining a human body image to be analyzed.
Specifically, the human body image to be analyzed is an image containing a human body on which multi-task human body analysis is to be performed. The human body image can be captured by an image acquisition device; human body target detection can then be performed on the original captured image, and the detected human bounding box taken as the human body image to be analyzed. For example, a YOLOv5 detector can be used to locate the human body in the original image.
And 120, extracting the features of the human body image based on a feature extraction module in the multitask human body analysis model to obtain single task features respectively corresponding to at least two single tasks.
The multitask human body analysis model is obtained based on a sample human body image and at least two single task labels corresponding to the sample human body image.
Specifically, the multitask human body analysis model may be a deep learning model, and the human body image to be analyzed is input into the multitask human body analysis model to obtain an analysis result for each single task output by the multitask human body analysis model.
The multitask human body analysis model comprises a feature extraction module, wherein the feature extraction module is used for extracting features of an input human body image to obtain single task features corresponding to at least two single tasks respectively. It can be understood that the number of the obtained single task features is the same as that of the single tasks, and for each single task, a single task feature corresponding to the single task is obtained. For example, the single tasks requiring human body analysis include 4 tasks of human body behavior recognition, clothing recognition, human body posture estimation and human body analysis, and after feature extraction, human body behavior features, clothing features, human body posture features and human body analysis features are obtained respectively.
It should be noted that, before step 120 is executed, data enhancement may be applied to the human body image to be analyzed, including a series of schemes such as random rotation, random scaling, random cropping, random color jittering, normalization, and standardization.
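A minimal sketch of such a preprocessing pipeline, assuming a 224 × 224 random crop and ImageNet-style normalization statistics (the patent does not specify these values; only cropping and normalization are shown):

```python
import numpy as np

def preprocess(image, out_size=224, rng=None):
    """Toy data-enhancement sketch: random crop, then per-channel
    standardization. Crop size, mean and std are illustrative
    assumptions, not values from the patent."""
    if rng is None:
        rng = np.random.default_rng()
    h, w, _ = image.shape
    # Random crop to out_size x out_size (assumes the image is large enough).
    top = rng.integers(0, h - out_size + 1)
    left = rng.integers(0, w - out_size + 1)
    crop = image[top:top + out_size, left:left + out_size, :].astype(np.float64)
    # Scale to [0, 1], then standardize with assumed ImageNet statistics.
    crop /= 255.0
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    return (crop - mean) / std

img = np.zeros((256, 320, 3), dtype=np.uint8)  # stand-in for a detected human crop
out = preprocess(img)
print(out.shape)  # (224, 224, 3)
```

A real pipeline would add the remaining schemes (random rotation, scaling, color jittering) before the crop.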
Before step 120 is executed, the multitask human body analysis model may be trained in advance, and specifically, the model training may be performed by the following method:
firstly, collecting a large number of sample human body images and at least two single task labels corresponding to the sample human body images; each single task label can be a real label marked manually or a pseudo label marked by a model. And then, inputting the sample human body image into the initial model for supervised training, so that the multi-task human body analysis model obtained by training can better perform human body analysis aiming at each single task.
And step 130, performing attention interaction on the single-task features respectively corresponding to the at least two single tasks based on the cross-task attention module in the multi-task human body analysis model to obtain the attention features respectively corresponding to the at least two single tasks.
Specifically, once the single-task feature corresponding to each single task is obtained, attention interaction can be performed among the single-task features so that the tasks can learn from one another, yielding the attention feature corresponding to each single task. For example, the single-task features corresponding to the posture estimation and human body analysis tasks carry finer-grained semantic information, which helps improve the accuracy of the classification tasks (such as behavior recognition and clothing recognition). Conversely, the single-task features corresponding to the behavior recognition and clothing recognition tasks carry more global semantic information, which improves the accuracy of tasks that attend more to local pixel-level and region-level information (such as posture estimation and human body analysis).
The multi-task human body analysis model further comprises a cross-task attention module for performing attention interaction among the single-task features, so that features of different tasks can learn from and complement one another, greatly broadening the scenarios in which the model can be used.
Because each single task feature can learn the features related to the task of the user from other single task features, the attention feature corresponding to each single task obtained through the learning can further improve the accuracy of each single task analysis.
And 140, performing corresponding single-task analysis on the attention features respectively corresponding to the at least two single tasks based on a task analysis module in the multi-task human body analysis model to obtain analysis results respectively corresponding to the at least two single tasks.
Specifically, once the attention feature corresponding to each single task is obtained, the task analysis module in the multi-task human body analysis model can perform, on each single task's attention feature, the task analysis corresponding to that single task, yielding the analysis result corresponding to each single task.
Preferably, the task analysis module herein may include at least two single-task analysis modules, and each single-task analysis module performs task analysis for a corresponding single task.
For example, human behavior recognition results corresponding to the human behavior recognition task, such as behavior categories of making a call, smoking, falling down, etc., can be obtained; obtaining a clothing recognition result corresponding to the clothing recognition task, such as an upper clothing attribute category, a lower clothing attribute category or a head attribute category; obtaining a human body posture analysis result corresponding to the human body posture estimation task, such as a thermodynamic diagram of human body joint points; and obtaining human body analysis results corresponding to the human body analysis task, such as human body part types of the head, the trunk, the thighs or the calves.
According to the method provided by the embodiment of the invention, a feature extraction module in the multi-task human body analysis model extracts features from the human body image, yielding a single-task feature for each single task; a cross-task attention module performs attention interaction on these single-task features, yielding an attention feature for each single task; and a task analysis module performs task analysis on each attention feature, yielding a human body analysis result for each single task. This applies the multi-task model to the human body analysis scenario; the cross-task attention module compensates for the limitations of single-task features and enhances the semantic information each single task attends to, thereby improving the accuracy, robustness and generalization of each single-task human body analysis.
Based on any of the above embodiments, step 130 specifically includes:
131, performing feature splicing on the single task features respectively corresponding to the at least two single tasks based on a feature splicing module in the cross-task attention module to obtain splicing features;
and 132, performing self-attention transformation on the spliced features based on a self-attention module in the cross-task attention module, and splitting transformation features obtained through the self-attention transformation to obtain attention features corresponding to at least two single tasks respectively.
Specifically, performing attention interaction on the single-task features to obtain the attention feature corresponding to each single task can be achieved through feature splicing followed by feature transformation.
The cross-task attention module in the multi-task human body analysis model comprises a feature splicing module and a self-attention module, wherein the feature splicing module is used for performing feature splicing on single-task features respectively corresponding to at least two single tasks to obtain splicing features.
On this basis, the self-attention module performs self-attention transformation on the spliced feature; it also acts as a routing mechanism, routing more features from tasks highly correlated with the current task and fewer features from weakly correlated tasks. The transformed feature obtained by the self-attention transformation is then split to obtain the attention features respectively corresponding to the at least two single tasks.
In one embodiment, each single task's feature map has 256 channels and 7 × 7 pixels and can be reshaped into a (256 × 7 × 7)-dimensional vector. With 4 single tasks, the resulting spliced feature can be represented as a set of 4 vectors of dimension 256 × 7 × 7, i.e. shape (4, 256 × 7 × 7). The 4 vectors are weighted by inner-product similarity so that inter-task features are fused, and the transformed feature obtained by the attention transformation is again a (4, 256 × 7 × 7) set of vectors. The transformed feature is then split into the attention features respectively corresponding to the 4 single tasks and reshaped into 4 feature maps of shape (256, 7, 7).
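This concatenate, self-attend, split pattern can be sketched in NumPy as follows; the projection-free inner-product weighting below is a simplification of the embodiment's self-attention (a real module would use learned query/key/value projections), so it only illustrates the data flow:

```python
import numpy as np

def cross_task_attention(feats):
    """feats: (T, C, H, W) single-task feature maps, one per task.
    Flattens each map to a vector, weights the T vectors by
    softmax(inner-product similarity), and splits the fused result
    back into per-task maps."""
    t, c, h, w = feats.shape
    x = feats.reshape(t, c * h * w)                # spliced feature: (T, C*H*W)
    scores = x @ x.T / np.sqrt(c * h * w)          # (T, T) inner-product similarity
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over tasks
    fused = weights @ x                            # fuse inter-task features
    return fused.reshape(t, c, h, w)               # split back into T maps

feats = np.random.default_rng(0).normal(size=(4, 256, 7, 7))
out = cross_task_attention(feats)
print(out.shape)  # (4, 256, 7, 7)
```

The softmax rows act as the routing weights: a task highly correlated with the current task contributes more to its fused feature.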
Based on any of the above embodiments, step 120 specifically includes:
step 121, performing shared feature extraction on the human body image based on a shared feature extraction module in the feature extraction module to obtain shared features of the human body image;
and step 122, respectively performing feature extraction of at least two single tasks on the shared features based on at least two single task feature extraction modules in the feature extraction modules to obtain single task features respectively corresponding to the at least two single tasks.
Specifically, feature extraction on the human body image can be realized as shared feature extraction followed by single-task feature extraction. Correspondingly, the feature extraction module in the multi-task human body analysis model comprises a shared feature extraction module and at least two single-task feature extraction modules: the shared feature of the human body image is obtained through the shared feature extraction module, and the shared feature is then input into each single-task feature extraction module to obtain the corresponding single-task feature output by each.
Each single-task feature extraction module is independent, and the single-task feature extraction module which is suitable for the characteristics of each single task is designed according to the characteristics of each single task.
In one embodiment, behaviors in the behavior recognition task, such as a fall, must be judged with reference to contextual information from the environment, so a Non-Local module, which applies self-attention among all pixels in the spatial dimension, is used to model spatial global information. Because their task difficulty is lower, the clothing recognition and human body analysis tasks use only 1 lightweight BottleNeck module for feature extraction, while the posture estimation task, being considerably harder, uses 2 lightweight BottleNeck modules.
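A stripped-down sketch of the Non-Local idea, self-attention among all spatial positions, follows; the learned 1 × 1 projections of a real Non-Local block are omitted here for brevity, so this only illustrates the attention pattern:

```python
import numpy as np

def non_local(feat):
    """feat: (C, H, W). Self-attention among all H*W spatial positions,
    with a residual connection, so each pixel aggregates global spatial
    context (learned projections omitted)."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w).T          # (HW, C): one C-dim vector per pixel
    attn = x @ x.T / np.sqrt(c)           # (HW, HW) pairwise similarities
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)   # softmax over positions
    y = attn @ x                          # each pixel mixes in global context
    return feat + y.T.reshape(c, h, w)    # residual connection

feat = np.random.default_rng(1).normal(size=(256, 7, 7))
out = non_local(feat)
print(out.shape)  # (256, 7, 7)
```

This global mixing is what lets a pixel on a fallen person attend to floor context elsewhere in the crop.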
Based on any one of the above embodiments, taking 4 single tasks of human behavior recognition, clothing recognition, human posture estimation and human body analysis as examples, a multitask-based image analysis method is provided, and the method includes:
images under different tasks in different fields are collected, and the yolov5 detector model is used for identifying the position of the human body as the human body image to be analyzed.
And based on a shared feature extraction module in the multitask human body analysis model, carrying out shared feature extraction on the human body image to obtain shared features of the human body image.
The input human body image is first passed through a series of data enhancement schemes such as random rotation, random scaling, random cropping, random color jittering, normalization and standardization, yielding a 224 × 224-pixel human body image, which enters the shared feature extraction module shared by the multiple tasks.
The shared feature extraction module in the embodiment of the present invention adopts a ResNet18 architecture. It first applies 3 × 3 convolution, ReLU activation, BatchNorm and max-pooling operations to up-sample the 3-channel (RGB) image to 64 channels and spatially down-sample it to 56 × 56 pixels. The feature then passes through four BasicBlocks, where the first BasicBlock down-samples the feature map channels to 32 while the spatial size stays at 56 × 56 pixels, and each of the remaining 3 BasicBlocks up-samples the feature map channels by 2× and down-samples the spatial size by 2×. The channel dimension is thus up-sampled 8× from 32 and the spatial dimension down-sampled 8× from 56 × 56, giving a 7 × 7-pixel, 256-channel feature map, denoted (256, 7, 7), which serves as the shared feature of the human body image.
Fig. 2 is a schematic diagram of the shared feature extraction process provided by the present invention. As shown in fig. 2, the human body image first passes through convolutional layer 1 (conv1), convolutional layer 2 (conv2) and convolutional layer 3 (conv3), whose convolution kernels are all 3 x 3. Each convolutional layer is followed by a batch normalization layer (BatchNorm layer), which consists of a BatchNorm sub-layer and a scale sub-layer: the BatchNorm sub-layer normalizes the input to zero mean and unit variance, with the mean and variance computed from the input, while the scale sub-layer scales and translates the input, with parameters learned from the data. Batch normalization effectively eliminates internal covariate shift by normalizing the network input, accelerates network convergence, and also acts as a regularization mechanism that effectively prevents over-fitting. The BatchNorm layer is followed by an activation function layer, here ReLU, a commonly used and effective nonlinear activation function that helps avoid vanishing gradients.
The feature map size after convolution, BatchNorm and ReLU is 64 channels, 112 pixels long and 112 pixels wide, denoted (64, 112, 112). It then passes through a pooling layer (MaxPooling layer) with a kernel size of 3 x 3 and a stride of 2, giving the output feature map (64, 56, 56).
The feature map then passes through layer1, which comprises 2 BasicBlock layers, each of which extracts features; layer1 down-samples the number of channels of the feature map, giving the output feature (32, 56, 56). It then passes through layer2, layer3 and layer4, each of which also comprises 2 BasicBlock layers and, while extracting features, up-samples the number of channels by 2 times and down-samples the spatial side length by 2 times. These 3 layers together up-sample the channels 8 times and down-sample the side length 8 times, giving an output feature map of (256, 7, 7) as the shared feature of the human body image.
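The stages above can be sketched in PyTorch as follows; this is a minimal sketch, assuming standard ResNet-style BasicBlocks, and the `SharedExtractor`/`BasicBlock` names, stride placement and channel widths are illustrative rather than the patented implementation:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 conv-BN-ReLU layers with a residual connection (ResNet18-style)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:   # match shape for the residual add
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch))
    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))

class SharedExtractor(nn.Module):
    """Stem (3 convs + max pooling) then four BasicBlock stages, mapping a
    224x224 RGB image to a (256, 7, 7) shared feature map as in the text."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1, bias=False),  # conv1: 224 -> 112
            nn.BatchNorm2d(64), nn.ReLU(True),
            nn.Conv2d(64, 64, 3, padding=1, bias=False),           # conv2
            nn.BatchNorm2d(64), nn.ReLU(True),
            nn.Conv2d(64, 64, 3, padding=1, bias=False),           # conv3
            nn.BatchNorm2d(64), nn.ReLU(True),
            nn.MaxPool2d(3, stride=2, padding=1))                  # -> (64, 56, 56)
        self.layer1 = nn.Sequential(BasicBlock(64, 32), BasicBlock(32, 32))      # -> (32, 56, 56)
        self.layer2 = nn.Sequential(BasicBlock(32, 64, 2), BasicBlock(64, 64))   # -> (64, 28, 28)
        self.layer3 = nn.Sequential(BasicBlock(64, 128, 2), BasicBlock(128, 128))  # -> (128, 14, 14)
        self.layer4 = nn.Sequential(BasicBlock(128, 256, 2), BasicBlock(256, 256)) # -> (256, 7, 7)
    def forward(self, x):
        x = self.stem(x)
        return self.layer4(self.layer3(self.layer2(self.layer1(x))))

feat = SharedExtractor()(torch.randn(1, 3, 224, 224))
print(feat.shape)  # torch.Size([1, 256, 7, 7])
```

Each stride-2 stage halves the side length while doubling the channels, reproducing the 8x channel up-sampling and 8x spatial down-sampling described above.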
Fig. 3 is a schematic diagram of a single-task feature extraction process provided by the present invention, and as shown in fig. 3, 4 single-task feature extraction modules in a multi-task human body analysis model are used to respectively perform 4 single-task feature extraction on the shared features, so as to obtain 4 single-task features respectively corresponding to the single tasks. Wherein,
(1) The behavior recognition task features are obtained from the (256, 7, 7) feature map through a behavior recognition task feature extraction module. Specifically, the feature map may pass through 1 Non-Local module, which performs self-attention-based global fusion in space and adopts a BottleNeck pattern on the channel dimension (down-sampling first, then up-sampling), giving a (256, 7, 7) feature map that is then reshaped into a 256 x 7 x 7 dimensional vector.
(2) The clothing recognition task features are obtained from the (256, 7, 7) feature map through a clothing recognition task feature extraction module. Specifically, the feature map may pass through 1 BottleNeck module, in which the channels are first down-sampled and then up-sampled, outputting a (256, 7, 7) feature map that is then reshaped into a 256 x 7 x 7 dimensional vector.
(3) The human body posture estimation task obtains its task features from the (256, 7, 7) feature map through a human body posture estimation task feature extraction module. Specifically, the feature map may pass through 2 BottleNeck modules, in which the channels are first down-sampled and then up-sampled, outputting a (256, 7, 7) feature map that is then reshaped into a 256 x 7 x 7 dimensional vector.
(4) The human body analysis task obtains its task features from the (256, 7, 7) feature map through a human body analysis task feature extraction module. Specifically, the feature map may pass through 1 BottleNeck module, in which the channels are first down-sampled and then up-sampled, outputting a (256, 7, 7) feature map that is then reshaped into a 256 x 7 x 7 dimensional vector.
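The BottleNeck branches above can be sketched as follows; the channel reduction ratio and residual connection are assumptions, and the behavior branch's Non-Local module is omitted for brevity:

```python
import torch
import torch.nn as nn

class BottleNeck(nn.Module):
    """1x1 channel down-sample -> 3x3 conv -> 1x1 channel up-sample, with a
    residual connection; the spatial size (7x7) is unchanged."""
    def __init__(self, channels=256, reduction=4):
        super().__init__()
        mid = channels // reduction   # assumed reduction ratio
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(True),
            nn.Conv2d(mid, channels, 1, bias=False), nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(True)
    def forward(self, x):
        return self.relu(x + self.body(x))

# Per-task branches as described: 1 block each for clothing recognition and
# human body analysis, 2 blocks for the harder pose estimation task.
clothes_branch = BottleNeck()
parsing_branch = BottleNeck()
pose_branch = nn.Sequential(BottleNeck(), BottleNeck())

shared = torch.randn(1, 256, 7, 7)      # shared feature from the extractor
task_feat = pose_branch(shared)          # still (1, 256, 7, 7)
task_vec = task_feat.flatten(1)          # reshape into a 256*7*7 = 12544-d vector
print(task_vec.shape)  # torch.Size([1, 12544])
```

Flattening the (256, 7, 7) map gives the 256 x 7 x 7 dimensional vector that the cross-task attention module consumes.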
Fig. 4 is a schematic flow diagram of attention interaction provided by the present invention. As shown in fig. 4, the feature splicing module in the cross-task attention module of the multi-task human body analysis model performs feature splicing on the single-task features corresponding to the 4 single tasks, splicing the 4 task feature vectors into a (4, 256 x 7 x 7) vector group as the spliced feature.
The self-attention module in the cross-task attention module of the multi-task human body analysis model then performs self-attention transformation on the spliced feature and splits the resulting transformation feature, giving the (256, 7, 7) feature maps that are the attention features corresponding to the 4 single tasks.
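The splice-attend-split flow can be sketched as below; this parameter-free version assumes plain scaled dot-product self-attention over the task axis (a real module would add learned query/key/value projections):

```python
import torch
import torch.nn.functional as F

def cross_task_attention(task_feats):
    """Self-attention across the task axis.
    task_feats: (B, T, D) -- T task feature vectors per image.
    Uses the features themselves as queries/keys/values for brevity."""
    q = k = v = task_feats
    scores = q @ k.transpose(1, 2) / (q.shape[-1] ** 0.5)  # (B, T, T) task-to-task affinity
    attn = F.softmax(scores, dim=-1)
    return attn @ v                                         # each task attends to all tasks

# Splice the 4 single-task vectors into a (4, 256*7*7) group, attend, split back.
feats = [torch.randn(2, 256 * 7 * 7) for _ in range(4)]    # behavior/clothes/pose/parsing
stacked = torch.stack(feats, dim=1)                        # (2, 4, 12544) spliced feature
refined = cross_task_attention(stacked)
attn_feats = [t.reshape(2, 256, 7, 7) for t in refined.unbind(dim=1)]
print(attn_feats[0].shape)  # torch.Size([2, 256, 7, 7])
```

After splitting, each task gets back a (256, 7, 7) attention feature enriched with information from the other three tasks.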
Fig. 5 is a schematic flow chart of task analysis provided by the present invention, and as shown in fig. 5, based on a task analysis module in a multitask human body analysis model, the attention features corresponding to the 4 single-task features are subjected to corresponding single-task analysis, so as to obtain analysis results corresponding to the 4 single tasks.
(1) Behavior recognition task: the feature vector is reshaped into a (256, 7, 7) feature map, down-sampled to 1 x 1 pixel by a MaxPool layer with a kernel size of 7 x 7, reshaped into a 256-dimensional feature vector, and passed through 2 fully connected layers to obtain probabilities whose dimension equals the number of behavior categories, outputting num_actions probability values, i.e., the predicted values of all behavior categories.
(2) Clothing recognition task: a 256-dimensional feature vector is obtained through reshaping and a MaxPool layer, and 3 parallel fully connected layers output 3 probability vectors for head attributes, upper garment attributes and lower garment attributes, with dimensions num_heads, num_ups and num_bottoms respectively.
(3) Human body posture estimation task: the reshaped feature map passes through a deconv_layers module comprising 3 deconvolution operations, 3 BatchNorm operations and 3 ReLU operations, which up-samples the feature map to (64, 56, 56); a final_layer comprising a channel-reducing convolution then outputs a thermodynamic diagram (heatmap) of (num_joints, 56, 56), where num_joints represents the number of joint points.
(4) Human body analysis task: the reshaped (256, 7, 7) feature map passes through a deconv_layers module comprising 3 deconvolution operations, 3 BatchNorm operations and 3 ReLU operations, which up-samples the feature map to (64, 56, 56); a final_layer comprising a channel-reducing convolution then outputs a category activation map of (num_parts, 56, 56), where num_parts represents the number of human body part categories.
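Two of the heads described above can be sketched as follows; `num_actions`, `num_joints` and the intermediate channel widths are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

num_actions, num_joints = 10, 17   # illustrative sizes

# Behavior head: 7x7 max pooling -> 256-d vector -> 2 fully connected layers.
behavior_head = nn.Sequential(
    nn.MaxPool2d(7),                 # (256, 7, 7) -> (256, 1, 1)
    nn.Flatten(),                    # -> 256-d feature vector
    nn.Linear(256, 256), nn.ReLU(True),
    nn.Linear(256, num_actions))     # -> num_actions predicted values

# Pose head: 3 x (deconv + BatchNorm + ReLU) up-samples 7x7 -> 56x56, then a
# final 1x1 conv reduces channels to one heatmap per joint point.
def deconv(in_ch, out_ch):
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(True))

pose_head = nn.Sequential(
    deconv(256, 128), deconv(128, 64), deconv(64, 64),  # -> (64, 56, 56)
    nn.Conv2d(64, num_joints, 1))                       # -> (num_joints, 56, 56)

x = torch.randn(1, 256, 7, 7)
print(behavior_head(x).shape)  # torch.Size([1, 10])
print(pose_head(x).shape)      # torch.Size([1, 17, 56, 56])
```

The human body analysis head follows the same deconv pattern as the pose head, with num_parts output channels instead of num_joints.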
The image analysis method provided by the embodiment of the invention, based on a multi-task human body analysis model, shares low-level basic image features across tasks, extracts the unique task features of each task, and finally shares high-level features at the task semantic level using an attention mechanism, so that other tasks can assist the recognition of the current task and the recognition accuracy of the at least two single tasks is improved.
Based on any of the above embodiments, the multitask human body analysis model is obtained by training based on the following steps:
step 210, inputting a sample human body image into an initial multitask student model to obtain student prediction results respectively corresponding to at least two single tasks output by the initial multitask student model;
step 220, respectively inputting the sample human body images into at least two single-task teacher models to obtain teacher prediction results respectively corresponding to at least two single tasks output by each single-task teacher model;
and step 230, performing distillation training on the initial multitask student model based on the student prediction result and the teacher prediction result corresponding to the at least two single tasks respectively and the labels of the at least two single tasks corresponding to the sample human body image to obtain a multitask human body analysis model.
In particular, the multitask human body analysis model can be trained in a knowledge distillation form. The distillation training system includes an initial multitask student model and at least two single-task teacher models. The initial multitask student model can adopt the model structure described in the above embodiments, and the at least two single-task teacher models can adopt any model structures, representing models that perform excellently on their respective tasks.
In the training process, inputting a sample human body image into an initial multi-task student model to obtain student prediction results respectively corresponding to at least two single tasks output by the initial multi-task student model; and respectively inputting the sample human body images into the at least two single-task teacher models to obtain teacher prediction results respectively corresponding to the at least two single tasks output by each single-task teacher model. And (3) constructing distillation loss between the student prediction results and the teacher prediction results, freezing parameters of at least two single-task teacher models during training, and updating the parameters of the multi-task student models only.
The student prediction result and the teacher prediction result may specifically include prediction probabilities of various behavior categories, prediction probabilities of various clothing attributes, regression targets of various human body joint points, prediction probabilities of pixel-by-pixel human body analysis, and the like.
If the task is a classification task, such as behavior recognition, clothing attribute recognition or human body analysis, a distillation temperature is set to smooth the output probability vectors of the initial multitask student model and the single-task teacher model; the two outputs are then regarded as two different distributions, and a distillation loss based on the cross-entropy loss function is constructed with the goal of minimizing the cross entropy between the two distributions. The outputs of the two models on the classification task are thus aligned.
If the task is a regression task, such as human posture estimation, an L2 loss function is adopted to align the joint-point feature heatmaps of the teacher prediction result and the student prediction result.
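The two distillation losses can be sketched as below; the T^2 rescaling is a common convention for temperature-scaled distillation and is an assumption here, not stated in the patent:

```python
import torch
import torch.nn.functional as F

def kd_loss_classification(student_logits, teacher_logits, T=4.0):
    """Temperature-softened distillation loss for the classification tasks:
    cross entropy between the teacher's and student's softened distributions."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # T^2 rescaling keeps gradient magnitudes comparable across temperatures
    return -(p_teacher * log_p_student).sum(dim=-1).mean() * T * T

def kd_loss_pose(student_heatmaps, teacher_heatmaps):
    """L2 alignment of joint-point heatmaps for the regression task."""
    return F.mse_loss(student_heatmaps, teacher_heatmaps)

s = torch.randn(8, 10)   # student logits for a batch of 8, 10 classes
t = torch.randn(8, 10)   # frozen-teacher logits
print(float(kd_loss_classification(s, t)) >= 0)  # True
```

During training the teacher parameters are frozen, so gradients from these losses update only the multi-task student model.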
Based on any one of the above embodiments, the at least two single task labels corresponding to the sample human body image include a real label and a pseudo label, and the pseudo label is determined based on the at least two single task teacher models.
Specifically, a real label is a label annotated on the sample image that is truly correct, usually labeled manually; a pseudo label in the embodiment of the invention is obtained by inference from single-task teacher models trained on their respective tasks.
Firstly, various human body images are collected, labels are manually annotated for each single-task data set, and labels for the other tasks are generated by the single-task teacher models. For example, the human body images used by the behavior recognition task carry manually annotated behavior recognition real labels (fall, fight, smoke, telephone, etc.), but also need labels for clothing recognition, posture estimation and human body analysis; these are obtained by inference from neural networks trained on the respective tasks and are called pseudo labels. In this way, the human body images and labels of each data set can be combined into one supervised multi-task data set. The sample human body image is then input into the multitask student model to obtain the student prediction result output by the multitask student model.
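Assembling one supervised multi-task sample can be sketched as follows, keeping the manually annotated label for the source task and filling the other tasks with frozen-teacher inferences; the dictionary layout and the toy linear "teachers" are illustrative assumptions:

```python
import torch

@torch.no_grad()
def add_pseudo_labels(image, real_task, real_label, teachers):
    """Build one multi-task sample: keep the manual label for `real_task`
    and fill the remaining tasks with frozen-teacher predictions.
    `teachers` maps task name -> trained single-task model (assumed given)."""
    sample = {"image": image, real_task: {"label": real_label, "pseudo": False}}
    for task, teacher in teachers.items():
        if task == real_task:
            continue
        teacher.eval()   # teacher parameters stay frozen
        sample[task] = {"label": teacher(image.unsqueeze(0)).squeeze(0),
                        "pseudo": True}
    return sample

# Toy stand-ins for trained single-task teacher models.
teachers = {"clothes": torch.nn.Linear(3 * 224 * 224, 5),
            "pose": torch.nn.Linear(3 * 224 * 224, 17)}
img = torch.randn(3 * 224 * 224)   # flattened image, for the toy teachers
sample = add_pseudo_labels(img, "behavior", 2, teachers)
print(sorted(sample.keys()))  # ['behavior', 'clothes', 'image', 'pose']
```

Repeating this over every single-task data set yields the combined supervised multi-task data set described above.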
For the output on sample images with real labels, the classification tasks are aligned using a cross-entropy loss function and the regression task using an L2 loss function. For example, the cross-entropy loss function used by the behavior recognition task, the clothing recognition task and the human body analysis task on the real label can be expressed as follows:
$$L_{CE} = -\sum_{j=1}^{N} y_j \log\left(p_j\right)$$

where $N$ is the number of classification categories; $y_j$ indicates whether the sample belongs to category $j$ (1 if yes, 0 if not); $p_j$ is the probability of category $j$ output by the multi-task student model; and $L_{CE}$ represents the cross-entropy loss function of the classification task based on the real label.
Taking the regression task as an example of the pose estimation task, the L2 loss function used by the pose estimation task on the real tag can be expressed as follows:
$$L_{pose} = \frac{1}{AK}\sum_{k=1}^{K}\sum_{j=1}^{A}\left(h_{k,j} - \hat{h}_{k,j}\right)^2$$

where $A$ is the number of pixels of the joint-point heatmap; $K$ is the number of estimated joint points; $h_{k,j}$ is the label value of the $k$-th joint point of the sample at the $j$-th position; $\hat{h}_{k,j}$ is the regression value at the $j$-th position of the $k$-th joint point output by the multi-task student model; and $L_{pose}$ represents the L2 loss function of the pose estimation task.
For the output on sample images with pseudo labels, the distillation loss used by the classification tasks, such as the behavior recognition, clothing recognition and human body analysis tasks, can be expressed as follows:
$$L_{KD} = -\sum_{j=1}^{N} q_j^{T}\,\log\left(p_j^{T}\right),\qquad
q_j^{T} = \frac{\exp\left(z_j/T\right)}{\sum_{k=1}^{N}\exp\left(z_k/T\right)},\qquad
p_j^{T} = \frac{\exp\left(v_j/T\right)}{\sum_{k=1}^{N}\exp\left(v_k/T\right)}$$

where $N$ is the number of classification categories; $q_j^{T}$ is the classification probability value of category $j$ output by the single-task teacher model through a softmax function at distillation temperature $T$; $p_j^{T}$ is the classification probability value of category $j$ output by the multi-task student model through a softmax function at distillation temperature $T$; $z_j$ and $z_k$ are the outputs of the $j$-th and $k$-th categories after the classification layer of the single-task teacher model; $v_j$ and $v_k$ are the outputs of the $j$-th and $k$-th categories after the classification layer of the multi-task student model; and $L_{KD}$ represents the KL divergence loss function of the classification task based on the soft label (the single-task teacher model output under temperature $T$ control).
According to the method provided by the embodiment of the invention, a plurality of single-task teacher models can be flexibly replaced in the training stage, and each subtask target can be met to the greatest extent; the real label and the pseudo label are utilized, cross-task labeling is not needed manually, and cost is reduced; the real label and the pseudo label are used in a mixed mode, so that the data volume sensed by the initial multi-task student model is enlarged, and the multi-task human body analysis model obtained through training has certain multi-task recognition capability; meanwhile, a single-task teacher and an initial multi-task student model controlled by distillation temperature are adopted, so that the output value ranges are effectively aligned, and the tendencies of all tasks are balanced.
Based on any of the above embodiments, step 230 specifically includes:
231, performing distillation training on the initial multitask student model based on the student prediction result and the teacher prediction result respectively corresponding to the at least two single tasks and the at least two single task labels corresponding to the sample human body image to obtain a pre-training model for multitask human body analysis;
step 232, inputting the sample human body image into a pre-training model to obtain prediction results respectively corresponding to at least two single tasks output by the pre-training model;
and 233, fine-tuning the pre-training model based on the loss determined by the prediction result and the real label in the at least two single-task labels and the punishment items respectively corresponding to the at least two single tasks to obtain the multi-task human body analysis model.
Specifically, since the above embodiment uses pseudo labels to enlarge the sample size in the model training stage, and pseudo labels introduce noise, the model needs to be fine-tuned in order to further improve the accuracy of the multi-task human body analysis model.
Step 231 is executed first to obtain a pre-trained model of the multi-task human body analysis model, and on the basis, fine tuning is performed on the obtained pre-trained model.
In the fine tuning phase, only manually labeled real tags on each single task are used. Fig. 6 is a schematic flow chart of a fine-tuning stage of the model training method provided by the present invention, and as shown in fig. 6, downstream task human body images in each field are respectively and manually labeled to obtain a plurality of single-task human body analysis data sets.
In each fine-tuning round of the pre-training model, 4 groups of data are sampled for the behavior recognition task, the clothing recognition task, the posture estimation task and the human body analysis task respectively, and 4 inference passes are run to obtain 4 loss functions, denoted $L_1$, $L_2$, $L_3$, $L_4$ respectively.
A multi-task penalty based on task uncertainty is then introduced. Assuming that the output values of the regression task obey a normal distribution, its task loss is penalized based on the variance of that normal distribution, while for each classification task the square of the temperature T is added as a penalty term. The penalty term here may be a fixed value observed empirically, or a learnable value.
On the basis, the loss function of the pre-training model in the fine tuning stage can be expressed as follows:
$$L_{total} = \frac{1}{T_1^2}L_1 + \frac{1}{T_2^2}L_2 + \frac{1}{2\sigma^2}L_3 + \frac{1}{T_4^2}L_4 + \log T_1 + \log T_2 + \log\sigma + \log T_4$$

where $L_{total}$ is the multi-task loss function of the pre-training model; $T_1$ is the temperature whose square is the penalty term for the cross-entropy loss function $L_1$ of the behavior recognition task; $T_2$ is the temperature for the cross-entropy loss function $L_2$ of the clothing recognition task; $T_4$ is the temperature for the cross-entropy loss function $L_4$ of the human body analysis task; and $\sigma^2$ is the variance of the normal distribution that the output values of the pose estimation task (loss $L_3$) are assumed to follow. The log functions are regularization terms on the loss weights, which effectively prevent the network from over-fitting to some single task.
When the multiple task loss functions are weighted in this way, a task with high uncertainty receives a large penalty and its task weight is reduced, while a task with low uncertainty receives a small penalty and its task weight is increased.
In each training iteration, data for the 4 tasks are sampled, 4 forward propagations are executed and aligned with the task labels to obtain 4 single-task loss functions, the uncertainties and the resulting loss weights are calculated to obtain the total uncertainty-weighted loss function, and 1 back propagation is executed.
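The uncertainty-weighted total loss above can be sketched as follows; learning log T and log sigma rather than the raw values is an implementation choice for numerical stability, assumed here rather than taken from the patent:

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Combines the 4 single-task losses with learnable penalty terms:
    temperature T per classification task (weight 1/T^2) and variance
    sigma^2 for the pose regression task (weight 1/(2*sigma^2)), plus log
    regularizers so the penalties cannot grow without bound."""
    def __init__(self):
        super().__init__()
        # parameterize log T / log sigma so T and sigma stay positive
        self.log_t = nn.Parameter(torch.zeros(3))      # behavior, clothes, parsing
        self.log_sigma = nn.Parameter(torch.zeros(1))  # pose estimation
    def forward(self, l_cls, l_pose):
        t2 = torch.exp(2 * self.log_t)                 # T^2 per classification task
        sigma2 = torch.exp(2 * self.log_sigma)
        loss = (torch.stack(l_cls) / t2).sum() + self.log_t.sum()
        loss = loss + l_pose / (2 * sigma2) + self.log_sigma.sum()
        return loss

crit = UncertaintyWeightedLoss()
total = crit([torch.tensor(1.0), torch.tensor(2.0), torch.tensor(0.5)],
             torch.tensor(3.0))
print(float(total))  # 5.0 at the initial T = sigma = 1
```

A single `total.backward()` then performs the 1 back propagation described above, updating both the model and the learnable penalty terms.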
In the fine-tuning stage of the pre-training model, the data and labels of each downstream task come from different fields, and no time-consuming and expensive multi-task manual labeling is performed. To ensure that the model keeps high performance on the downstream tasks in each field, and unlike prior methods in which each task fine-tunes only its own feature decoding module, a joint fine-tuning method is adopted for the global parameters of the pre-training model, and a multi-task loss based on uncertainty measurement is designed to balance the weights among the tasks. Cross-domain knowledge is thus combined without the inter-domain common supervision labels that are difficult to label manually, reducing cost.
Based on any one of the embodiments, an image analysis method is provided, which includes:
and S1, collecting various human body images, manually labeling a part of the human body images to form a real label, and labeling a part of the human body images to form a pseudo label, and combining the pseudo label and the pseudo label to form a multi-task human body analysis data set.
S2, collecting the sub data sets corresponding to the single tasks, wherein the sub data sets are independent of each other, and sorting 4 types of single-task data sets: behavior recognition, clothing recognition, posture estimation and human body analysis. A suitable deep learning model is selected for each single task, and 4 single-task teacher models are trained to serve as the teacher models for knowledge distillation.
S3, constructing an initial multitask student model, wherein the initial multitask student model comprises the following steps:
the shared feature extraction module is used for extracting shared features of the human body images to obtain the shared features of the human body images;
the independent individual task feature extraction module is used for respectively extracting the features of the individual tasks from the shared features to obtain individual task features corresponding to the 4 individual tasks;
the cross-task attention module is used for performing attention interaction on the single-task characteristics corresponding to the 4 single tasks respectively to obtain the attention characteristics corresponding to the 4 single tasks respectively;
and the task analysis module is used for respectively carrying out task analysis on the attention characteristics corresponding to the 4 single tasks to obtain analysis results corresponding to the 4 single tasks.
S4, in the pre-training stage, combining the 4 single-task data sets, with missing multi-task labels obtained by inference from the trained single-task teacher models of the other tasks, so as to construct a human body analysis multi-task data set. A knowledge distillation form is adopted, supervised by a temperature-controlled distillation loss function, and distillation training is performed on the initial multi-task student model to obtain a pre-training model.
And S5, entering a cross-task fine adjustment stage, and fine adjusting the pre-training model based on the loss determined by the prediction result of the pre-training model and the real labels in the 4 single-task labels and the penalty items corresponding to the 4 single tasks respectively to obtain the multi-task human body analysis model.
S6, performing multi-task human body analysis based on the trained multi-task human body analysis model, wherein the multi-task human body analysis comprises the following steps: at least two of human behavior recognition, clothing recognition, human posture estimation and human body analysis.
The following describes the image analysis apparatus provided by the present invention, and the image analysis apparatus described below and the image analysis method described above may be referred to in correspondence with each other.
Based on any of the above embodiments, fig. 7 is a schematic structural diagram of an image analysis apparatus provided by the present invention, as shown in fig. 7, the apparatus includes:
a human body image determination unit 710 for determining a human body image to be analyzed;
a feature extraction unit 720, configured to perform feature extraction on the human body image based on a feature extraction module in the multitask human body analysis model, so as to obtain single-task features corresponding to at least two single tasks respectively;
the attention interaction unit 730 is configured to perform attention interaction on the single-task features respectively corresponding to the at least two single tasks based on a cross-task attention module in the multi-task human body analysis model to obtain the attention features respectively corresponding to the at least two single tasks;
the task analysis unit 740 is configured to perform, based on a task analysis module in the multitask human body analysis model, task analysis on corresponding single tasks for the attention features corresponding to the at least two single tasks, so as to obtain analysis results corresponding to the at least two single tasks;
the multitask human body analysis model is obtained by training based on a sample human body image and at least two single task labels corresponding to the sample human body image.
The image analysis device provided by the embodiment of the invention adopts the characteristic extraction module in the multi-task human body analysis model to extract the characteristics of the human body image, so as to obtain the single task characteristics aiming at each single task; performing attention interaction on each single task characteristic by adopting a cross-task attention module to obtain an attention characteristic aiming at each single task; and (4) performing task analysis on each attention characteristic by adopting a task analysis module so as to obtain a human body analysis result of each single task. Therefore, the application of the multi-task model in the human body analysis scene is realized, the limitation of single-task characteristics is made up through the cross-task attention module, the semantic information concerned by the single task is enhanced, and the accuracy, robustness and generalization of each single-task human body analysis are further improved.
Based on any of the above embodiments, the attention interacting unit 730 is further configured to:
performing feature splicing on the single task features respectively corresponding to the at least two single tasks based on a feature splicing module in the cross-task attention module to obtain splicing features;
and performing self-attention transformation on the spliced features based on a self-attention module in the cross-task attention module, and splitting transformation features obtained through the self-attention transformation to obtain attention features corresponding to the at least two single tasks respectively.
Based on any of the above embodiments, the feature extraction unit 720 is further configured to:
based on a shared feature extraction module in the feature extraction module, carrying out shared feature extraction on the human body image to obtain shared features of the human body image;
and respectively extracting the features of the at least two single tasks from the shared features based on at least two single task feature extraction modules in the feature extraction modules to obtain the single task features respectively corresponding to the at least two single tasks.
Based on any of the above embodiments, the image analysis apparatus further comprises a model training unit configured to:
inputting the sample human body image into an initial multi-task student model to obtain student prediction results respectively corresponding to the at least two single tasks output by the initial multi-task student model;
inputting the sample human body image into at least two single-task teacher models respectively to obtain teacher prediction results corresponding to the at least two single tasks output by each single-task teacher model respectively;
and performing distillation training on the initial multitask student model based on the student prediction result and the teacher prediction result corresponding to the at least two single tasks respectively and the labels of the at least two single tasks corresponding to the sample human body image to obtain the multitask human body analysis model.
Based on any of the above embodiments, the at least two single task labels corresponding to the sample human body image include a real label and a pseudo label, and the pseudo label is determined based on the at least two single task teacher models.
Based on any of the embodiments above, the model training unit is further configured to:
performing distillation training on the initial multi-task student model based on the student prediction result and the teacher prediction result corresponding to the at least two single tasks respectively and the at least two single task labels corresponding to the sample human body image to obtain a pre-training model for multi-task human body analysis;
inputting the sample human body image into the pre-training model to obtain prediction results respectively corresponding to the at least two single tasks output by the pre-training model;
and fine-tuning the pre-training model based on the loss determined by the prediction result and the real label in the at least two single-task labels and the punishment items corresponding to the at least two single tasks respectively to obtain the multi-task human body analysis model.
Fig. 8 illustrates a physical structure diagram of an electronic device, and as shown in fig. 8, the electronic device may include: a processor (processor) 810, a communication Interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication Interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform an image analysis method comprising:
determining a human body image to be analyzed;
based on a feature extraction module in a multitask human body analysis model, performing feature extraction on the human body image to obtain single-task features corresponding to at least two single tasks respectively;
performing attention interaction on the single task characteristics respectively corresponding to the at least two single tasks based on a cross-task attention module in the multi-task human body analysis model to obtain the attention characteristics respectively corresponding to the at least two single tasks;
based on a task analysis module in the multi-task human body analysis model, performing corresponding single-task analysis on attention features respectively corresponding to the at least two single tasks to obtain analysis results respectively corresponding to the at least two single tasks;
the multitask human body analysis model is obtained by training based on a sample human body image and the at least two single task labels corresponding to the sample human body image.
In addition, the logic instructions in the memory 830 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium; when executed by a processor, the computer program performs the image analysis method provided by the above methods, the method comprising:
determining a human body image to be analyzed;
extracting features from the human body image based on a feature extraction module in a multi-task human body analysis model, to obtain single-task features corresponding to each of at least two single tasks;
performing attention interaction on the single-task features corresponding to the at least two single tasks based on a cross-task attention module in the multi-task human body analysis model, to obtain attention features corresponding to each of the at least two single tasks;
performing corresponding single-task analysis on the attention features corresponding to the at least two single tasks based on a task analysis module in the multi-task human body analysis model, to obtain analysis results corresponding to each of the at least two single tasks;
wherein the multi-task human body analysis model is trained based on a sample human body image and at least two single-task labels corresponding to the sample human body image.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the image analysis method provided by the above methods, the method comprising:
determining a human body image to be analyzed;
extracting features from the human body image based on a feature extraction module in a multi-task human body analysis model, to obtain single-task features corresponding to each of at least two single tasks;
performing attention interaction on the single-task features corresponding to the at least two single tasks based on a cross-task attention module in the multi-task human body analysis model, to obtain attention features corresponding to each of the at least two single tasks;
performing corresponding single-task analysis on the attention features corresponding to the at least two single tasks based on a task analysis module in the multi-task human body analysis model, to obtain analysis results corresponding to each of the at least two single tasks;
wherein the multi-task human body analysis model is trained based on a sample human body image and at least two single-task labels corresponding to the sample human body image.
The above-described apparatus embodiments are merely illustrative; the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, i.e., they may be located in one place or distributed across at least two network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the various embodiments or in parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An image analysis method, comprising:
determining a human body image to be analyzed;
extracting features from the human body image based on a feature extraction module in a multi-task human body analysis model, to obtain single-task features corresponding to each of at least two single tasks;
performing attention interaction on the single-task features corresponding to the at least two single tasks based on a cross-task attention module in the multi-task human body analysis model, to obtain attention features corresponding to each of the at least two single tasks;
performing corresponding single-task analysis on the attention features corresponding to the at least two single tasks based on a task analysis module in the multi-task human body analysis model, to obtain analysis results corresponding to each of the at least two single tasks;
wherein the multi-task human body analysis model is trained based on a sample human body image and at least two single-task labels corresponding to the sample human body image.
2. The image analysis method according to claim 1, wherein the performing attention interaction on the single-task features corresponding to the at least two single tasks based on a cross-task attention module in the multi-task human body analysis model, to obtain attention features corresponding to each of the at least two single tasks, comprises:
splicing the single-task features corresponding to the at least two single tasks based on a feature splicing module in the cross-task attention module, to obtain a spliced feature; and
performing self-attention transformation on the spliced feature based on a self-attention module in the cross-task attention module, and splitting the transformed feature obtained through the self-attention transformation, to obtain the attention features corresponding to each of the at least two single tasks.
3. The image analysis method according to claim 1, wherein the performing feature extraction on the human body image based on a feature extraction module in the multi-task human body analysis model, to obtain single-task features corresponding to each of at least two single tasks, comprises:
extracting a shared feature from the human body image based on a shared feature extraction module in the feature extraction module; and
extracting features for the at least two single tasks from the shared feature based on at least two single-task feature extraction modules in the feature extraction module, to obtain the single-task features corresponding to each of the at least two single tasks.
4. The image analysis method according to claim 1, wherein the multi-task human body analysis model is trained based on the following steps:
inputting the sample human body image into an initial multi-task student model, to obtain student prediction results corresponding to each of the at least two single tasks output by the initial multi-task student model;
inputting the sample human body image into each of at least two single-task teacher models, to obtain teacher prediction results corresponding to each of the at least two single tasks output by the single-task teacher models; and
performing distillation training on the initial multi-task student model based on the student prediction results and the teacher prediction results corresponding to the at least two single tasks, and on the at least two single-task labels corresponding to the sample human body image, to obtain the multi-task human body analysis model.
5. The image analysis method according to claim 4, wherein the at least two single-task labels corresponding to the sample human body image comprise real labels and pseudo labels, the pseudo labels being determined based on the at least two single-task teacher models.
6. The image analysis method according to claim 5, wherein the performing distillation training on the initial multi-task student model based on the student prediction results and the teacher prediction results corresponding to the at least two single tasks, and on the at least two single-task labels corresponding to the sample human body image, to obtain the multi-task human body analysis model, comprises:
performing distillation training on the initial multi-task student model based on the student prediction results and the teacher prediction results corresponding to the at least two single tasks, and on the at least two single-task labels corresponding to the sample human body image, to obtain a pre-trained model for multi-task human body analysis;
inputting the sample human body image into the pre-trained model, to obtain prediction results corresponding to each of the at least two single tasks output by the pre-trained model; and
fine-tuning the pre-trained model based on a loss determined from the prediction results and the real labels among the at least two single-task labels, together with penalty terms corresponding to each of the at least two single tasks, to obtain the multi-task human body analysis model.
7. The image analysis method according to any one of claims 1 to 6, wherein the multi-task human body analysis comprises at least two of: human behavior recognition, clothing recognition, human posture estimation, and human body parsing.
8. An image analysis apparatus, comprising:
a human body image determining unit, configured to determine a human body image to be analyzed;
a feature extraction unit, configured to extract features from the human body image based on a feature extraction module in a multi-task human body analysis model, to obtain single-task features corresponding to each of at least two single tasks;
an attention interaction unit, configured to perform attention interaction on the single-task features corresponding to the at least two single tasks based on a cross-task attention module in the multi-task human body analysis model, to obtain attention features corresponding to each of the at least two single tasks; and
a task analysis unit, configured to perform corresponding single-task analysis on the attention features corresponding to the at least two single tasks based on a task analysis module in the multi-task human body analysis model, to obtain analysis results corresponding to each of the at least two single tasks;
wherein the multi-task human body analysis model is trained based on a sample human body image and at least two single-task labels corresponding to the sample human body image.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the image analysis method according to any one of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium on which a computer program is stored, the computer program when executed by a processor implementing the image analysis method according to any one of claims 1 to 7.
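The teacher-student distillation described in claims 4 to 6 can be sketched as a per-task loss that mixes a supervised term (against real or pseudo labels) with a distillation term (against each single-task teacher's prediction). This is a minimal illustrative sketch: the MSE criteria, the weighting factor `alpha`, and the toy vectors are assumptions, since the patent does not specify the loss functions or their weighting.

```python
import numpy as np

def mse(a, b):
    # Mean squared error between two prediction vectors.
    return float(np.mean((a - b) ** 2))

def distillation_loss(student_preds, teacher_preds, labels, alpha=0.5):
    # Per single task: alpha * supervised term + (1 - alpha) * distillation
    # term, summed over all tasks. Both criteria and alpha are illustrative.
    loss = 0.0
    for s, t, y in zip(student_preds, teacher_preds, labels):
        loss += alpha * mse(s, y) + (1 - alpha) * mse(s, t)
    return loss

# Two single tasks, each with a 4-dimensional prediction vector.
student = [np.zeros(4), np.ones(4)]
teacher = [np.ones(4), np.ones(4)]
labels  = [np.ones(4), np.ones(4)]   # real labels or teacher-derived pseudo labels
loss = distillation_loss(student, teacher, labels)
```

Claim 6's second stage would then fine-tune the distilled model on the supervised term alone, restricted to the real labels and augmented with per-task penalty terms.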
CN202210947162.0A 2022-08-09 2022-08-09 Image analysis method, image analysis device, electronic equipment and storage medium Active CN115019349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210947162.0A CN115019349B (en) 2022-08-09 2022-08-09 Image analysis method, image analysis device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN115019349A true CN115019349A (en) 2022-09-06
CN115019349B CN115019349B (en) 2022-11-04

Family

ID=83065632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210947162.0A Active CN115019349B (en) 2022-08-09 2022-08-09 Image analysis method, image analysis device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115019349B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376308A (en) * 2014-11-24 2015-02-25 天津大学 Human action recognition method based on multitask learning
CN107563279A (en) * 2017-07-22 2018-01-09 复旦大学 The model training method adjusted for the adaptive weighting of human body attributive classification
CN110765960A (en) * 2019-10-29 2020-02-07 黄山学院 Pedestrian re-identification method for adaptive multi-task deep learning
CN111597870A (en) * 2020-03-26 2020-08-28 中国电子科技集团公司第五十二研究所 Human body attribute identification method based on attention mechanism and multi-task learning
CN111709289A (en) * 2020-05-18 2020-09-25 杭州电子科技大学 Multi-task deep learning model for improving human body analysis effect
CN111738091A (en) * 2020-05-27 2020-10-02 复旦大学 Posture estimation and human body analysis system based on multi-task deep learning
US20210142164A1 (en) * 2019-11-07 2021-05-13 Salesforce.Com, Inc. Multi-Task Knowledge Distillation for Language Model
US20210390375A1 (en) * 2020-06-15 2021-12-16 Samsung Electronics Co., Ltd. Multi-sensor, multi-view, multi-frame, multi-task synthetic image fusion engine for mobile imaging system
CN114708553A (en) * 2022-04-11 2022-07-05 山东大学 User behavior identification method based on multitask multi-view incremental learning


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIN Lilang et al., "Action Recognition Based on Human Skeleton Feature Learning", Journal of Communication University of China (Natural Science Edition) *

Also Published As

Publication number Publication date
CN115019349B (en) 2022-11-04

Similar Documents

Publication Publication Date Title
US20210397876A1 (en) Similarity propagation for one-shot and few-shot image segmentation
CN109754015B (en) Neural networks for drawing multi-label recognition and related methods, media and devices
CN108229478B (en) Image semantic segmentation and training method and device, electronic device, storage medium, and program
Tasar et al. StandardGAN: Multi-source domain adaptation for semantic segmentation of very high resolution satellite images by data standardization
CN106709461B (en) Activity recognition method and device based on video
Rahmon et al. Motion U-Net: Multi-cue encoder-decoder network for motion segmentation
US20230042187A1 (en) Behavior recognition method and system, electronic device and computer-readable storage medium
CN111507378A (en) Method and apparatus for training image processing model
CN110309856A (en) Image classification method, the training method of neural network and device
CN108960059A (en) A kind of video actions recognition methods and device
CN110633632A (en) Weak supervision combined target detection and semantic segmentation method based on loop guidance
CN109033107A (en) Image search method and device, computer equipment and storage medium
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN111444370A (en) Image retrieval method, device, equipment and storage medium thereof
CN113191241A (en) Model training method and related equipment
CN114998601B (en) On-line update target tracking method and system based on Transformer
US20230053911A1 (en) Detecting an object in an image using multiband and multidirectional filtering
CN116310318B (en) Interactive image segmentation method, device, computer equipment and storage medium
Pavel et al. Object class segmentation of RGB-D video using recurrent convolutional neural networks
US20230222841A1 (en) Ensemble Deep Learning Method for Identifying Unsafe Behaviors of Operators in Maritime Working Environment
CN112668638A (en) Image aesthetic quality evaluation and semantic recognition combined classification method and system
CN111652181A (en) Target tracking method and device and electronic equipment
CN113011320A (en) Video processing method and device, electronic equipment and storage medium
Pise et al. Relational reasoning using neural networks: a survey
CN112749737A (en) Image classification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant