CN109284733B - Shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural network

Shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural network

Info

Publication number
CN109284733B
CN109284733B (application CN201811197781.2A)
Authority
CN
China
Prior art keywords
shopping guide
pedestrian
training
layer
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811197781.2A
Other languages
Chinese (zh)
Other versions
CN109284733A (en)
Inventor
赵云波
林建武
李灏
宣琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201811197781.2A priority Critical patent/CN109284733B/en
Publication of CN109284733A publication Critical patent/CN109284733A/en
Application granted granted Critical
Publication of CN109284733B publication Critical patent/CN109284733B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/02: Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201: Market modelling; Market analysis; Collecting market data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/10: Terrestrial scenes
    • G: PHYSICS
    • G07: CHECKING-DEVICES
    • G07C: TIME OR ATTENDANCE REGISTERS; REGISTERING OR INDICATING THE WORKING OF MACHINES; GENERATING RANDOM NUMBERS; VOTING OR LOTTERY APPARATUS; ARRANGEMENTS, SYSTEMS OR APPARATUS FOR CHECKING NOT PROVIDED FOR ELSEWHERE
    • G07C1/00: Registering, indicating or recording the time of events or elapsed time, e.g. time-recorders for work people
    • G07C1/10: Registering, indicating or recording the time of events or elapsed time, e.g. time-recorders for work people together with the recording, indicating or registering of other data, e.g. of signs of identity

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Game Theory and Decision Science (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Image Analysis (AREA)

Abstract

A shopping guide negative behavior monitoring method based on yolo and a multitask convolutional neural network. First, a yolo-based pedestrian detection model is trained: it is pre-trained with the ImageNet and voc2007 data sets and then fine-tuned with surveillance-scene images. Next, a multitask convolutional neural network based on ResNet50 is constructed and trained with manually labeled multi-label image data. Shopping-mall surveillance frames are then read over the rtsp protocol; the pedestrian detection model detects the pedestrians in each frame, and each pedestrian image is fed into the multitask convolutional neural network, which identifies whether the pedestrian is a shopping guide, whether the person is sitting, and whether the person is playing with a mobile phone, so as to judge whether the shopping guide exhibits negative behavior; frames showing "severely negative" and "generally negative" shopping guides are saved locally. The yolo-based pedestrian detection network and the multitask convolutional neural network thus effectively monitor and record shopping guide negative behavior.

Description

Shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural network
Technical Field
The invention relates to a method for monitoring shopping guide negative behaviors in the field of new retail sales.
Background
With the increase of labor cost, recruiting more shopping guides in a shopping mall means higher expenses. However, some shopping guides behave negatively at work, for example "playing with a mobile phone" or "sitting down while customers are nearby", which wastes human resources. To avoid unnecessary expenditure, effective attendance management of the shopping guides in a mall is very important.
A common attendance system can only record when a shopping guide clocks in and out; it can neither automatically analyze whether the shopping guide works negatively during working hours nor record images of the shopping guide behaving negatively. To meet this need, the invention applies computer vision technology to recognize and analyze the images collected by the surveillance cameras that are ubiquitous in shopping malls.
For pedestrian detection, existing methods use the histogram of oriented gradients as a descriptor and an SVM for classification; their precision is limited and they are prone to false detections. In recent years, deep convolutional neural networks have greatly improved pedestrian detection accuracy; however, because of cross-data-set fitting problems in transfer learning, such models lack robustness under surveillance viewing angles.
For attribute recognition, convolutional neural networks reach an accuracy in attribute classification that traditional methods cannot match, and CNN architectures such as VGG, ResNet, and DenseNet have been widely used in recent years. However, an original ResNet can classify only one attribute, and recognizing multiple attributes would require training multiple models, which greatly increases the computational burden.
Therefore, no complete solution yet exists for a monitoring system that identifies and records shopping guide negative behaviors.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural networks.
In order to achieve this aim, the invention designs a shopping guide negative behavior monitoring system based on yolo and a multitask convolutional neural network. First, a yolo-based pedestrian detection model and a ResNet50-based multitask convolutional neural network are trained. Then, for the surveillance images sampled at a fixed interval, pedestrians are detected with the yolo-based detection model. Finally, the ResNet50-based multitask convolutional neural network identifies various attributes and behaviors of the shopping guides in the mall, judges whether negative behavior exists, and records images of shopping guides engaging in negative behavior. This solves, to a certain extent, the problems of detecting shopping guide negative behavior and automatically checking attendance on working status. The method can be applied to attendance systems, shopping guide management, store operation, and other aspects of the new retail scenario.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural networks comprises the following steps:
step 1, training a yolo-based pedestrian detection model: constructing a pedestrian detection model based on yolo, pre-training the classification model with the ImageNet data set, pre-training the detection model with the voc2007 data set, and fine-tuning the model with a surveillance-view data set;
step 2, training a multitask convolutional neural network based on ResNet50: constructing the multitask convolutional neural network based on ResNet50 and then training it;
step 3, recording shopping guide negative behavior: reading the surveillance picture, detecting the pedestrians in the mall, identifying the pedestrian attributes, and recording images of shopping guide negative behavior;
compared with the prior art, the technical scheme of the invention has the advantages that:
(1) the pedestrian detection model trained by the invention can perform robust pedestrian detection under the monitoring visual angle of a shopping mall;
(2) the multi-task convolutional neural network trained by the method can simultaneously identify a plurality of attributes of the pedestrian, and keeps high precision and robustness;
(3) the invention extends the attendance system to record negative behavior during working hours rather than only late arrivals and early departures, thereby improving the attendance system.
Drawings
FIG. 1 is a schematic diagram of the yolo pre-trained classification model of the present invention;
FIG. 2 is a schematic diagram of a yolo-based pedestrian detection model of the present invention;
FIG. 3 is a schematic diagram of a multitask convolutional neural network based on ResNet50 of the present invention;
FIG. 4 is a flow chart of the method of the present invention;
Detailed Description
In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail below with reference to the accompanying drawings and examples.
Example 1:
a passive behavior monitoring system for shopping guide based on yolo and multitask convolutional neural network comprises the following steps:
(1) training a yolo-based pedestrian detection model;
step 11: constructing a pedestrian detection model based on yolo;
the invention uses the training mode and the network structure of the yolo second generation for reference, and improves the network structure on the basis, so that the model is more robust in the monitoring view of the invention. Specifically, the network structure of the original yolo-v2 includes 19 convolutional layers and 5 maximum pooling layers, the method of layer jump fusion is used in the invention, 13 convolutional layers and 4 maximum pooling layers are used in the first stage of feature extraction, 7 convolutional layers are used in the second stage, 1 maximum pooling layer is arranged between the first stage and the second stage, and the size of the feature graph output in the first stage is adjusted to be consistent with the size of the feature graph output in the second stage. And then fusing the two oversized feature maps together in a superposition mode to form the input of the stage three. Stage three has two modes, one is a classification network, and the mode is used when the model is pre-trained, specifically, the mode is a 3 x 3 convolution layer and a full connection layer, and the number of neurons in the full connection layer is equal to the classification number; the second mode is a detection network, which is used for training the detection network after loading the pre-training parameters of the first mode, specifically, a layer of convolution layer of 3 × 3 is added, a layer of convolution layer of 1 × 1 is added, the number of convolution kernels is related to the detection category, and the specific numerical values are as follows: anchlors number × (5+ number of detection categories).
The classification network for mode one, as shown in fig. 1, is described in detail below:
Stage one: the input image has size 448 × 448 × 3. The first layer of stage one is a convolutional layer with kernel size 3 × 3 × 32, followed by batch normalization, ReLU nonlinear activation, and 2 × 2 maximum pooling; the second layer is a convolutional layer with kernel size 3 × 3 × 64, followed by batch normalization, ReLU activation, and 2 × 2 maximum pooling; the third layer is a convolutional layer with kernel size 3 × 3 × 128, followed by batch normalization and ReLU activation; the fourth layer is a convolutional layer with kernel size 1 × 1 × 64, followed by batch normalization and ReLU activation; the fifth layer is a convolutional layer with kernel size 3 × 3 × 128, followed by batch normalization, ReLU activation, and 2 × 2 maximum pooling; the sixth layer is a convolutional layer with kernel size 3 × 3 × 256, followed by batch normalization and ReLU activation; the seventh layer is a convolutional layer with kernel size 1 × 1 × 128, followed by batch normalization and ReLU activation; the eighth layer is a convolutional layer with kernel size 3 × 3 × 256, followed by batch normalization, ReLU activation, and 2 × 2 maximum pooling; the ninth layer is a convolutional layer with kernel size 3 × 3 × 512, followed by batch normalization and ReLU activation; the tenth layer is a convolutional layer with kernel size 1 × 1 × 256, followed by batch normalization and ReLU activation; the eleventh layer is a convolutional layer with kernel size 3 × 3 × 512, followed by batch normalization and ReLU activation; the twelfth layer is a convolutional layer with kernel size 1 × 1 × 256, followed by batch normalization and ReLU activation; the thirteenth layer is a convolutional layer with kernel size 3 × 3 × 512, followed by batch normalization and ReLU activation. The output feature map of stage one is denoted output1.
Stage two: first, a 2 × 2 maximum pooling operation is applied to the feature map output by stage one. The first layer of stage two is a convolutional layer with kernel size 3 × 3 × 1024, followed by batch normalization and ReLU nonlinear activation; the second layer is a convolutional layer with kernel size 1 × 1 × 512, followed by batch normalization and ReLU activation; the third layer is a convolutional layer with kernel size 3 × 3 × 1024, followed by batch normalization and ReLU activation; the fourth layer is a convolutional layer with kernel size 1 × 1 × 512, followed by batch normalization and ReLU activation; the fifth layer is a convolutional layer with kernel size 3 × 3 × 1024, followed by batch normalization and ReLU activation; the sixth layer is a convolutional layer with kernel size 3 × 3 × 1024, followed by batch normalization and ReLU activation; the seventh layer is a convolutional layer with kernel size 3 × 3 × 1024, followed by batch normalization and ReLU activation. The output feature map of stage two is denoted output2.
and a third stage: the output feature map output1 of the stage one is convolved by 1 × 1 × 64, and then the size of the output feature map output1 of the stage one is adjusted to be the same as that of the output feature map output2 of the stage two, and the adjusted feature map is marked as output1_ 1. The feature map output1_1 is superimposed with output2 to form a new feature map output 3. Then, 3 × 3 × 1024 convolution, batch normalization, ReLu nonlinear activation operation are performed on the fused feature map output3, and finally, a full connection layer of 1000 neurons is added and constrained by a softmax loss function.
The benefit of this layer jump operation is that output3 has both the fine features of output2 obtained after deep layer convolution and the basic features of output1_1 obtained after shallow layer convolution, so that the network precision is higher.
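A minimal sketch of this layer-jump fusion, assuming PyTorch and a yolo-v2-style space-to-depth resize (the patent says only that the stage-one map is resized and superimposed), with the feature-map shapes implied by a 448 × 448 input:

    import torch
    import torch.nn as nn

    def space_to_depth(x, block=2):
        # Rearrange 2x2 spatial blocks into channels so the shallow map's
        # spatial size matches the deep one (assumed resizing method).
        n, c, h, w = x.shape
        x = x.view(n, c, h // block, block, w // block, block)
        x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
        return x.view(n, c * block * block, h // block, w // block)

    reduce_conv = nn.Conv2d(512, 64, kernel_size=1)   # the 1x1x64 convolution

    output1 = torch.randn(1, 512, 28, 28)    # stage-one features at 448 input
    output2 = torch.randn(1, 1024, 14, 14)   # stage-two features
    output1_1 = space_to_depth(reduce_conv(output1))  # 64x28x28 -> 256x14x14
    output3 = torch.cat([output1_1, output2], dim=1)  # 1280-channel fused map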
For the mode-two detection network, shown in fig. 2, every structure except the last two layers is the same as in the classification model. The difference is that, in the detection network, after the 3 × 3 × 1024 convolution, batch normalization, and ReLU nonlinear activation applied to the stage-three fused feature output3, the fully connected layer is removed and replaced with a 1 × 1 × 30 convolutional layer (5 anchors × (5 + 1 detection category) = 30 kernels); finally the model is constrained by the coordinate loss, confidence loss, and category loss.
Step 12: pre-training a classification model by using an ImageNet data set;
good initialization parameters are an important ring of model convergence, and the detection data set has a small amount of data in each category due to the complicated labeling steps. Therefore, a classification model is trained by using the ImageNet data set, and the trained classification model parameters are used as initialization parameters of a common structure in the detection model.
The standard 1000 classified ImageNet dataset pictures were first randomly cropped, rotated and shifted in hue, saturation, exposure to obtain more available training data, adjusted to 224 x 224 images, trained for 160 epochs, and using the SGD optimizer, the initial learning rate was set to 0.1, momentum to 0.9, and weight decay to 0.0005.
Further, the network was fine-tuned with a larger size (448 × 448) and trained with a learning rate of 0.001 for 10 epochs.
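A sketch of this pre-training setup, assuming PyTorch/torchvision; the optimizer settings are those stated above, while the augmentation magnitudes are illustrative assumptions (brightness standing in for exposure):

    import torch
    import torch.nn as nn
    from torchvision import transforms

    pretrain_transform = transforms.Compose([
        transforms.RandomResizedCrop(224),   # random crop, resized to 224x224
        transforms.RandomRotation(15),       # random rotation
        transforms.ColorJitter(brightness=0.4, saturation=0.4, hue=0.1),
        transforms.ToTensor(),
    ])

    model = nn.Conv2d(3, 32, kernel_size=3, padding=1)  # stand-in for the classifier
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=0.0005)
    # ...160 epochs at 224x224, then 10 fine-tuning epochs at 448x448 with lr 0.001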
Step 13: pre-training the detection model with the voc2007 dataset;
since the first few layers of the detection model are consistent with the classification network, the parameters of the classification network trained in step 12 are used as the initialization parameters of the common structure in the detection network. The Voc2007 dataset is a common detection dataset, and there are 20 classes of labeled detection objects, including pedestrian image data. Training only pedestrian image data, performing data enhancement operation on the pedestrian data, adjusting the image size to 448 multiplied by 448, training 160 epochs by an SGD optimizer, and setting the initial learning rate to 0.0001;
step 14: fine-tuning the model with the surveillance-view data set;
most of the pedestrian data in the voc2007 are not pedestrian images under the monitoring view angle, so that the model trained in the step 13 is difficult to detect pedestrians in the monitoring picture of the shopping mall. Therefore, a data set in the BOT2018 new retail technology challenge match is selected for fine adjustment, and pedestrian images of the data set are collected from monitoring cameras in real market scenes. Performing data enhancement operations such as horizontal rotation, center random cropping and HSV (hue, saturation, value) space fine adjustment on the image of the data set, and adjusting the size to 448 x 448;
loading the model trained in the step 13, training 160 epochs by using an SGD optimizer, setting the initial learning rate to be 0.001, reducing the learning rate with the increase of the training times, setting the learning rate to be 0.001 when 0-5 epochs are used, setting the learning rate to be 0.0001 when 5-80 epochs are used, and setting the learning rate to be 0.00001 when 80-160 epochs are used.
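The piecewise-constant schedule can be written, for example, with a PyTorch LambdaLR (the module shown is only a stand-in for the detector):

    import torch

    model = torch.nn.Linear(10, 10)  # placeholder for the detection model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

    def lr_factor(epoch: int) -> float:
        if epoch < 5:
            return 1.0      # lr 0.001 for epochs 0-5
        if epoch < 80:
            return 0.1      # lr 0.0001 for epochs 5-80
        return 0.01         # lr 0.00001 for epochs 80-160

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

    for epoch in range(160):
        # ...one training pass over the fine-tuning data...
        scheduler.step()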
(2) Training the multitask convolutional neural network based on ResNet50
Step 21: constructing a multitask convolutional neural network based on ResNet 50;
for the pedestrian detected in (1), the attribute of the pedestrian needs to be identified so as to judge whether the shopping guide has the behavior of negative work, and the attributes marked in the data set comprise: "customer" or "shopping guide", "male" or "female", "standing" or "sitting", "playing a mobile phone", or "not playing a mobile phone". These attributes are not related to each other and therefore can be considered as unrelated attributes.
ResNet50 is a network structure with excellent classification performance, however, an original ResNet50 is often not good when directly identifying multiple irrelevant attributes, and training a model for each attribute results in occupying extra computing resources. Therefore, aiming at the identification of shopping guide negative behaviors, the invention designs a multitask convolutional neural network based on ResNet50, and the structure is shown in FIG. 3.
Specifically, the last two layers (full-link layer and pooling layer) of the original ResNet50 are removed, four parallel full-link layers are spliced, the number of neurons in each full-link layer is 2, and the full-link layers respectively represent 8 attributes: "customer" and "shopping guide", "male" and "female", "standing" and "sitting", "playing mobile phone" and "not playing mobile phone", two attributes on the same full connection layer are associated attributes, and an attribute not on one full connection layer is an unrelated attribute. After each full connection, a Softmax layer is connected respectively. The calculation formula of the Softmax loss function is:
Li = -log( e^(z_yi) / Σ_j e^(z_j) )  (1)
where z_j is the j-th output (j = 1, 2) of the i-th fully connected layer and yi is the index of the ground-truth attribute of the i-th associated pair.
the final loss function value is the sum of four Softmax loss function values, namely:
Loss = L1 + L2 + L3 + L4  (2)
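A minimal sketch of this architecture and loss, assuming PyTorch/torchvision (torchvision >= 0.13 API); re-inserting global pooling before the four heads is an assumption made so that each 2-neuron fully connected layer receives a 2048-dimensional feature:

    import torch
    import torch.nn as nn
    from torchvision import models

    class MultiTaskResNet50(nn.Module):
        # ResNet50 trunk with the original pooling and fc layers removed,
        # followed by four parallel 2-neuron fully connected heads.
        def __init__(self):
            super().__init__()
            backbone = models.resnet50(weights=None)
            self.trunk = nn.Sequential(*list(backbone.children())[:-2])
            self.pool = nn.AdaptiveAvgPool2d(1)   # re-added pooling (assumption)
            self.heads = nn.ModuleList([nn.Linear(2048, 2) for _ in range(4)])

        def forward(self, x):
            feat = self.pool(self.trunk(x)).flatten(1)
            return [head(feat) for head in self.heads]   # four (N, 2) logit pairs

    criterion = nn.CrossEntropyLoss()   # the softmax loss of equation (1)

    def total_loss(logits, labels):
        # Equation (2): Loss = L1 + L2 + L3 + L4.
        return sum(criterion(z, y) for z, y in zip(logits, labels))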
step 22: training the multitask convolutional neural network based on ResNet50;
In a convolutional neural network, good initial parameters play an important role in the convergence of the model, so the parameters of a ResNet50 trained on the ImageNet data set, except its last two layers, are loaded as the initialization parameters of the multitask convolutional neural network. The data are the labeled data of the BOT2018 new retail technology challenge; data augmentation operations such as horizontal image flipping, random center cropping, and HSV-space adjustment are applied to obtain more usable training data. Training finally uses the Adam optimization algorithm with the initial learning rate set to 0.0005 for 40 epochs.
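The initialization and optimizer described above might look as follows (the torchvision weight-enum API is an assumption; any ImageNet-trained ResNet50 checkpoint works the same way):

    import torch
    import torch.nn as nn
    from torchvision import models

    net = MultiTaskResNet50()   # the multitask network sketched above
    pretrained = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    # Copy every ImageNet-trained parameter except the removed pooling/fc layers.
    net.trunk.load_state_dict(
        nn.Sequential(*list(pretrained.children())[:-2]).state_dict())

    optimizer = torch.optim.Adam(net.parameters(), lr=0.0005)   # 40 epochs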
(3) Shopping guide negative behavior record
Step 31: reading a monitoring picture;
in a shopping mall, a monitoring system which is widely distributed provides data for the system, and additional equipment is not needed. Before reading the monitoring picture, we need to set two parameters: working time interval and monitoring sampling time. The method comprises the steps that an on-duty time interval is set, so that the system only focuses on the on-duty time, extra computing resources and false detection are reduced, and the purpose of the method is to detect whether the shopping guide has a negative behavior during the on-duty time, so that the behavior of the shopping guide during the off-duty time is out of consideration; the monitoring sampling time is set to control the frequency of reading the monitoring picture, so that extra computing resources can be reduced, detection is not needed at every moment, the smaller the sampling time is, the more the identification times are, the more strict the management is, but the greater the computing burden is, the larger the sampling time is, the less the identification times are, the less the computing burden is, but the management is loose. The default monitoring sampling time of the invention is from 9 am to 9 pm, and the sampling time is 1 sampling time in 30 seconds.
Specifically, the surveillance picture is read over the rtsp protocol, and the frequency of frame reading is controlled by the computer's system time and the sampling period.
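A sketch of this sampling loop, assuming OpenCV for rtsp access (the patent names only the protocol; the stream URL and function names are hypothetical):

    import time
    import cv2

    WORK_START_HOUR, WORK_END_HOUR = 9, 21   # default on-duty interval, 9 am to 9 pm
    SAMPLE_PERIOD_S = 30                     # default: one sample every 30 seconds

    def process(frame):
        pass   # placeholder: hand the frame to steps 32-34

    cap = cv2.VideoCapture("rtsp://<camera-address>/stream")   # hypothetical URL
    last_sample = 0.0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        now = time.time()
        on_duty = WORK_START_HOUR <= time.localtime(now).tm_hour < WORK_END_HOUR
        if on_duty and now - last_sample >= SAMPLE_PERIOD_S:
            last_sample = now
            process(frame)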
Step 32: detecting pedestrians in a mall;
loading the pedestrian detection model trained in the step (1), reading the monitoring image in the step (31), normalizing the image, converting the image into a sensor, and then loading the sensor into the pedestrian detection model, wherein the pedestrian detection model can detect four coordinates of a pedestrian in the image, namely the upper coordinate, the lower coordinate, the left coordinate and the right coordinate, so that multiple persons can be detected on one image;
step 33: identifying pedestrian attributes;
A pedestrian detected in step 32 may be a shopping guide or a customer, and we want to identify whether a shopping guide behaves negatively; the multitask convolutional neural network designed by the invention realizes this function. The multitask convolutional neural network of (2) is loaded, the image data of each pedestrian detected in step 32 are used as its input, and the values output by the fully connected layers are the model's confidence in each attribute of the pedestrian. If the confidence for "shopping guide" is higher than that for "customer", the pedestrian is identified as a shopping guide; if the confidence for "male" is higher than that for "female", as male; if "standing" is higher than "sitting", as standing; and if "playing a mobile phone" is higher than "not playing a mobile phone", as playing a mobile phone; and conversely in each case.
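For a single pedestrian crop, this per-head comparison of confidences reduces to an argmax; the label order within each pair below is an assumption:

    import torch

    ATTRIBUTE_PAIRS = [("customer", "shopping guide"),
                       ("male", "female"),
                       ("standing", "sitting"),
                       ("playing mobile phone", "not playing mobile phone")]

    def decode_attributes(logits):
        # logits: four (1, 2) tensors, one per fully connected head; pick
        # whichever associated attribute has the higher confidence.
        return [pair[int(z.argmax())] for pair, z in zip(ATTRIBUTE_PAIRS, logits)]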
Step 34: recording a shopping guide negative behavior picture;
a block diagram of the system is shown in fig. 4.
Specifically, during the business hours, we make a determination of shopping guide negative behavior for the pedestrian attributes identified in step 33. Firstly, the identity of the pedestrian is judged whether to belong to shopping guide, if the pedestrian belongs to shopping guide, the pedestrian analyzes the posture (standing or sitting) and the working state (whether playing a mobile phone), and further, whether a customer exists in a picture of the shopping guide or not is judged, and under the condition that the customer is in the field, the pedestrian has stricter requirements on the shopping guide. For example, when a shopping guide is playing a cell phone or sitting, and there is no customer in the screen at this time, we consider the shopping guide to be "generally passive"; when a shopping guide is playing a cell phone or sitting with a customer in the screen, we consider the shopping guide to be "severely passive". The determination of the degree of the negative behavior of the specific shopping guide is shown in table 1, and we save the screen in folder 1 for the "seriously negative" shopping guide, in folder 2 for the "general negative" shopping guide, and in no way for the "positive" shopping guide. The store owner can make corresponding penalties for the passive shopping guide based on the image frames of file 1 and folder 2.
TABLE 1 shopping guide negative behavior decision Table
Figure BDA0001829231970000101
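The decision rules of Table 1 and the folder convention can be sketched as follows (function and folder names are illustrative; OpenCV is assumed for writing images):

    import os
    import cv2

    def judge(is_shopping_guide, sitting, playing_phone, customer_in_frame):
        # Decision rules of Table 1.
        if not is_shopping_guide:
            return None
        if sitting or playing_phone:
            return "severely negative" if customer_in_frame else "generally negative"
        return "positive"

    FOLDERS = {"severely negative": "folder1", "generally negative": "folder2"}

    def record(frame, verdict, timestamp):
        # Save "severely negative" frames in folder 1, "generally negative"
        # frames in folder 2; "positive" frames are not saved.
        folder = FOLDERS.get(verdict)
        if folder:
            os.makedirs(folder, exist_ok=True)
            cv2.imwrite(os.path.join(folder, f"{timestamp}.jpg"), frame)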
Example 2:
(1) selecting experimental data
The invention uses the data set of the BOT2018 new retail technology challenge, collected from real mall scenes. The labels in the image data are: "customer" and "shopping guide", "male" and "female", "standing" and "sitting", "playing mobile phone" and "not playing mobile phone". The images cover 5 scenes, 5000 images in total, each containing different numbers of shopping guides and customers. The 5000 images are split into a training set and a test set at a ratio of 9:1, drawn evenly across scenes. The training set contains 1980 images of scene 1, 937 of scene 2, 915 of scene 3, 356 of scene 4, and 312 of scene 5, 4500 images in total; the test set contains 220 images of scene 1, 105 of scene 2, 101 of scene 3, 40 of scene 4, and 34 of scene 5.
(2) Results of the experiment
After constructing and training the ResNet50-based multitask convolutional neural network as in step (2) of example 1, loading the parameters of a ResNet50 trained on ImageNet, and training 40 epochs on the BOT mall data set with the Adam optimizer at an initial learning rate of 0.0005, the final precision on the test set is shown in Table 2:
TABLE 2 results of the experiment
(The per-attribute accuracy values of Table 2 appear in the source only as an image and are not reproduced here.)
The embodiments described in this specification merely illustrate implementations of the inventive concept; the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments, but also covers the equivalents that those skilled in the art may conceive on the basis of the inventive concept.

Claims (1)

1. A shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural networks comprises the following steps:
(1) training a yolo-based pedestrian detection model;
step 11: constructing a pedestrian detection model based on yolo;
using layer-jump fusion: the first stage of feature extraction uses 13 convolutional layers and 4 maximum pooling layers, the second stage uses 7 convolutional layers, and 1 maximum pooling layer sits between the first and second stages; the feature map output by the first stage is resized to be consistent with the feature map output by the second stage; the two size-adjusted feature maps are then fused by superposition to form the input of stage three; stage three has two modes: one is a classification network, used when the model is pre-trained, consisting of a 3 × 3 convolutional layer and a fully connected layer whose number of neurons equals the number of classes; the other is a detection network, used to train detection after loading the pre-trained parameters of mode one, which adds one 3 × 3 convolutional layer followed by one 1 × 1 convolutional layer whose number of kernels depends on the detection categories, namely: number of anchors × (5 + number of detection categories);
step 12: pre-training a classification model by using an ImageNet data set;
training a classification model by using the ImageNet data set, and taking the trained classification model parameters as initialization parameters of a common structure in the detection model;
step 13: pre-training the detection model with the voc2007 dataset;
because the first few layers of the detection model are consistent with the classification network, the classification-network parameters trained in step 12 are used as the initialization parameters of the shared structure in the detection network; the voc2007 data set is a common detection data set with 20 classes of labeled objects, including pedestrian image data; only the pedestrian image data are trained: data augmentation is applied to the pedestrian data, the images are resized to 448 × 448, 160 epochs are trained with the SGD optimizer, and the initial learning rate is set to 0.0001;
step 14: fine-tuning the model with the surveillance-view data set;
selecting the data set of the BOT2018 new retail technology challenge for fine-tuning, whose pedestrian images were collected from surveillance cameras in real mall scenes; applying data augmentation operations such as horizontal flipping, random center cropping, and slight HSV (hue, saturation, value) adjustment to the images, and resizing them to 448 × 448;
loading the model trained in step 13 and training 160 epochs with the SGD optimizer, with the initial learning rate set to 0.001 and decreasing as training progresses: 0.001 for epochs 0-5, 0.0001 for epochs 5-80, and 0.00001 for epochs 80-160;
(2) training a multitask convolutional neural network based on ResNet50;
step 21: constructing the multitask convolutional neural network based on ResNet50;
for the pedestrians detected in step (1), identifying their attributes to judge whether a shopping guide is working negatively, the attributes labeled in the data set being: "customer" or "shopping guide", "male" or "female", "standing" or "sitting", and "playing mobile phone" or "not playing mobile phone"; these attribute pairs are mutually independent and are regarded as unrelated attributes;
aiming at the recognition of shopping guide negative behavior, a multitask convolutional neural network is designed based on ResNet50;
specifically, the last two layers of the original ResNet50, the fully connected layer and the pooling layer, are removed and four parallel fully connected layers are attached, each with 2 neurons, together representing 8 attributes: "customer" and "shopping guide", "male" and "female", "standing" and "sitting", "playing mobile phone" and "not playing mobile phone"; two attributes on the same fully connected layer are associated attributes, and attributes on different fully connected layers are unrelated attributes; each fully connected layer is followed by its own Softmax layer; the Softmax loss function is computed as:
Li = -log( e^(z_yi) / Σ_j e^(z_j) )  (1)
where z_j is the j-th output (j = 1, 2) of the i-th fully connected layer and yi is the index of the ground-truth attribute of the i-th associated pair;
the final loss function value is the sum of four Softmax loss function values, namely:
Loss = L1 + L2 + L3 + L4  (2)
step 22: training the multitask convolutional neural network based on ResNet50;
loading the parameters of a ResNet50 trained on the ImageNet data set, except its last two layers, as the initialization parameters of the multitask convolutional neural network; using the labeled data of the BOT2018 new retail technology challenge, applying data augmentation to the data to obtain more usable training data, and training with the Adam optimizer at an initial learning rate of 0.0005 for 40 epochs;
(3) shopping guide negative behavior records;
step 31: reading the surveillance picture;
in a shopping mall, the widely deployed surveillance system supplies the data; before reading surveillance frames, two parameters need to be set: the on-duty time interval and the monitoring sampling period; setting the on-duty interval keeps the system focused on working hours; setting the sampling period controls how often surveillance frames are read;
reading the surveillance picture over the rtsp protocol, and controlling the frame-reading frequency through the computer's system time;
step 32: detecting pedestrians in the mall;
loading the pedestrian detection model trained in step (1), reading the surveillance image of step 31, normalizing the image, converting it into a tensor, and feeding it into the pedestrian detection model, which detects the four coordinates (top, bottom, left, and right) of each pedestrian in the image;
step 33: identifying pedestrian attributes;
loading the multitask convolutional neural network of step (2), using the image data of each pedestrian detected in step 32 as its input, the values output by the fully connected layers being the model's confidence in each attribute of the pedestrian; if the confidence for "shopping guide" is higher than that for "customer", the pedestrian is identified as a shopping guide; if the confidence for "male" is higher than that for "female", as male; if "standing" is higher than "sitting", as standing; and if "playing a mobile phone" is higher than "not playing a mobile phone", as playing a mobile phone;
step 34: recording shopping guide negative behavior pictures;
during business hours, judging shopping guide negative behavior from the pedestrian attributes identified in step 33; first judging whether the pedestrian is a shopping guide and, if so, analyzing the posture and working state and further judging whether a customer appears in the frame, the requirements on the shopping guide being stricter when a customer is present; when a shopping guide is playing a mobile phone or sitting and no customer is in the frame, the shopping guide is considered "generally negative"; when a shopping guide is playing a mobile phone or sitting while a customer is in the frame, the shopping guide is considered "severely negative"; the degree of negative behavior is thus determined as follows: when a customer is nearby, sitting or playing a mobile phone is judged "severely negative", while standing and not playing a mobile phone is judged "positive"; when no customer is nearby, sitting or playing a mobile phone is judged "generally negative", while standing and not playing a mobile phone is judged "positive"; frames of "severely negative" shopping guides are saved in folder 1, frames of "generally negative" shopping guides are saved in folder 2, and frames of "positive" shopping guides are not saved.
CN201811197781.2A 2018-10-15 2018-10-15 Shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural network Active CN109284733B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811197781.2A CN109284733B (en) 2018-10-15 2018-10-15 Shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811197781.2A CN109284733B (en) 2018-10-15 2018-10-15 Shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural network

Publications (2)

Publication Number Publication Date
CN109284733A CN109284733A (en) 2019-01-29
CN109284733B true CN109284733B (en) 2021-02-02

Family

ID=65176535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811197781.2A Active CN109284733B (en) 2018-10-15 2018-10-15 Shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural network

Country Status (1)

Country Link
CN (1) CN109284733B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948490A (en) * 2019-03-11 2019-06-28 浙江工业大学 A kind of employee's specific behavior recording method identified again based on pedestrian
CN109919135A (en) * 2019-03-27 2019-06-21 华瑞新智科技(北京)有限公司 Behavioral value method, apparatus based on deep learning
CN110222942B (en) * 2019-05-14 2022-11-25 北京天正聚合科技有限公司 Method and device for identifying shopping guide, electronic equipment and storage medium
CN110210750A (en) * 2019-05-29 2019-09-06 北京天正聚合科技有限公司 A kind of method, apparatus, electronic equipment and storage medium identifying Shopping Guide's business
CN110414421B (en) * 2019-07-25 2023-04-07 电子科技大学 Behavior identification method based on continuous frame images
US20210150347A1 (en) * 2019-11-14 2021-05-20 Qualcomm Incorporated Guided training of machine learning models with convolution layer feature data fusion
CN113051967A (en) * 2019-12-26 2021-06-29 广州慧睿思通科技股份有限公司 Monitoring method, device, server and computer readable storage medium
CN111309954B (en) * 2020-02-24 2023-10-17 浙江力石科技股份有限公司 Scenic spot shopping guide behavior identification system
CN111461169B (en) * 2020-03-04 2023-04-07 浙江工商大学 Pedestrian attribute identification method based on forward and reverse convolution and multilayer branch depth network
CN111291840A (en) * 2020-05-12 2020-06-16 成都派沃智通科技有限公司 Student classroom behavior recognition system, method, medium and terminal device
CN112183397A (en) * 2020-09-30 2021-01-05 四川弘和通讯有限公司 Method for identifying sitting protective fence behavior based on cavity convolutional neural network
CN112016527B (en) * 2020-10-19 2022-02-01 成都大熊猫繁育研究基地 Panda behavior recognition method, system, terminal and medium based on deep learning
CN112307976B (en) * 2020-10-30 2024-05-10 北京百度网讯科技有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN113111859B (en) * 2021-05-12 2022-04-19 吉林大学 License plate deblurring detection method based on deep learning
CN117274953A (en) * 2023-09-28 2023-12-22 深圳市厚朴科技开发有限公司 Vehicle and pedestrian attribute identification method system, device and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169421A (en) * 2017-04-20 2017-09-15 华南理工大学 A kind of car steering scene objects detection method based on depth convolutional neural networks
CN108510000A (en) * 2018-03-30 2018-09-07 北京工商大学 The detection and recognition methods of pedestrian's fine granularity attribute under complex scene

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8195598B2 (en) * 2007-11-16 2012-06-05 Agilence, Inc. Method of and system for hierarchical human/crowd behavior detection

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169421A (en) * 2017-04-20 2017-09-15 华南理工大学 A kind of car steering scene objects detection method based on depth convolutional neural networks
CN108510000A (en) * 2018-03-30 2018-09-07 北京工商大学 The detection and recognition methods of pedestrian's fine granularity attribute under complex scene

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Vision based real-time fish detection using convolutional neural network; Minsung Sung et al.; IEEE; 2017-10-26; pp. 1-6 *
Pedestrian detection based on deep convolutional neural network; Rui Ting et al.; Computer Engineering and Applications; 2015-08-19; pp. 162-166 *

Also Published As

Publication number Publication date
CN109284733A (en) 2019-01-29

Similar Documents

Publication Publication Date Title
CN109284733B (en) Shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural network
US10726244B2 (en) Method and apparatus detecting a target
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
US10891465B2 (en) Methods and apparatuses for searching for target person, devices, and media
CN106960195B (en) Crowd counting method and device based on deep learning
JP6397144B2 (en) Business discovery from images
CN110532970B (en) Age and gender attribute analysis method, system, equipment and medium for 2D images of human faces
CN108875600A (en) A kind of information of vehicles detection and tracking method, apparatus and computer storage medium based on YOLO
CN111310662B (en) Flame detection and identification method and system based on integrated deep network
CN111797653A (en) Image annotation method and device based on high-dimensional image
CN110503076B (en) Video classification method, device, equipment and medium based on artificial intelligence
CN111242127A (en) Vehicle detection method with granularity level multi-scale characteristics based on asymmetric convolution
WO2021051547A1 (en) Violent behavior detection method and system
CN110765882B (en) Video tag determination method, device, server and storage medium
CN108564673A (en) A kind of check class attendance method and system based on Global Face identification
CN111738344A (en) Rapid target detection method based on multi-scale fusion
US20190236738A1 (en) System and method for detection of identity fraud
CN108446688B (en) Face image gender judgment method and device, computer equipment and storage medium
CN109670065A (en) Question and answer processing method, device, equipment and storage medium based on image recognition
CN110879982A (en) Crowd counting system and method
CN113205002B (en) Low-definition face recognition method, device, equipment and medium for unlimited video monitoring
CN112101195A (en) Crowd density estimation method and device, computer equipment and storage medium
CN112132279A (en) Convolutional neural network model compression method, device, equipment and storage medium
CN115115825B (en) Method, device, computer equipment and storage medium for detecting object in image
KR101961462B1 (en) Object recognition method and the device thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant