CN113312957A - Off-duty identification method, device, equipment and storage medium based on video image - Google Patents

Off-duty identification method, device, equipment and storage medium based on video image

Info

Publication number
CN113312957A
CN113312957A (application CN202110267158.5A)
Authority
CN
China
Prior art keywords
model
image
station
shift
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110267158.5A
Other languages
Chinese (zh)
Inventor
李斯
赵齐辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongpu Software Co Ltd
Original Assignee
Dongpu Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongpu Software Co Ltd filed Critical Dongpu Software Co Ltd
Priority to CN202110267158.5A
Publication of CN113312957A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Abstract

The invention relates to the technical field of monitoring, and discloses a video-image-based off-duty identification method, device, equipment and storage medium. The method comprises the following steps: acquiring a video stream, and labeling the station information in each frame of video image in the video stream to obtain corresponding labeled images; generating a training sample set from the labeled images, and inputting it into a RetinaNet model for classification training to obtain an off-duty recognition model; and acquiring multiple frames of office scene images under the monitoring environment, inputting them into the off-duty recognition model for recognition, and outputting the off-duty recognition result of each station under the corresponding monitoring environment. By using the trained video-image-based off-duty recognition model to recognize and monitor the stations in the office scene, whether staff are off duty is monitored automatically, and monitoring efficiency is improved.

Description

Off-duty identification method, device, equipment and storage medium based on video image
Technical Field
The invention relates to the technical field of monitoring, and in particular to a video-image-based off-duty identification method, device, equipment and storage medium.
Background
A remote video monitoring system can reach any corner of the world through a standard telephone line, a network, mobile broadband, an ISDN data line or a direct connection, and can control pan/tilt and lens movements and store surveillance video. A remote transmission monitoring system transmits a remote scene of activity to the viewer's computer screen over an ordinary telephone line, and can dial back to the receiving end to raise an alarm when one is triggered. An existing video monitoring system generally comprises a monitor and a monitoring terminal: the monitor films the monitored object and transmits the captured video to the monitoring terminal.
However, in the logistics management industry, monitored objects such as the offices of various network points and distribution centers are often scheduled flexibly according to how busy each site is, with service handovers taking place in different time periods. In the course of realizing the invention, the inventor found that the existing monitoring technology monitors a fixed, preset time period, cannot flexibly monitor office conditions in real time according to the monitored object, and cannot meet users' requirements.
Disclosure of Invention
The invention mainly aims to solve the technical problems in the prior art that video streams are insufficiently analyzed, that considerable manpower is needed to supervise the office conditions of a distribution center, and that supervision efficiency is low.
A first aspect of the invention provides an off-duty identification method based on video images, which comprises the following steps:
acquiring a video stream, and extracting the video images in the video stream, wherein the video stream comprises at least two frames of video images;
identifying the station information in the video images, and labeling the station information to obtain labeled images, wherein the labels at least comprise a first label indicating that no person is present and a second label indicating that a person is present;
generating a model training sample set from the labeled images, and inputting the model training sample set into a preset target RetinaNet model for off-duty recognition training to obtain an off-duty recognition model;
acquiring at least two frames of office scene images under the monitoring environment, and inputting them into the off-duty recognition model for recognition to obtain the off-duty monitoring result of each station under the monitoring environment, wherein the at least two frames of office scene images are temporally consecutive station images.
Optionally, in a first implementation manner of the first aspect of the present invention, identifying the station information in the video image and labeling the station information to obtain a labeled image comprises:
acquiring an annotation request for annotating the video image;
identifying the stations in the video image to obtain the area range of each station;
extracting the corresponding first station image from the video image based on the area range of the station;
performing feature extraction on the first station image to obtain the image features of the first station image, wherein the image features comprise geometric features, texture features and semantic features;
determining the image information of the first station image according to the image features;
and labeling the first station image according to the image information to obtain the labeled image.
Optionally, in a second implementation manner of the first aspect of the present invention, acquiring at least two frames of office scene images under the monitoring environment and inputting them into the off-duty recognition model for recognition to obtain the off-duty monitoring result of each station under the monitoring environment includes:
acquiring at least two frames of office scene images of the distribution center captured in real time, and inputting them into the off-duty recognition model;
obtaining the area range of each station in the at least two frames of office scene images through the off-duty recognition model;
extracting, according to the area range of each station, the second station image corresponding to each station from the at least two frames of office scene images;
and inputting the second station images into the off-duty recognition model, performing face recognition on each second station image through the off-duty recognition model, and outputting the corresponding off-duty recognition result.
Optionally, in a third implementation manner of the first aspect of the present invention, the model training sample set includes a first training sample set, and before the model training sample set is input into the preset target RetinaNet model for off-duty recognition training to obtain the off-duty recognition model, the method further includes:
building a RetinaNet network framework, and inputting the first training sample set into the RetinaNet network framework for training to obtain a RetinaNet model;
extracting the first labels and second labels in the first training sample set, and calculating the cross-entropy loss function over the first labels and second labels;
and calculating the weight decay coefficient of the RetinaNet model through the cross-entropy loss function and a preset gradient descent algorithm, and updating the parameters of the RetinaNet model through the weight decay coefficient to obtain the optimized target RetinaNet model.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the model training sample set further includes a second training sample set and a verification training sample set, the target RetinaNet model includes N convolutional layers and M fully-connected layers, and inputting the model training sample set into the preset target RetinaNet model for off-duty recognition training to obtain the off-duty recognition model includes:
sequentially inputting the labeled images in the verification training sample set into the N convolutional layers, respectively extracting through the N convolutional layers the station features corresponding to the first and second labels of the labeled images in the verification training sample set, and combining the station features to obtain a station feature map;
inputting the station feature map into the M fully-connected layers, and performing station prediction on the station feature map through the M fully-connected layers to obtain the prediction labels of the stations in the verification training sample set;
calculating, according to the cross-entropy loss function, the first loss values between the prediction labels and the first and second labels in the verification training sample set, and determining the number of iterations Y of the target RetinaNet model based on the first loss values;
performing iterative training on the target RetinaNet model according to the number of iterations and the second training sample set to obtain Y second loss values;
judging whether the first loss values and the Y second loss values are in a convergence relationship;
and if so, outputting the iteratively trained target RetinaNet model as the off-duty recognition model.
Optionally, in a fifth implementation manner of the first aspect of the present invention, building a RetinaNet network framework and inputting the first training sample set into the RetinaNet network framework for training to obtain a RetinaNet model includes:
constructing a RetinaNet network framework, and configuring the loss function, initial learning rate and number of iterations of the RetinaNet network framework;
converting the format of the video images and the corresponding data set, and inputting the format-converted data set into the RetinaNet network framework for training;
and iteratively correcting the RetinaNet network framework based on its loss function, initial learning rate and number of iterations to obtain a RetinaNet model.
A second aspect of the present invention provides an off-duty identification device based on video images, comprising:
an acquisition module, configured to acquire a video stream and extract the video images in the video stream, wherein the video stream comprises at least two frames of video images;
an annotation module, configured to identify the station information in the video images and label the station information to obtain labeled images, wherein the labels at least comprise a first label indicating that no person is present and a second label indicating that a person is present;
a training module, configured to generate a model training sample set from the labeled images and input the model training sample set into a preset target RetinaNet model for off-duty recognition training to obtain an off-duty recognition model;
and an identification module, configured to acquire at least two frames of office scene images under the monitoring environment and input them into the off-duty recognition model for recognition to obtain the off-duty monitoring result of each station under the monitoring environment, wherein the at least two frames of office scene images are temporally consecutive station images.
Optionally, in a first implementation manner of the second aspect of the present invention, the annotation module is specifically configured to:
acquire an annotation request for annotating the video image;
identify the stations in the video image to obtain the area range of each station;
extract the corresponding first station image from the video image based on the area range of the station;
perform feature extraction on the first station image to obtain the image features of the first station image, wherein the image features comprise geometric features, texture features and semantic features;
determine the image information of the first station image according to the image features;
and label the first station image according to the image information to obtain the labeled image.
Optionally, in a second implementation manner of the second aspect of the present invention, the identification module is specifically configured to:
acquire at least two frames of office scene images of the distribution center captured in real time, and input them into the off-duty recognition model;
obtain the area range of each station in the at least two frames of office scene images through the off-duty recognition model;
extract, according to the area range of each station, the second station image corresponding to each station from the at least two frames of office scene images;
and input the second station images into the off-duty recognition model, perform face recognition on each second station image through the off-duty recognition model, and output the corresponding off-duty recognition result.
Optionally, in a third implementation manner of the second aspect of the present invention, the model training sample set includes a first training sample set, and the off-duty identification device based on video images further comprises:
a building module, configured to build a RetinaNet network framework and input the first training sample set into the RetinaNet network framework for training to obtain a RetinaNet model;
a calculation module, configured to extract the first labels and second labels in the first training sample set and calculate the cross-entropy loss function over the first labels and second labels;
and an updating module, configured to calculate the weight decay coefficient of the RetinaNet model through the cross-entropy loss function and a preset gradient descent algorithm, and update the parameters of the RetinaNet model through the weight decay coefficient to obtain the optimized target RetinaNet model.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the model training sample set further includes a second training sample set and a verification training sample set, and the training module comprises:
a feature extraction unit, configured to sequentially input the labeled images in the verification training sample set into the N convolutional layers, respectively extract through the N convolutional layers the station features corresponding to the first and second labels of those labeled images, and combine the station features to obtain a station feature map;
a prediction unit, configured to input the station feature map into the M fully-connected layers and perform station prediction on the station feature map through the M fully-connected layers to obtain the prediction label of each station in the verification training sample set;
a calculation unit, configured to calculate, according to the cross-entropy loss function, the first loss values between the prediction labels and the first and second labels in the verification training sample set, and determine the number of iterations Y of the target RetinaNet model based on the first loss values;
a training unit, configured to perform iterative training on the target RetinaNet model according to the number of iterations and the second training sample set to obtain Y second loss values;
and a judging unit, configured to judge whether the first loss values and the Y second loss values are in a convergence relationship, and if so, output the iteratively trained target RetinaNet model as the off-duty recognition model.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the building module is specifically configured to:
construct a RetinaNet network framework, and configure the loss function, initial learning rate and number of iterations of the RetinaNet network framework;
convert the format of the video images and the corresponding data set, and input the format-converted data set into the RetinaNet network framework for training;
and iteratively correct the RetinaNet network framework based on its loss function, initial learning rate and number of iterations to obtain a RetinaNet model.
A third aspect of the present invention provides an off-duty identification equipment based on video images, comprising: a memory storing instructions, and at least one processor, the memory and the at least one processor being interconnected by a line;
the at least one processor invokes the instructions in the memory to cause the off-duty identification equipment based on video images to perform the steps of the off-duty identification method based on video images described above.
A fourth aspect of the present invention provides a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the steps of the off-duty identification method based on video images described above.
According to the technical scheme provided by the invention, a video stream is acquired, and the station information in each frame of video image in the video stream is labeled to obtain corresponding labeled images; a training sample set is generated from the labeled images and input into a RetinaNet model for classification training to obtain an off-duty recognition model; multiple frames of office scene images under the monitoring environment are acquired and input into the off-duty recognition model for recognition, and the off-duty recognition result of each station under the corresponding monitoring environment is output. By using the trained video-image-based off-duty recognition model to recognize and monitor the stations in the office scene, whether staff are off duty is monitored automatically, and monitoring efficiency is improved.
Drawings
FIG. 1 is a schematic diagram of a first embodiment of an off Shift identification method based on video images according to the present invention;
FIG. 2 is a schematic diagram of a second embodiment of the off Shift identification method based on video images according to the present invention;
FIG. 3 is a schematic diagram of a third embodiment of the off Shift identification method based on video images according to the present invention;
FIG. 4 is a schematic diagram of a fourth embodiment of the off Shift identification method based on video images according to the present invention;
FIG. 5 is a schematic diagram of a fifth embodiment of the off Shift identification method based on video images according to the present invention;
FIG. 6 is a schematic diagram of a first embodiment of an off Shift identification apparatus based on video images according to the present invention;
FIG. 7 is a schematic diagram of a second embodiment of an off Shift identification apparatus based on video images according to the present invention;
fig. 8 is a schematic diagram of an embodiment of the off Shift identification apparatus based on video image according to the present invention.
Detailed Description
The embodiment of the invention provides an off-duty identification method, device, equipment and storage medium based on video images. First, a video stream is acquired, and the station information in each frame of video image in the video stream is labeled to obtain corresponding labeled images; a training sample set is generated from the labeled images and input into a RetinaNet model for classification training to obtain an off-duty recognition model; multiple frames of office scene images under the monitoring environment are then acquired and input into the off-duty recognition model for recognition, and the off-duty recognition result of each station under the corresponding monitoring environment is output. By using the trained video-image-based off-duty recognition model to recognize and monitor the stations in the office scene, whether staff are off duty is monitored automatically, and monitoring efficiency is improved.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, a specific flow of an embodiment of the present invention is described below. Referring to fig. 1, a first embodiment of the off-duty identification method based on video images in an embodiment of the present invention comprises:
101. Acquiring a video stream, and extracting the video images in the video stream;
It is understood that the execution subject of the present invention may be an off-duty identification device based on video images, and may also be a terminal or a server, which is not limited here. The embodiment of the present invention is described with a server as the execution subject.
In this embodiment, the office scene video refers to video of the office area of the distribution center within a preset time period (working hours), and the office scene video of the distribution center is first shot by a camera or similar device. For example, all surveillance videos are connected to a local area network, so all cameras can be accessed through a DSS platform. The DSS platform has a screenshot function: the office scene images of the distribution center captured by screenshot are stored in bmp format, giving about 900 (or more) sample images. According to the range of the station to be recognized in each image, face recognition is carried out on the area range of that station to judge whether a worker is at the station.
102. Identifying station information in the video image, and labeling the station information to obtain a labeled image;
In this embodiment, the video image is input into preset image annotation software for display; Labelme is preferred as the image annotation software. The station in the image is selected manually, through an interactive device, with a closed line joined end to end. The server delimits the station area in the video image according to the position coordinates corresponding to the closed line, obtaining an image containing the labeled station area range, namely the labeling information. Finally, the labeling information is written into a blank file in a preset JSON format, thereby obtaining a data set in JSON format.
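As a sketch, writing the labeling information into a blank JSON-format file as described above might look as follows in Python; the field names, file names and coordinates are hypothetical illustrations, not taken from the patent:

    import json

    # Hypothetical Labelme-like record: one closed polygon per station,
    # with the first/second labels from the method (no person / person).
    annotation = {
        "image": "distribution_center_0001.bmp",
        "stations": [
            {"points": [[120, 80], [360, 80], [360, 300], [120, 300]],
             "label": "occupied"},      # second label: a person is present
            {"points": [[400, 80], [640, 80], [640, 300], [400, 300]],
             "label": "unoccupied"},    # first label: no person
        ],
    }
    with open("annotation_0001.json", "w") as f:
        json.dump(annotation, f, indent=2)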
103. Generating a model training sample set according to the labeled images, and inputting the model training sample set into a preset target RetinaNet model for off-duty recognition training to obtain an off-duty recognition model;
In this embodiment, the prepared data set is divided by script code into a training sample set, a verification sample set and a test sample set in the proportions 60%, 30% and 10% respectively. The RetinaNet model requires the pictures to be preprocessed: the RGB format is converted into BGR format, the picture size is adjusted to 224 × 224 × 3, the picture is normalized, and the normalized picture is then input into the optimized RetinaNet model for training.
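A minimal sketch of the 60%/30%/10% split performed by the script code; the sample file names and the fixed random seed are illustrative assumptions:

    import random

    # 900 annotated samples, as mentioned in this description.
    samples = [f"annotation_{i:04d}.json" for i in range(900)]
    random.seed(0)
    random.shuffle(samples)

    n = len(samples)
    train_set = samples[:int(0.6 * n)]                # 60% training samples
    verify_set = samples[int(0.6 * n):int(0.9 * n)]   # 30% verification samples
    test_set = samples[int(0.9 * n):]                 # 10% test samples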
Normalization, also called data standardization, mainly comprises min-max normalization and mean-variance (z-score) normalization. Generally, when we obtain data for model training, the magnitudes vary enormously: some features may be on the order of 200000000 while others are 0.00123. Computing with very large numbers is time consuming, and the results become abnormally large; moreover, the weights become unevenly distributed, with large-valued features receiving larger weights. A large value is not necessarily the most decisive factor for the result, yet it dominates simply because it is large. The remedy is to map all the data onto the same scale, i.e., to normalize the image data.
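A minimal sketch of the two normalization schemes named above, using the magnitudes from the example (200000000 versus 0.00123):

    import numpy as np

    def min_max_normalize(x: np.ndarray) -> np.ndarray:
        """Map all values onto [0, 1] so no feature dominates by magnitude."""
        return (x - x.min()) / (x.max() - x.min())

    def mean_variance_normalize(x: np.ndarray) -> np.ndarray:
        """Zero-mean, unit-variance (z-score) normalization."""
        return (x - x.mean()) / x.std()

    features = np.array([200000000.0, 0.00123, 35.0])
    print(min_max_normalize(features))        # all values now on one scale
    print(mean_variance_normalize(features))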
In this embodiment, the retinet model includes 16 Convolutional layers and 3 fully-connected layers, each Convolutional layer (Convolutional layer) in the Convolutional neural network is composed of a plurality of Convolutional units, and parameters of each Convolutional unit are optimized through a back propagation algorithm. The convolution operation aims to extract different input features, the convolution layer at the first layer can only extract some low-level features such as edges, lines, angles and other levels, and more layers of networks can iteratively extract more complex features from the low-level features. The effect of the convolutional layer is local perception, which is that, rather than identifying the whole picture at once when we see a picture, the convolutional layer firstly locally perceives each feature in the picture, and then performs comprehensive operation on the local part at a higher level, so as to obtain global information.
A fully-connected layer (FC) is a flat structure composed of many neurons (e.g., 1 × 4096); its function is mainly classification, and it plays the role of classifier in the whole convolutional neural network. If operations such as convolutional layers, pooling layers and activation-function layers map the raw data into a hidden-layer feature space, the fully-connected layer maps the learned "distributed feature representation" into the sample label space. In practice, a fully-connected layer can be implemented by a convolution operation: a fully-connected layer whose preceding layer is also fully connected can be converted into a convolution with 1 × 1 kernels, and a fully-connected layer whose preceding layer is a convolutional layer can be converted into a global convolution with an h × w kernel, where h and w are the height and width of the preceding layer's output. Taking VGG-16 as an example, for a 224 × 224 × 3 input, the final convolution produces a 7 × 7 × 512 output; if the next layer is an FC layer with 4096 neurons, the fully-connected operation can be implemented as a global convolution with filter size 7, padding 0, stride 1, D_in = 512 and D_out = 4096, yielding a 1 × 1 × 4096 output.
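The FC-as-global-convolution equivalence described above can be checked numerically; the sketch below uses random data with the 7 × 7 × 512 to 4096 shapes quoted in the text:

    import numpy as np

    rng = np.random.default_rng(0)
    feat = rng.standard_normal((7, 7, 512))           # final conv output
    weights = rng.standard_normal((4096, 7, 7, 512))  # one kernel per neuron

    # Fully-connected view: flatten the feature map and multiply.
    fc_out = weights.reshape(4096, -1) @ feat.reshape(-1)

    # Global-convolution view: each 7x7x512 kernel applied at the single
    # valid position (filter size 7, padding 0, stride 1).
    conv_out = np.einsum("khwc,hwc->k", weights, feat)

    assert np.allclose(fc_out, conv_out)              # identical 1x4096 output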
The fully-connected layer, hereinafter FC, can act as a "firewall" when a model's representation capability is transferred. Specifically, for a model pre-trained on ImageNet, ImageNet can be regarded as the source domain (in the sense of transfer learning). Fine-tuning is the most common transfer-learning technique in the deep learning field. During fine-tuning, if the images in the target domain differ greatly from those in the source domain (e.g., compared with ImageNet, the target images are not object-centered but landscape photographs), a network without FC layers gives worse results after fine-tuning than one with FC layers. The FC layer can therefore be viewed as a "firewall" for the model's representation capability: particularly when the source domain differs significantly from the target domain, the FC layer maintains a large model capacity and ensures that the representation capability transfers.
104. Acquiring at least two frames of office scene images under the monitoring environment, and inputting them into the off-duty recognition model for recognition to obtain the off-duty monitoring result of each station under the monitoring environment.
In this embodiment, live images of the office scene of the distribution center captured in real time are input into the video-image-based off-duty recognition model, and the stations in each live image are recognized according to the size and shape information of the stations in the image. For example, after a series of processing steps on the picture, the size of the corresponding station in the picture is obtained, it is determined whether the object at the station is a human face, and from this it is judged whether the worker at that station is off duty.
In the embodiment of the invention, a video stream is acquired, and the station information in each frame of video image in the video stream is labeled to obtain corresponding labeled images; a training sample set is generated from the labeled images and input into a RetinaNet model for classification training to obtain an off-duty recognition model; multiple frames of office scene images under the monitoring environment are acquired and input into the off-duty recognition model for recognition, and the off-duty recognition result of each station under the corresponding monitoring environment is output. By using the trained video-image-based off-duty recognition model to recognize and monitor the stations in the office scene, whether staff are off duty is monitored automatically, and monitoring efficiency is improved.
Referring to fig. 2, a second embodiment of the off-duty identification method based on video images according to the embodiment of the present invention includes:
201. Acquiring a video stream, and extracting the video images in the video stream;
202. Building a RetinaNet network framework, and configuring the loss function, initial learning rate and number of iterations of the RetinaNet network framework;
In this embodiment, a RetinaNet network framework comprising 16 convolutional layers and 3 fully-connected layers is established, specifically: an input layer; two conv3-64 convolutional layers; a pool_1 pooling layer; two conv3-128 convolutional layers; a pool_2 pooling layer; four conv3-256 convolutional layers; a pool_3 pooling layer; four conv3-512 convolutional layers; a pool_4 pooling layer; four further conv3-512 convolutional layers; a pool_5 pooling layer; and three fully-connected layers FC_6, FC_7 and FC_8. Here conv denotes a convolutional layer and FC a fully-connected layer; conv3 denotes a convolutional layer using 3 × 3 filters, conv3-64 denotes a depth of 64, and maxpool denotes max pooling.
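As a minimal sketch, the 16-conv + 3-FC layout above can be written in Keras-style Python as follows; the per-block filter counts follow the conv3-64 to conv3-512 notation in this description, while the ReLU activations and other layout details are assumptions:

    from tensorflow.keras import layers, models

    def build_backbone(num_classes: int = 2) -> models.Sequential:
        """16 conv3 layers in five blocks, each block ending in max pooling,
        followed by three fully-connected layers (FC_6, FC_7, FC_8)."""
        model = models.Sequential([layers.Input(shape=(224, 224, 3))])
        for filters, reps in [(64, 2), (128, 2), (256, 4), (512, 4), (512, 4)]:
            for _ in range(reps):
                model.add(layers.Conv2D(filters, 3, padding="same",
                                        activation="relu"))
            model.add(layers.MaxPooling2D(2))
        model.add(layers.Flatten())
        model.add(layers.Dense(4096, activation="relu"))            # FC_6
        model.add(layers.Dense(4096, activation="relu"))            # FC_7
        model.add(layers.Dense(num_classes, activation="softmax"))  # FC_8
        return model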
Before training, the loss function, initial learning rate and number of iterations of the RetinaNet network framework are defined. A new weight coefficient is calculated through the loss function and the weights are updated, completing one training iteration. The network repeats this process over all images for a fixed number of passes, updating the weights whenever the computed loss value decreases, until the preset number of iterations is reached, thereby obtaining the RetinaNet network framework and its weights.
203. Converting the format of the video images and the corresponding data set, and inputting the format-converted data set into the RetinaNet network framework for training;
In this embodiment, the RetinaNet network framework needs the pictures to be preprocessed: RGB is converted to BGR, each picture is resized to 224 × 224 × 3, and the per-pixel mean computed on ImageNet is subtracted from each pixel. Training then starts from a model pre-trained on ImageNet, with batch_size chosen as 4; 20 epochs are trained, and the model information is stored in h5 format.
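A sketch of this preprocessing, assuming OpenCV; the per-channel mean values shown are the commonly used ImageNet (Caffe/VGG) means, which the patent does not state explicitly:

    import cv2
    import numpy as np

    IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

    def preprocess(image_rgb: np.ndarray) -> np.ndarray:
        """RGB -> BGR, resize to 224 x 224 x 3, subtract the ImageNet mean."""
        bgr = cv2.cvtColor(image_rgb, cv2.COLOR_RGB2BGR)
        resized = cv2.resize(bgr, (224, 224)).astype(np.float32)
        return resized - IMAGENET_MEAN_BGR

    demo = preprocess(np.zeros((480, 640, 3), dtype=np.uint8))  # placeholder frame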
An epoch refers to one complete forward and backward pass of the whole data set through the neural network (i.e., every training sample has been propagated forward and backward once); one epoch is thus the process of training on all training samples once. When the number of samples in an epoch is too large for the computer, it is divided into several smaller blocks, i.e., several batches, for training. Batch refers to dividing the whole training set into several portions, and batch size (batch_size) is the number of samples in each batch. For example, with about 900 sample images and a batch size of 4, one epoch consists of roughly 225 iterations.
204. Iteratively correcting the RetinaNet network framework based on its loss function, initial learning rate and number of iterations to obtain a RetinaNet model;
In this embodiment, the loss function measures the degree of inconsistency between the model's predicted values and the true values; it is a non-negative real-valued function, and the smaller its value, the higher the accuracy of the RetinaNet model. The loss is computed according to the predefined loss function, and the optimized RetinaNet model is generated by minimizing it.
Iterative training is a model-training mode in deep learning used to optimize the model. The iterative training in this step is realized as follows: first, a target loss function of the RetinaNet model is constructed, and cyclic training is performed with an optimization algorithm such as SGD (stochastic gradient descent); in each training cycle, all video images are read in sequence, the current loss of the RetinaNet model is calculated, and the direction of gradient descent is determined by the optimization algorithm, so that the target loss function gradually decreases and reaches a stable state, thereby optimizing every parameter of the constructed network model.
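The cyclic training described here can be sketched as follows; a deliberately tiny stand-in network and placeholder data keep the example self-contained, while the real framework has the 16-conv + 3-FC structure described earlier:

    import numpy as np
    from tensorflow.keras import layers, losses, models, optimizers

    model = models.Sequential([
        layers.Input(shape=(224, 224, 3)),
        layers.Conv2D(8, 3, activation="relu"),   # stand-in for the backbone
        layers.GlobalAveragePooling2D(),
        layers.Dense(2, activation="softmax"),    # occupied / unoccupied
    ])
    model.compile(optimizer=optimizers.SGD(learning_rate=0.01),
                  loss=losses.SparseCategoricalCrossentropy())

    x = np.random.rand(8, 224, 224, 3).astype("float32")  # placeholder images
    y = np.array([0, 1] * 4)                              # placeholder labels
    model.fit(x, y, batch_size=4, epochs=2, verbose=0)    # cyclic SGD training
    model.save("off_duty_model.h5")                       # h5 format, as above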
Loss-function convergence means that the loss approaches 0, for example falls below 0.1. When the value the model outputs for a given sample (positive or negative) approaches 0.5, the model is considered unable to distinguish positive from negative samples, i.e., its output has converged; training is then stopped, and the model parameters from the last training round are used as the parameters of the model, yielding the trained RetinaNet model.
205. Extracting the first labels and second labels in the first training sample set, and calculating the cross-entropy loss function over the first labels and second labels;
In this embodiment, the corresponding label result is determined from the video image, where the label refers to whether a person is at the station in the picture: a person at the station, or no person at the station.
Cross entropy measures the degree of difference between two probability distributions over the same random variable; in machine learning, it expresses the difference between the true probability distribution and the predicted probability distribution. The smaller the cross entropy, the better the model's prediction. In classification problems, cross entropy is usually paired with softmax: softmax processes the output so that the predicted values of the classes sum to 1, and the loss is then calculated through cross entropy.
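A minimal sketch of the softmax-plus-cross-entropy pairing described above, with the two station classes of this patent:

    import numpy as np

    def softmax(logits: np.ndarray) -> np.ndarray:
        """Normalize raw scores so the class probabilities sum to 1."""
        exp = np.exp(logits - logits.max())   # shift for numerical stability
        return exp / exp.sum()

    def cross_entropy(probs: np.ndarray, true_class: int) -> float:
        """Difference between predicted and true distributions."""
        return float(-np.log(probs[true_class]))

    logits = np.array([0.3, 2.1])       # class 0: no person, class 1: person
    probs = softmax(logits)             # e.g. [0.14, 0.86]
    loss = cross_entropy(probs, 1)      # small loss: the prediction is right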
In this embodiment, since the loss function is a function for measuring the degree of inconsistency between the predicted value and the true value obtained by the retinet model, the smaller the loss function is, the better the performance of the retinet model is, and therefore, the loss function can be optimized by calculating the gradient of the loss function until the loss function reaches the minimum value.
As an embodiment, the gradient of the loss function may be calculated by a gradient descent method, and it is determined whether the parameters of the convolutional network layer in the retinet model need to be updated; if the label is updated, the label result is obtained in a recycling mode, and the loss function is calculated until the loss function reaches the minimum value.
206. Calculating the weight decay coefficient of the RetinaNet model through the cross-entropy loss function and a preset gradient descent algorithm, and updating the parameters of the RetinaNet model through the weight decay coefficient to obtain the optimized target RetinaNet model;
In this embodiment, after the gradient and the weight decay coefficient corresponding to the loss function are obtained from the cross-entropy loss function and the preset gradient descent algorithm, and the parameters of the RetinaNet model are updated according to the gradient values, it must be checked whether the loss function meets a preset convergence condition. The preset convergence condition means that the loss function reaches its minimum; concretely, it may be a preset number of iterations or a preset value set according to experience. That is, the parameters of the RetinaNet model are updated through the weight decay coefficient, and when the number of iterations reaches the preset number or the loss function reaches the preset value, the parameter updates stop and the target RetinaNet model is obtained.
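As a sketch of the parameter update described here, one gradient-descent step with a weight decay coefficient might look as follows; the coefficient value, learning rate and parameter values are illustrative assumptions:

    import numpy as np

    def sgd_weight_decay_step(w, grad, lr=0.01, weight_decay=5e-4):
        """One update: gradient step plus a weight decay term that shrinks
        each parameter in proportion to its own magnitude."""
        return w - lr * (grad + weight_decay * w)

    w = np.array([0.8, -1.2])               # model parameters
    grad = np.array([0.05, -0.10])          # gradient of the loss
    w = sgd_weight_decay_step(w, grad)      # repeated until convergence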
207. Identifying the station information in the video images, and labeling the station information to obtain labeled images;
208. Generating a model training sample set according to the labeled images, and inputting the model training sample set into the preset target RetinaNet model for off-duty recognition training to obtain an off-duty recognition model;
209. Acquiring at least two frames of office scene images under the monitoring environment, and inputting them into the off-duty recognition model for recognition to obtain the off-duty monitoring result of each station under the monitoring environment.
Steps 207 to 209 in this embodiment are similar to steps 102 to 104 in the first embodiment and are not described again here.
In the embodiment of the invention, a video stream is acquired, and the station information in each frame of video image in the video stream is labeled to obtain corresponding labeled images; a training sample set is generated from the labeled images and input into a RetinaNet model for classification training to obtain an off-duty recognition model; multiple frames of office scene images under the monitoring environment are acquired and input into the off-duty recognition model for recognition, and the off-duty recognition result of each station under the corresponding monitoring environment is output. By using the trained video-image-based off-duty recognition model to recognize and monitor the stations in the office scene, whether staff are off duty is monitored automatically, and monitoring efficiency is improved.
Referring to fig. 3, a third embodiment of the off-duty identification method based on video images according to the embodiment of the present invention includes:
301. Acquiring a video stream, and extracting the video images in the video stream;
In this embodiment, the office scene video refers to video of the office area of the distribution center within a preset time period (working hours), and the office scene video of the distribution center is first shot by a camera or similar device. For example, all surveillance videos are connected to a local area network, so all cameras can be accessed through a DSS platform.
302. Acquiring an annotation request for annotating the video image;
In this embodiment, the server receives a picture annotation request triggered by the user terminal through the target application, parses the request, and obtains from the terminal the picture to be annotated corresponding to the request. Picture annotation requests and pictures to be annotated correspond one to one, so the unique picture to be annotated can be determined through the request.
The target application is the application program currently used by the user; this application program does not itself provide a picture annotation function, so when the user needs to perform an annotation operation, the target application triggers a picture annotation request according to the user's annotation requirements and sends it to the Web server.
303. Identifying the stations in the video image to obtain the area range of each station;
In this embodiment, after a pre-trained station recognition model is invoked, each frame of the office scene video is obtained through real-time snapshot, screenshot or similar means; each frame contains an office scene image, which is then input into the station recognition model. The station recognition model can mark out the stations in the scene image with circular, rectangular or other-shaped frames, thereby obtaining the area range of each station in the office scene image.
304. Extracting the corresponding first station image from the video image based on the area range of the station;
In this embodiment, according to the area range of each station in the office scene image, the area of each station is cut out of the office scene image, so as to extract the station image corresponding to each station; each station image is a part of the video image.
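A minimal sketch of cutting a station image out of a frame given its area range; the (x1, y1, x2, y2) box format and frame size are assumptions for illustration:

    import numpy as np

    def crop_station(frame: np.ndarray, box: tuple) -> np.ndarray:
        """Cut a station's area range (x1, y1, x2, y2) out of a video frame."""
        x1, y1, x2, y2 = box
        return frame[y1:y2, x1:x2]

    frame = np.zeros((1080, 1920, 3), dtype=np.uint8)     # placeholder frame
    station_image = crop_station(frame, (100, 200, 420, 560))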
305. Performing feature extraction on the first station image to obtain the image features of the first station image;
In this embodiment, the features comprise low-level geometric features, mid-level texture features and high-level semantic features of the picture in the convolutional layers. The low-level geometric features are the geometric shape and size of each object in the picture; the mid-level texture features are used to distinguish the categories of objects, such as plants, animals and buildings; and the high-level semantic features segment the picture according to the meanings the objects express, i.e., they distinguish identical objects within the picture. By extracting these hierarchical features, the object categories in a picture can be expressed and distinguished more accurately, and the picture can be labeled based on them. For example, a picture may include ground, traffic lines, sidewalks, pedestrians, buildings, trees and other infrastructure: the geometric features are, e.g., the shape and size of the ground, the traffic lines and the trees, while the texture features distinguish traffic line, ground and tree.
The picture to be labeled is input into a deep convolutional neural network; after the network receives the picture, it convolves it and extracts the picture's hierarchical features in each convolutional and pooling layer, including features at different scales. In the network structure, at the same scale there is one pooling layer and more than one convolutional layer: the convolutional layers perform feature extraction on the input picture, while the pooling layers compress the input feature map, reducing its size and simplifying the network's computational complexity while extracting the main features.
306. Determining the image information of the first station image according to the image features;
In this embodiment, the image information of the first station image, namely whether a person is present at the station, is determined from the geometric, texture and semantic features extracted in step 305.
307. Labeling the first station image according to the image information to obtain the labeled image;
In this embodiment, the station in the image is selected manually, through an interactive device, with a closed line joined end to end. The interactive device sends the position coordinates corresponding to the closed line to the server. The server delimits the station area in the video image according to the position coordinates, obtaining an image containing the labeled station area range, thereby realizing instance-segmentation labeling of the station image. The image containing the labeled station area range is the required labeled image.
In this embodiment, the labeled image is written into a blank file in a preset JSON format, thereby obtaining a data set in JSON format.
308. Generating a model training sample set according to the labeled images, and inputting the model training sample set into the preset target RetinaNet model for off-duty recognition training to obtain an off-duty recognition model;
309. Acquiring at least two frames of office scene images under the monitoring environment, and inputting them into the off-duty recognition model for recognition to obtain the off-duty monitoring result of each station under the monitoring environment.
Steps 301, 308 and 309 in this embodiment are similar to steps 101, 103 and 104 in the first embodiment and are not described again here.
In the embodiment of the invention, a video stream is acquired, and the station information in each frame of video image in the video stream is labeled to obtain corresponding labeled images; a training sample set is generated from the labeled images and input into a RetinaNet model for classification training to obtain an off-duty recognition model; multiple frames of office scene images under the monitoring environment are acquired and input into the off-duty recognition model for recognition, and the off-duty recognition result of each station under the corresponding monitoring environment is output. By using the trained video-image-based off-duty recognition model to recognize and monitor the stations in the office scene, whether staff are off duty is monitored automatically, and monitoring efficiency is improved.
Referring to fig. 4, a fourth embodiment of the off-duty identification method based on video images according to the embodiment of the present invention includes:
401. Acquiring a video stream, and extracting the video images in the video stream;
402. Identifying the station information in the video images, and labeling the station information to obtain labeled images;
403. Sequentially inputting the labeled images in the verification training sample set into the N convolutional layers, respectively extracting through the N convolutional layers the station features corresponding to the first and second labels of those labeled images, and combining the station features to obtain a station feature map;
In this embodiment, the labeled images in the verification training sample set are input in sequence into the N convolutional layers of the RetinaNet model, and feature extraction is performed on them through those layers to obtain the feature maps of the labeled images. By adding identity shortcut connections, a convolutional layer learns the difference between the features of the previous layer and those of the next layer, i.e., the residual, so that the stacked layers can learn new features on top of the input features and more image features can be extracted.
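A sketch of the identity shortcut connection described above, in Keras-style Python; the layer sizes are illustrative assumptions:

    from tensorflow.keras import Input, Model, layers

    def residual_block(x, filters: int):
        """Identity shortcut: the stacked conv layers learn only the residual,
        i.e. the difference between the input and the desired output features."""
        y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        y = layers.Conv2D(filters, 3, padding="same")(y)
        return layers.Activation("relu")(layers.Add()([x, y]))

    inp = Input(shape=(56, 56, 64))
    out = residual_block(inp, 64)      # output shape matches the input
    model = Model(inp, out)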
404. Inputting the station feature map into the M fully-connected layers, and performing station prediction on the station feature map through the M fully-connected layers to obtain the prediction labels of all stations in the verification training sample set;
In this embodiment, the feature maps corresponding to the verification training sample images are input into the M fully-connected layers of the RetinaNet model. For example, when the labels comprise image categories (the stations fall into two cases: a person at the station, and no person at the station), the RetinaNet model can be used to classify the verification training sample images, obtaining their prediction categories, namely the prediction labels.
In one embodiment, after the verification training sample images are preprocessed, performing label prediction on them with the RetinaNet model may comprise: performing label prediction on the preprocessed target training sample images with the RetinaNet model. The RetinaNet model may include an output layer comprising a plurality of output functions, each output function being configured to output the prediction result for a corresponding label, such as the prediction label for a category and the prediction probability corresponding to it. For example, the output layer may include m output functions, such as Sigmoid functions, where m is the number of labels of the multi-label image training sample set (when the labels are categories, m is the number of categories), m being a positive integer. The output of each Sigmoid function is the probability value, i.e., the predicted probability, that a given training sample image belongs to a certain label, e.g., an object class.
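A minimal sketch of an output layer built from m Sigmoid output functions, one per label, as described above (here m = 2, with hypothetical logits):

    import numpy as np

    def sigmoid(z: np.ndarray) -> np.ndarray:
        return 1.0 / (1.0 + np.exp(-z))

    # One output function per label; each gives the independent probability
    # that the sample carries that label.
    logits = np.array([-1.2, 0.7])     # label 0: no person, label 1: person
    probs = sigmoid(logits)            # e.g. [0.23, 0.67]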
The RetinaNet model may be based on a deep-learning network such as a convolutional neural network, for example a ResNet (Residual Neural Network) model, which greatly improves accuracy. In one embodiment, in the original residual network structure, the first convolutional layer in the convolution branch has a 1 × 1 kernel with stride 2, and the second convolutional layer has a 3 × 3 kernel with stride 1; hence, when the first convolutional layer performs its convolution, one feature point is skipped between two convolution positions, which loses information from the feature map, so the residual network can be structurally improved.
405. Calculating, according to a cross entropy loss function, the first loss values between the prediction labels and the first label and second label in the verification training sample set, respectively, and determining the iteration count Y of the target RetinaNet model based on the first loss values;
in this embodiment, the point at which the cross entropy loss function corresponding to a sample label is obtained is not limited by the step numbering, and may be placed at any suitable point in the model training process according to actual requirements; for example, after the training sample images are selected, the cross entropy loss function corresponding to the station classification of the training sample images in the training sample set may be obtained. A positive label is a label identical to the sample label of a training sample image: for example, when the label is a category j, the positive label is the same category as the category j of the training sample image. A negative label is a label different from the sample label of a training sample image: for example, when the label is a category j, the negative label is any category other than the category j of the training sample image.
In the embodiment of the invention, the cross entropy loss function may comprise a positive-label loss and a negative-label loss, and both can be obtained from the label prediction probability of a training sample image and its sample label. For each sample label of a training sample image, such as the i-th training sample image Xi, the embodiment of the present invention may use the cross entropy function corresponding to that sample label for convergence.
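For a single sample label this positive/negative decomposition is the standard binary cross entropy; written out as a reconstruction from the description above, with $p_i$ the label prediction probability of the i-th training sample image Xi and $y_i \in \{0, 1\}$ its sample label:

$$\mathcal{L}_i = -\left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right]$$

where the first term is the positive-label loss (active when the sample label matches) and the second term is the negative-label loss.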
406. Performing iterative training on the target RetinaNet model according to the iteration count and the second training sample set to obtain Y second loss values;
in this embodiment, the cross entropy loss function corresponding to each sample label of the training sample images may be obtained, and the prediction labels and sample labels of the training sample images are then converged on the basis of that cross entropy loss function, so as to train the model parameters and obtain a trained deep neural network model. Specifically, in one embodiment, the cross entropy loss between the prediction label and the sample label of a training sample image is obtained according to the cross entropy loss function, and the model parameters of the RetinaNet model are trained according to that cross entropy loss.
407. Judging whether the first loss value and the Y second loss values are in a convergence relationship; if so, outputting the iteratively trained target RetinaNet model as the off-Shift recognition model;
in this embodiment, the model may be trained using a back propagation algorithm together with stochastic gradient descent with momentum. For example, the descending gradient of the cross entropy loss between the prediction label and the sample label of a training sample image (obtained by differentiating the loss function) can be computed according to the cross entropy loss function, and the model parameters of the deep neural network are then trained based on that gradient; specifically, each model parameter may be updated based on the descending gradient of the cross entropy loss and the learning rate corresponding to that model parameter (i.e. the learning rate of the layer in which the parameter resides).
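A minimal sketch of one such training step follows, using back propagation plus momentum SGD with per-layer learning rates expressed as parameter groups; the tiny stand-in model and all hyperparameter values are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Stand-in model; each layer gets its own learning rate via a separate
# parameter group, matching "the learning rate corresponding to the
# layer where the model parameters are located".
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(
    [{"params": model[0].parameters(), "lr": 1e-2},
     {"params": model[2].parameters(), "lr": 1e-3}],
    momentum=0.9)

x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
loss = nn.CrossEntropyLoss()(model(x), y)  # cross entropy loss
loss.backward()                            # back propagation: descending gradient
optimizer.step()                           # momentum SGD parameter update
optimizer.zero_grad()
```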
408. Acquiring at least two frames of office scene images in the monitoring environment, and inputting the at least two frames of office scene images into the off-Shift recognition model for recognition to obtain the off-Shift monitoring result of each station in the monitoring environment.
Steps 401, 402 and 408 in this embodiment are similar to steps 101, 102 and 104 in the first embodiment and are not described here again.
In the embodiment of the invention, the corresponding labeled image is obtained by acquiring the video stream and labeling the station information in each frame of video image in the video stream; a training sample set is generated according to the labeled image and input into a RetinaNet model for classification training to obtain an off-Shift recognition model; a plurality of frames of office scene images under the monitoring environment are acquired, input into the off-Shift recognition model for recognition, and the off-Shift recognition result of each station under the corresponding monitoring environment is output. The off-Shift recognition model based on video images obtained through training is used for recognizing and monitoring the stations in the office scene, so that whether the staff are off duty is automatically monitored and the monitoring efficiency is improved.
Referring to fig. 5, a fifth embodiment of the off Shift identification method based on video images according to the embodiment of the present invention includes:
501. acquiring a video stream, and extracting a video image in the video stream;
502. identifying station information in the video image, and labeling the station information to obtain a labeled image;
503. generating a model training sample set according to the labeled image, and inputting the model training sample set into a preset target RetinaNet model for off-Shift recognition training to obtain an off-Shift recognition model;
504. capturing in real time at least two frames of office scene images of a distribution center, and inputting the at least two frames of office scene images into the off-Shift recognition model;
in this embodiment, the off-Shift recognition model is obtained by pre-training. The off-Shift recognition model is mainly used to convert the station information of the stations to be recognized in a station image into a form that a machine can recognize.
505. Obtaining the area range of each station in at least two frames of office scene images through the off-Shift recognition model;
in this embodiment, the office scene image contains stations, and each station carries information such as whether a staff member is working at that station.
After receiving the office scene image, the off-Shift recognition model recognizes the station information of the stations to be recognized in the office scene image and converts the station information into a form that the server can recognize. The server then determines the area range of each station in the office scene image.
506. According to the area range of each station in at least two frames of office scene images, respectively extracting a second station image corresponding to each station from the at least two frames of office scene images;
in this embodiment, according to the area range of each station in the office scene image, the region covered by each station is cropped out of the office scene image, thereby extracting the second station image corresponding to each station.
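A minimal sketch of this cropping step is given below, under the assumption that the model returns each station's area range as a pixel bounding box (x1, y1, x2, y2); the function and variable names are illustrative.

```python
from typing import Dict, Tuple
import numpy as np

def extract_station_images(scene_image: np.ndarray,
                           station_boxes: Dict[str, Tuple[int, int, int, int]]
                           ) -> Dict[str, np.ndarray]:
    """Crop the second station image for each station out of an office
    scene frame, given each station's area range as an (x1, y1, x2, y2)
    bounding box in pixel coordinates."""
    crops = {}
    for station_id, (x1, y1, x2, y2) in station_boxes.items():
        crops[station_id] = scene_image[y1:y2, x1:x2].copy()
    return crops

# Example with made-up coordinates in a 720p frame:
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
boxes = {"station_1": (100, 200, 400, 600), "station_2": (500, 200, 800, 600)}
station_images = extract_station_images(frame, boxes)
```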
507. And inputting the second station image into the off-Shift recognition model, carrying out face recognition on the second station image through the off-Shift recognition model, and outputting a corresponding off-Shift recognition result.
In this embodiment, after the station image is extracted, the station image is input into a further model, such as a face recognition model, to obtain the corresponding face information.
In this embodiment, a face recognition model capable of recognizing a face at a workstation is preferred. Through the face recognition model, it can be determined whether a face is detected at the station; if a face is detected, the server determines that the staff member at that station is not off duty. The server then issues an instruction and sends the recognition result to the monitoring and recognition center.
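The decision logic can be sketched as follows; `detect_faces` is a hypothetical placeholder for whatever face recognition model is used, and the result strings are illustrative.

```python
def off_shift_results(station_images, detect_faces):
    """Map each station to an off-Shift result: a station whose crop
    contains at least one detected face is 'on duty', otherwise it is
    reported as 'off-Shift'. `detect_faces(image)` is a placeholder for
    the face recognition model and must return a list of detections."""
    results = {}
    for station_id, image in station_images.items():
        faces = detect_faces(image)
        results[station_id] = "on duty" if faces else "off-Shift"
    return results

# Usage with a dummy detector that never finds a face:
demo_crops = {"station_1": None, "station_2": None}
print(off_shift_results(demo_crops, detect_faces=lambda img: []))
# {'station_1': 'off-Shift', 'station_2': 'off-Shift'}
```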
Steps 501-503 in this embodiment are similar to steps 101-103 in the first embodiment and are not described here again.
In the embodiment of the invention, the corresponding labeled image is obtained by acquiring the video stream and labeling the station information in each frame of video image in the video stream; a training sample set is generated according to the labeled image and input into a RetinaNet model for classification training to obtain an off-Shift recognition model; a plurality of frames of office scene images under the monitoring environment are acquired, input into the off-Shift recognition model for recognition, and the off-Shift recognition result of each station under the corresponding monitoring environment is output. The off-Shift recognition model based on video images obtained through training is used for recognizing and monitoring the stations in the office scene, so that whether the staff are off duty is automatically monitored and the monitoring efficiency is improved.
The video-image-based off-Shift identification method in the embodiment of the present invention is described above; with reference to fig. 6, the video-image-based off-Shift identification apparatus in the embodiment of the present invention is described below, where a first embodiment of the off-Shift identification apparatus based on video images includes:
the acquiring module 601 is configured to acquire a video stream and extract video images in the video stream, where the video stream includes at least two frames of video images;
the annotation module 602 is configured to identify station information in the video image, and annotate the station information to obtain an annotated image, where the annotation at least includes a first annotation indicating that no person is present and a second annotation indicating that a person is present;
the training module 603 is configured to generate a model training sample set according to the labeled image, and input the model training sample set into a preset target RetinaNet model for off-Shift recognition training to obtain an off-Shift recognition model;
the identifying module 604 is configured to obtain at least two frames of office scene images in the monitoring environment, input the at least two frames of office scene images into the off-Shift identifying model for identification, and obtain an off-Shift monitoring result of each station in the monitoring environment, where the at least two frames of office scene images are station images with continuous time.
In the embodiment of the invention, the corresponding labeled image is obtained by acquiring the video stream and labeling the station information in each frame of video image in the video stream; a training sample set is generated according to the labeled image and input into a RetinaNet model for classification training to obtain an off-Shift recognition model; a plurality of frames of office scene images under the monitoring environment are acquired, input into the off-Shift recognition model for recognition, and the off-Shift recognition result of each station under the corresponding monitoring environment is output. The off-Shift recognition model based on video images obtained through training is used for recognizing and monitoring the stations in the office scene, so that whether the staff are off duty is automatically monitored and the monitoring efficiency is improved.
Referring to fig. 7, a second embodiment of the off Shift identification apparatus based on video image according to the embodiment of the present invention specifically includes:
the acquiring module 601 is configured to acquire a video stream and extract video images in the video stream, where the video stream includes at least two frames of video images;
the annotation module 602 is configured to identify station information in the video image, and annotate the station information to obtain an annotated image, where the annotation at least includes a first annotation indicating that no person is present and a second annotation indicating that a person is present;
the training module 603 is configured to generate a model training sample set according to the labeled image, and input the model training sample set into a preset target RetinaNet model for off-Shift recognition training to obtain an off-Shift recognition model;
the identifying module 604 is configured to obtain at least two frames of office scene images in the monitoring environment, input the at least two frames of office scene images into the off-Shift identifying model for identification, and obtain an off-Shift monitoring result of each station in the monitoring environment, where the at least two frames of office scene images are station images with continuous time.
In this embodiment, the labeling module 602 is specifically configured to:
acquiring an annotation request for annotating the video image;
identifying a station in the video image to obtain an area range of the station;
extracting a corresponding first workstation image from the video image based on the area range of the workstation;
performing feature extraction on the first station image to obtain image features of the first station image, wherein the image features comprise geometric features, texture features and semantic features;
determining image information of the first station image according to the image characteristics;
and marking the first station image according to the image information to obtain a marked image after marking.
In this embodiment, the identifying module 604 is specifically configured to:
acquiring at least two frames of office scene images of a distribution center captured in real time, and inputting the at least two frames of office scene images into the off-Shift recognition model;
obtaining the area range of each station in the at least two frames of office scene images through the off-Shift identification model;
according to the area range of each station in the at least two frames of office scene images, respectively extracting a second station image corresponding to each station from the at least two frames of office scene images;
and inputting the second station image into the off-Shift recognition model, respectively carrying out face recognition on the second station image through the off-Shift recognition model, and outputting a corresponding off-Shift recognition result.
In this embodiment, the off Shift identification apparatus based on video image further includes:
a building module 605, configured to build a RetinaNet network framework, and input the first training sample set into the RetinaNet network framework for training to obtain a RetinaNet model;
a calculating module 606, configured to extract a first label and a second label in the first training sample set, and calculate a cross entropy loss function of the first label and the second label;
and the updating module 607 is configured to calculate a weight attenuation coefficient of the RetinaNet model through the cross entropy loss function and a preset gradient descent algorithm, and update the parameters of the RetinaNet model through the weight attenuation coefficient to obtain an optimized target RetinaNet model.
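Under the usual reading of weight decay combined with gradient descent (an interpretation, since the update rule is not spelled out here), the update performed by this module can be written as

$$\theta \leftarrow \theta - \eta\left(\nabla_{\theta}\mathcal{L} + \lambda\,\theta\right)$$

where $\theta$ are the RetinaNet parameters, $\eta$ the learning rate, $\lambda$ the weight attenuation (decay) coefficient, and $\mathcal{L}$ the cross entropy loss.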
In this embodiment, the training module 603 includes:
a feature extraction unit 6031, configured to sequentially input the labeled images in the verification training sample set into the N convolutional layers, respectively extract station features corresponding to the first label and the second label of the labeled images in the verification training sample set through the N convolutional layers, and combine the station features to obtain a station feature map;
a prediction unit 6032, configured to input the station feature maps into the M fully-connected layers, and perform station prediction on the station feature maps through the M fully-connected layers to obtain the prediction labels of the stations in the verification training sample set;
a calculating unit 6033, configured to calculate, according to the cross entropy loss function, the first loss values between the prediction labels and the first label and second label in the verification training sample set, respectively, and determine the iteration count Y of the target RetinaNet model based on the first loss values;
a training unit 6034, configured to perform iterative training on the target RetinaNet model according to the iteration count and the second training sample set to obtain Y second loss values;
a determining unit 6035, configured to determine whether the first loss value and the Y second loss values are in a convergence relationship, and if so, output the iteratively trained target RetinaNet model as the off-Shift recognition model.
In this embodiment, the building module 605 is specifically configured to:
constructing a RetinaNet network framework and configuring a loss function, an initial learning rate and an iteration count for the RetinaNet network framework;
carrying out format conversion on the video images and the corresponding data set, and inputting the format-converted data set into the RetinaNet network framework for training;
and carrying out iterative correction on the RetinaNet network framework based on the loss function, the initial learning rate and the iteration count of the RetinaNet network framework to obtain a RetinaNet model.
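A sketch of what such a configuration might look like is given below; every concrete value (loss choice, learning rate, iteration count) is an assumption for illustration, not a figure from the patent.

```python
# Illustrative configuration of the RetinaNet network framework.
retinanet_config = {
    "loss_function": "cross_entropy",  # per-label binary cross entropy
    "initial_learning_rate": 1e-3,     # starting step size
    "iterations": 10000,               # number of iterative corrections
}

def iterative_correction(train_step, config):
    """Run the configured number of training iterations; `train_step`
    stands in for one forward/backward pass plus parameter update."""
    for step in range(config["iterations"]):
        train_step(step, config["initial_learning_rate"])
```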
In the embodiment of the invention, the corresponding labeled image is obtained by acquiring the video stream and labeling the station information in each frame of video image in the video stream; a training sample set is generated according to the labeled image and input into a RetinaNet model for classification training to obtain an off-Shift recognition model; a plurality of frames of office scene images under the monitoring environment are acquired, input into the off-Shift recognition model for recognition, and the off-Shift recognition result of each station under the corresponding monitoring environment is output. The off-Shift recognition model based on video images obtained through training is used for recognizing and monitoring the stations in the office scene, so that whether the staff are off duty is automatically monitored and the monitoring efficiency is improved.
Fig. 6 and 7 describe the video-image-based off-Shift identification apparatus in the embodiment of the present invention in detail from the perspective of modular functional entities; the same apparatus is described below in detail from the perspective of hardware processing.
Fig. 8 is a schematic structural diagram of a video-image-based off-Shift identification apparatus 800, which may vary considerably in configuration or performance and may include one or more processors (CPUs) 810, a memory 820, and one or more storage media 830 (e.g., one or more mass storage devices) storing applications 833 or data 832. The memory 820 and the storage medium 830 may provide transient or persistent storage. The program stored on the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations for the apparatus 800. Further, the processor 810 may be configured to communicate with the storage medium 830 and execute the series of instruction operations in the storage medium 830 on the apparatus 800, thereby implementing the steps of the video-image-based off-Shift identification method provided by the above method embodiments.
The video-image-based off-Shift identification apparatus 800 may also include one or more power supplies 840, one or more wired or wireless network interfaces 850, one or more input/output interfaces 860, and/or one or more operating systems 831, such as Windows Server, Mac OS X, Unix, Linux or FreeBSD. Those skilled in the art will appreciate that the structure illustrated in fig. 8 does not limit the video-image-based off-Shift identification apparatus provided herein, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and may also be a volatile computer-readable storage medium, where instructions are stored, and when the instructions are executed on a computer, the instructions cause the computer to execute the steps of the off Shift identification method based on video images.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An off Shift identification method based on video image, characterized in that the off Shift identification method based on video image comprises:
acquiring a video stream, and extracting video images in the video stream, wherein the video stream comprises at least two frames of video images;
identifying station information in the video image, and labeling the station information to obtain a labeled image, wherein the label at least comprises a first label indicating no person and a second label indicating a person;
generating a model training sample set according to the labeled image, inputting the model training sample set into a preset target RetinaNet model for off-Shift recognition training to obtain an off-Shift recognition model;
acquiring at least two frames of office scene images in a monitoring environment, inputting the at least two frames of office scene images into the off-Shift recognition model for recognition, and obtaining an off-Shift monitoring result of each station in the monitoring environment, wherein the at least two frames of office scene images are station images with continuous time.
2. The off-Shift identification method based on video images according to claim 1, wherein the identifying the station information in the video images and labeling the station information to obtain a labeled image comprises:
acquiring an annotation request for annotating the video image;
identifying a station in the video image to obtain an area range of the station;
extracting a corresponding first workstation image from the video image based on the area range of the workstation;
performing feature extraction on the first station image to obtain image features of the first station image, wherein the image features comprise geometric features, texture features and semantic features;
determining image information of the first station image according to the image characteristics;
and marking the first station image according to the image information to obtain a marked image after marking.
3. The off-Shift identification method based on video images according to claim 2, wherein the obtaining at least two frames of office scene images in the monitoring environment and inputting the at least two frames of office scene images into the off-Shift identification model for identification to obtain the off-Shift monitoring result of each station in the monitoring environment comprises:
acquiring at least two frames of office scene images of a distribution center captured in real time, and inputting the at least two frames of office scene images into the off-Shift recognition model;
obtaining the area range of each station in the at least two frames of office scene images through the off-Shift identification model;
according to the area range of each station in the at least two frames of office scene images, respectively extracting a second station image corresponding to each station from the at least two frames of office scene images;
and inputting the second station image into the off-Shift recognition model, respectively carrying out face recognition on the second station image through the off-Shift recognition model, and outputting a corresponding off-Shift recognition result.
4. The off-Shift recognition method based on video images of claim 1, wherein the model training sample set comprises a first training sample set, and before the inputting of the model training sample set into a preset target RetinaNet model for off-Shift recognition training to obtain an off-Shift recognition model, the method further comprises:
building a RetinaNet network framework, and inputting the first training sample set into the RetinaNet network framework for training to obtain a RetinaNet model;
extracting a first label and a second label in the first training sample set, and calculating a cross entropy loss function of the first label and the second label;
and calculating a weight attenuation coefficient of the RetinaNet model through the cross entropy loss function and a preset gradient descent algorithm, and updating parameters of the RetinaNet model through the weight attenuation coefficient to obtain an optimized target RetinaNet model.
5. The video image-based off-Shift recognition method according to claim 4, wherein the model training sample set further includes a second training sample set and a verification training sample set, the target RetinaNet model includes N convolutional layers and M fully-connected layers, and the inputting the model training sample set into a preset target RetinaNet model for off-Shift recognition training to obtain an off-Shift recognition model includes:
sequentially inputting the labeled images in the verification training sample set into the N convolutional layers, respectively extracting station features corresponding to the first label and the second label of the labeled images in the verification training sample set through the N convolutional layers, and combining the station features to obtain a station feature map;
inputting the station feature maps into the M fully-connected layers, and performing station prediction on the station feature maps through the M fully-connected layers to obtain prediction labels of stations in the verification training sample set;
calculating first loss values between the prediction labels and the first label and second label in the verification training sample set respectively according to the cross entropy loss function, and determining the iteration count Y of the target RetinaNet model based on the first loss values;
performing iterative training on the target RetinaNet model according to the iteration count and the second training sample set to obtain Y second loss values;
judging whether the first loss value and the Y second loss values are in a convergence relationship or not;
and if so, outputting the iteratively trained target RetinaNet model as the off-Shift recognition model.
6. The off-Shift recognition method based on video images according to claim 4 or 5, wherein the building of a RetinaNet network framework and the inputting of the first training sample set into the RetinaNet network framework for training to obtain a RetinaNet model comprises:
constructing a RetinaNet network framework and configuring a loss function, an initial learning rate and an iteration count for the RetinaNet network framework;
carrying out format conversion on the video images and the corresponding data set, and inputting the format-converted data set into the RetinaNet network framework for training;
and carrying out iterative correction on the RetinaNet network framework based on the loss function, the initial learning rate and the iteration count of the RetinaNet network framework to obtain a RetinaNet model.
7. An off Shift identification apparatus based on a video image, comprising:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a video stream and extracting video images in the video stream, and the video stream comprises at least two frames of video images;
the annotation module is used for identifying station information in the video image and annotating the station information to obtain an annotated image, wherein the annotation at least comprises a first annotation indicating no person and a second annotation indicating a person;
the training module is used for generating a model training sample set according to the labeled image, inputting the model training sample set into a preset target RetinaNet model for off-Shift recognition training, and obtaining an off-Shift recognition model;
the identification module is used for acquiring at least two frames of office scene images in a monitoring environment, inputting the at least two frames of office scene images into the off-Shift identification model for identification, and obtaining an off-Shift monitoring result of each station in the monitoring environment, wherein the at least two frames of office scene images are station images with continuous time.
8. The video image-based off Shift identification apparatus according to claim 7, wherein said video image-based off Shift identification apparatus further comprises:
the building module is used for building a RetinaNet network framework and inputting the first training sample set into the RetinaNet network framework for training to obtain a RetinaNet model;
the calculation module is used for extracting a first label and a second label in the first training sample set and calculating a cross entropy loss function of the first label and the second label;
and the updating module is used for calculating the weight attenuation coefficient of the RetinaNet model through the cross entropy loss function and a preset gradient descent algorithm, and updating the parameters of the RetinaNet model through the weight attenuation coefficient to obtain the optimized target RetinaNet model.
9. An off Shift identification apparatus based on a video image, characterized in that the apparatus comprises: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line;
the at least one processor invoking the instructions in the memory to cause the video image based off Shift identification apparatus to perform the steps of the video image based off Shift identification method of any of claims 1-6.
10. A computer-readable storage medium, having a computer program stored thereon, wherein the computer program, when being executed by a processor, carries out the steps of the video image based off Shift identification method according to any one of claims 1 to 6.
CN202110267158.5A 2021-03-11 2021-03-11 off-Shift identification method, device, equipment and storage medium based on video image Pending CN113312957A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110267158.5A CN113312957A (en) 2021-03-11 2021-03-11 off-Shift identification method, device, equipment and storage medium based on video image


Publications (1)

Publication Number Publication Date
CN113312957A true CN113312957A (en) 2021-08-27

Family

ID=77371845



Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743332A (en) * 2021-09-08 2021-12-03 中国科学院自动化研究所 Image quality evaluation method and system based on universal vision pre-training model
CN114283492A (en) * 2021-10-28 2022-04-05 平安银行股份有限公司 Employee behavior-based work saturation analysis method, device, equipment and medium
CN114283492B (en) * 2021-10-28 2024-04-26 平安银行股份有限公司 Staff behavior-based work saturation analysis method, device, equipment and medium
CN114220049A (en) * 2021-12-03 2022-03-22 深圳市震有智联科技有限公司 Video data processing method, device and storage medium
CN114299429A (en) * 2021-12-24 2022-04-08 宁夏广天夏电子科技有限公司 Human body recognition method, system and device based on deep learning
CN116055690A (en) * 2023-04-03 2023-05-02 山东金宇信息科技集团有限公司 Method and equipment for processing machine room monitoring video
CN116055690B (en) * 2023-04-03 2023-06-09 山东金宇信息科技集团有限公司 Method and equipment for processing machine room monitoring video
CN117253196A (en) * 2023-11-17 2023-12-19 本溪钢铁(集团)信息自动化有限责任公司 Video-based security risk monitoring method and device in steel industry
CN117253196B (en) * 2023-11-17 2024-02-02 本溪钢铁(集团)信息自动化有限责任公司 Video-based security risk monitoring method and device in steel industry

Similar Documents

Publication Publication Date Title
CN113312957A (en) off-Shift identification method, device, equipment and storage medium based on video image
CN110532970B (en) Age and gender attribute analysis method, system, equipment and medium for 2D images of human faces
CN111898547A (en) Training method, device and equipment of face recognition model and storage medium
CN111012261A (en) Sweeping method and system based on scene recognition, sweeping equipment and storage medium
CN111179249A (en) Power equipment detection method and device based on deep convolutional neural network
CN111813997B (en) Intrusion analysis method, device, equipment and storage medium
CN107679475B (en) Store monitoring and evaluating method and device and storage medium
CN111881958A (en) License plate classification recognition method, device, equipment and storage medium
CN115331172A (en) Workshop dangerous behavior recognition alarm method and system based on monitoring video
CN114550053A (en) Traffic accident responsibility determination method, device, computer equipment and storage medium
CN115049966A (en) GhostNet-based lightweight YOLO pet identification method
CN113888514A (en) Method and device for detecting defects of ground wire, edge computing equipment and storage medium
CN114241338A (en) Building measuring method, device, equipment and storage medium based on image recognition
CN109740527B (en) Image processing method in video frame
CN114612897A (en) Intelligent fruit and vegetable weighing and ticketing method and device, electronic equipment and storage medium
CN113192017A (en) Package defect identification method, device, equipment and storage medium
CN112668675B (en) Image processing method and device, computer equipment and storage medium
CN117275086A (en) Gesture recognition method, gesture recognition device, computer equipment and storage medium
CN112004063A (en) Method for monitoring connection correctness of oil discharge pipe based on multi-camera linkage
CN116524296A (en) Training method and device of equipment defect detection model and equipment defect detection method
CN113642530A (en) Intelligent medical management system based on deep neural network
CN113837173A (en) Target object detection method and device, computer equipment and storage medium
CN112016515A (en) File cabinet vacancy detection method and device
CN112183366A (en) High-voltage power line bird nest detection method, system and machine readable medium
CN111859370A (en) Method, apparatus, electronic device and computer-readable storage medium for identifying service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination