CN114943452A - Workflow detection method based on dual-stream structure enhanced detector - Google Patents

Workflow detection method based on dual-stream structure enhanced detector

Info

Publication number
CN114943452A
CN114943452A (application CN202210574109.0A)
Authority
CN
China
Prior art keywords
detector
convolution
layer
anchor frame
anchor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210574109.0A
Other languages
Chinese (zh)
Inventor
Hu Haiyang
Zhang Min
Li Zhongjin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University Shangyu Science and Engineering Research Institute Co Ltd
Original Assignee
Hangzhou Dianzi University Shangyu Science and Engineering Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University Shangyu Science and Engineering Research Institute Co Ltd filed Critical Hangzhou Dianzi University Shangyu Science and Engineering Research Institute Co Ltd
Priority to CN202210574109.0A priority Critical patent/CN114943452A/en
Publication of CN114943452A publication Critical patent/CN114943452A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00: Administration; Management
    • G06Q 10/06: Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q 10/063: Operations research, analysis or management
    • G06Q 10/0633: Workflow analysis
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Educational Administration (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a workflow detection method based on a dual-stream structure enhanced detector. The invention combines a characterization detector and a motion detector in a dual-stream configuration to predict the moving targets appearing on a frame. Meanwhile, an anchor frame improvement submodule with a feature alignment property is introduced into the detector, and an adaptive anchor frame cube is output according to the candidate frames detected on each frame. In order to improve the capture of moving targets, a hierarchical aggregation strategy is applied in the detector to improve the discriminability of the model's intermediate feature maps. In addition, layer regularization is used to reduce internal covariate shift between the detector's internal layers, making the whole training process more efficient and stable. Finally, based on the extracted salient features, a spatio-temporal action tube generation branch completes the classification and localization regression of production operation behaviors. The invention can be deployed in a factory production scene to detect the whole production operation process in real time, realizing detection of workers' production operation behaviors.

Description

Workflow detection method based on dual-stream structure enhanced detector
Technical Field
The invention belongs to the technical field of workflow detection, and particularly relates to a workflow detection method based on a dual-stream structure enhanced detector.
Background
At present, the main measures for intelligent factory retrofitting include installing digital CPS systems, laying out sensor equipment, and erecting multi-view production monitoring cameras to monitor and record the whole production process in real time. However, monitoring operators focus mainly on production safety management rather than on analysis and improvement of the production process and, limited by human energy and physical factors, cannot observe and supervise the entire production flow. In addition, most manufacturing enterprises focus on profit and do not employ professional personnel to effectively analyze industrial big data such as CPS (cyber-physical system) records, sensor data and production logs, so massive industrial big data remain insufficiently mined and used.
The use of computer vision technology in production has received increasing attention from researchers over the past decade. Although many advanced detection models have been developed in laboratories, complex manufacturing environments involve factors such as fast motion, short action duration, occlusion and viewpoint change. Behavior recognition in factory production environments therefore still leaves a large space for exploration. Most detection models perform detection at the frame level or video segment level, and the frame-level or segment-level candidate boxes are then tracked or linked along the time sequence to form a spatio-temporal action tube. Although these object-detection-based methods achieve some effect, they do not fully utilize the motion information between consecutive frames but process video frames as a series of independent RGB pictures, and detection errors often occur when they are applied to an industrial production environment.
Disclosure of Invention
The invention aims to solve the problem that improving the detector itself is often neglected in existing workflow recognition models, and provides a workflow detection method based on a dual-stream structure enhanced detector to realize accurate detection of operation behaviors.
The invention adopts the following specific technical scheme:
A workflow detection method based on a dual-stream structure enhanced detector comprises the following steps:
S1, aiming at each workflow video in the workflow video data set, dividing the workflow video into groups of K frames to form a series of video segments; for each video segment, computing an optical flow image using the TVL1 algorithm as a first input image, and sampling one video frame from the video segment as a second input image; forming a training sample from a group consisting of the first input image, the second input image, and the positions and class labels of manually annotated targets in the images; forming a training data set from a series of training samples;
S2, training a dual-stream structure enhanced detector by using consecutive training samples in the training data set, so that the detector can detect target positions and classes from images;
the dual-stream structure enhanced detector comprises two parallel detectors, namely a motion detector and a characterization detector, as well as a classification layer and a regression layer;
the motion detector and the characterization detector have the same network structure, each comprising an anchor frame improvement submodule and a backbone network consisting of a feature extraction network, a first convergence layer, a second convergence layer and a feature stacking layer, but their detector inputs differ: the detector input of the motion detector is the first input image, and the detector input of the characterization detector is the second input image;
the anchor frame improvement submodule is used for generating corrected anchor frames and assisting the training of the dual-stream structure enhanced detector;
in the backbone network, the feature extraction network is composed of 11 cascaded convolution blocks, wherein the first 5 convolution blocks are the first 5 convolution blocks of the VGG-16 model and the last 6 convolution blocks are new convolution blocks; the 6th convolution block contains only one convolution layer, its output feature map size is 38 × 38 × 512, the activation function is ReLU, and the number of anchor frames per pixel position is set to 4; the 7th convolution block contains only one convolution layer, its output feature map size is 19 × 19 × 1024, the activation function is ReLU, and the number of anchor frames per pixel position is set to 6; the 8th convolution block contains two convolution layers, its output feature map size is 10 × 10 × 512, the activation function is ReLU, and the number of anchor frames per pixel position is set to 6; the 9th convolution block contains two convolution layers, its output feature map size is 5 × 5 × 256, the activation function is ReLU, and the number of anchor frames per pixel position is set to 6; the 10th convolution block contains two convolution layers, its output feature map size is 3 × 3 × 256, the activation function is ReLU, and the number of anchor frames per pixel position is set to 4; the 11th convolution block contains two convolution layers, its output feature map size is 1 × 1 × 256, the activation function is ReLU, and the number of anchor frames per pixel position is set to 4; in the first convergence layer, the convolution features of the 3rd layer in the 4th convolution block, of the 3rd layer in the 5th convolution block, and of the 7th convolution block are extracted from the feature extraction network, each subjected to layer regularization transformation and then aggregated, and finally a full-connection operation forms the low-level edge features serving as the output features of the first convergence layer; in the second convergence layer, the convolution features of the 7th convolution block and of the 2nd layer in the 8th convolution block are extracted from the feature extraction network, each subjected to layer regularization transformation and then aggregated, and finally a full-connection operation forms the high-level semantic features serving as the output features of the second convergence layer; in the feature stacking layer, the output features of the first convergence layer, the second convergence layer, and the 8th, 9th, 10th and 11th convolution blocks of the feature extraction network are stacked to form the stacked features serving as the output of the motion detector or characterization detector to which the backbone network belongs;
the features output by the motion detector and the features output by the characterization detector are concatenated and then subjected to layer regularization transformation; the transformed features are input into the classification layer and the regression layer respectively, the classification layer outputting the class label of the anchor frame and the regression layer outputting the position coordinates of the anchor frame;
S3, extracting a workflow video to be detected containing a complete workflow from a production operation video, dividing the workflow video to be detected into groups of K frames to form a series of video segments, computing an optical flow image for each video segment using the TVL1 algorithm as a first input image, and sampling one video frame from the video segment as a second input image; inputting the first input image and the second input image into the trained dual-stream structure enhanced detector, and outputting the anchor frame class label and the position coordinates of the detection target through the classification layer and the regression layer respectively, forming the detection result of the video segment;
and S4, based on the detection results of all video segments in the workflow video to be detected, generating the classification and temporal localization regression results of the production operation behaviors in the workflow video to be detected by using an action tube generation algorithm.
Preferably, in the anchor frame improvement submodule, the feature map f_vgg output by the fifth convolution block of the feature extraction network is first obtained; then k = 9 anchor frames are set at each position point of the feature map f_vgg, and each anchor frame is convolved with a 3 × 3 window to form a 512-dimensional vector; the 512-dimensional vector is input into two parallel fully connected layers, one of which outputs a score judging whether the anchor frame belongs to the foreground or the background, and the other outputs the position coordinates of the anchor frame, thereby obtaining the corrected anchor frame.
Preferably, the size parameter p of the convolution kernel is 3.
Preferably, when the dual-stream structure enhanced detector is trained, the total loss function is set as:

$$L = \frac{1}{N}\left(L_{conf} + L_{reg}\right) + Loss_{ARS}$$

wherein: $N$ represents the total number of positive samples, and $i$ and $j$ are respectively the category index and the counting index in the training sample; $x_{ij}$ is a binary variable that is 1 when the anchor frame sample is a positive sample and 0 otherwise; $L_{conf}$ represents the classification loss function, formulated as:

$$L_{conf} = -\sum_{i \in \Phi} x_{ij}\, \log \hat{c}_i^{\,y} - \sum_{i \in \bar{\Phi}} \log \hat{c}_i^{\,0}$$

wherein: $\Phi$ and $\bar{\Phi}$ respectively represent the anchor frame cube positive sample set and negative sample set; $\hat{c}_i^{\,y}$ represents the confidence score that the predicted anchor frame cube belongs to label $y$; $\hat{c}_i^{\,0}$ represents the confidence score that the predicted anchor frame cube belongs to the background; $L_{reg}$ represents the regression loss function, formulated as:

$$L_{reg} = \frac{1}{T} \sum_{t=1}^{T} \sum_{q \in \{x, y, w, h\}} \mathrm{SmoothL1}\left(\hat{l}_t^{\,q} - g_t^{\,q}\right)$$

wherein: $T$ represents the total frame number of the workflow video, $(x, y)$ is the center of each anchor frame in the action micro-tube, and $w$ and $h$ are respectively the width and height of the anchor frame; $\hat{l}_t^{\,q}$ represents the $q$ coordinate of the anchor frame on frame $f_t$ after regression; $g_t^{\,q}$ represents the annotated ground-truth anchor frame coordinate; $Loss_{ARS}$, the loss function of the anchor frame improvement submodule, is:

$$Loss_{ARS} = \frac{1}{N_{cls}} \sum_{i} L_{cls}\left(p_i, p_i^{*}\right) + \lambda \frac{1}{N_{reg}} \sum_{i} p_i^{*}\, L_{reg}\left(o_i, o_i^{*}\right)$$

wherein: $i$ denotes the index of the anchor frame in each training sample batch; $p_i$ indicates the probability that anchor frame $i$ contains an action target; $p_i^{*}$ indicates the binary label, with $p_i^{*} = 1$ when the anchor frame is a positive sample and $p_i^{*} = 0$ otherwise; $o_i$ and $o_i^{*}$ are respectively the predicted coordinate offset of the anchor frame and the corresponding target offset; $\lambda$ is a balance coefficient between the two terms; $L_{cls}$ represents the cross entropy loss function; $L_{reg}$ represents the SmoothL1 loss function; $N_{cls}$ represents the total number of classification samples and $N_{reg}$ the total number of regression samples; a sample participates in position regression if and only if $p_i^{*} = 1$.
Preferably, 90% of the samples in the training data set are selected as the training set and the rest as the verification set; the dual-stream structure enhanced detector is trained and tested, and is continuously tuned and optimized according to the resulting metrics.
Preferably, the production operation video is sourced in real time from video monitoring equipment in a factory production scene, and the whole production operation process is detected in real time by the dual-stream structure enhanced detector, realizing detection of workers' production operation behaviors.
Preferably, the position coordinates of the anchor frame are expressed as (x, y, w, h), where (x, y) represents the coordinates of the upper left corner of the anchor frame, and (w, h) represents the width and height of the anchor frame.
Preferably, in the feature stack layer, 6 output features are stacked in the channel dimension to form a stacked feature.
Preferably, the anchor frame improvement submodule is enabled only in the model training phase and is skipped in the model inference phase.
Preferably, each of the video segments includes 8 frames.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a workflow detection method based on a double-flow structure enhanced detector, which combines a characterization detector and a motion detector in a double-flow structure mode to predict a motion target appearing on a frame. Meanwhile, in order to improve the preset 3D anchor frames with different scales, an anchor frame improvement submodule with a characteristic alignment characteristic is introduced into the detector, and an adaptive cubic anchor frame is output according to the candidate frames detected on each frame. In order to improve the capturing capability of the moving target, a hierarchical aggregation strategy is applied in the detector to improve the discrimination of the characteristic diagram of the middle layer of the model. In addition, the layer regularization is used for reducing the internal covariate offset phenomenon between the internal layers of the detector, so that the whole training process is more efficient and stable. And finally, based on the extracted significant features, generating branches by utilizing an air-time domain action pipeline to finish classification and positioning regression of production operation behaviors. The invention can be deployed in a factory production scene, detects the whole production operation process in real time and realizes the detection of the production operation behavior of workers.
Drawings
Fig. 1 is a flow chart of a workflow detection method based on a dual-stream structure enhanced detector.
Fig. 2 is a model structure diagram of a dual-stream structure enhanced detector.
Fig. 3 is a schematic diagram of an anchor frame improvement submodule.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. The technical characteristics in the embodiments of the present invention can be combined correspondingly without mutual conflict.
In the description of the present invention, it is to be understood that the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or to implicitly indicate the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature.
For convenience of description, some concept definitions and notation of the present invention are explained first:
conv4_3: the third convolution layer in the fourth convolution block of the detector;
conv5_3: the third convolution layer in the fifth convolution block of the detector;
conv7: the seventh convolution block of the detector;
conv8_2: the second convolution layer in the eighth convolution block of the detector;
Cat_l1, Cat_l2: the first and second convergence layers generated by aggregation, respectively;
$\{m_c^i\}_{i=1}^{n}$: the $n$ action micro-tubes of class $c$;
$ov(m_c^i, \hat{m}_c^j) > \tau$: the temporal overlap between action micro-tube $m_c^i$ and candidate micro-tube $\hat{m}_c^j$ is greater than $\tau$;
nms: the non-maximum suppression algorithm;
(x, y, w, h): a candidate box, i.e., an anchor box, where (x, y) is the top-left corner coordinate and (w, h) the width and height of the box. An anchor box is the bounding rectangle of an object in an image, and the anchor boxes on a set of consecutive frames form an anchor box cube.
The invention researches operation behavior detection technology for production flows in complex manufacturing environments. Since various operating machines, transport vehicles and other auxiliary equipment and tools are often present in a production workshop, they easily occlude each other. In addition, the operations of different procedures are similar to one another, while the same procedure performed by different workers can differ considerably. Factors such as illumination change and viewpoint change further make operation behavior recognition in a manufacturing environment challenging. On the basis of existing work, the invention proposes a dual-stream structure enhanced detector that studies and analyzes production actions in depth from several key problems such as CPS workflow recognition and analysis, action recognition and target detection, solving the problem of detecting workers' operation behaviors. The dual-stream structure enhanced detector combines a characterization detector and a motion detector in a dual-stream configuration to predict the moving targets present on a frame. Meanwhile, in order to improve the preset 3D anchor frames of different scales, an anchor frame improvement submodule with a feature alignment property is introduced into the detector, and an adaptive anchor frame cube is output according to the anchor frames detected on each frame. In order to improve the capture of moving targets, a hierarchical aggregation strategy is applied in the detector to improve the discriminability of the model's intermediate feature maps. In addition, layer regularization is used to reduce internal covariate shift between the detector's internal layers, making the whole training process more efficient and stable. Finally, based on the extracted salient features, a spatio-temporal action tube generation branch completes the classification and localization regression of production operation behaviors.
As shown in fig. 1, in a preferred embodiment of the present invention, a workflow detection method based on a dual-stream structure enhanced detector is provided, which includes:
S1, aiming at each workflow video in the workflow video data set, dividing the workflow video into groups of K frames to form a series of video segments; for each video segment, computing an optical flow image using the TVL1 algorithm as a first input image, and sampling one video frame from the video segment as a second input image; forming a training sample from a group consisting of the first input image, the second input image, and the positions and class labels of manually annotated targets in the images. A training data set is constructed from a series of training samples.
It should be noted that the number of video frames included in each video segment may be optimized according to practical application, and K is preferably 8.
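As an illustration of S1, the following is a minimal sketch of preparing the two detector inputs for one video segment, assuming opencv-contrib-python (which provides the TV-L1 optical flow implementation) and a list of K = 8 consecutive BGR frames; the function name and the choice of stacking the per-pair flow fields on the channel axis are illustrative assumptions.

```python
import cv2
import numpy as np

def make_segment_inputs(frames, sample_index=0):
    """Build the first (optical flow) and second (RGB frame) detector inputs."""
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()   # TVL1 algorithm
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    # One (H, W, 2) flow field per consecutive frame pair, stacked on channels.
    flows = [tvl1.calc(grays[t], grays[t + 1], None)
             for t in range(len(grays) - 1)]
    first_input = np.concatenate(flows, axis=-1)      # optical flow image
    second_input = frames[sample_index]               # one sampled video frame
    return first_input, second_input
```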
And S2, training the dual-stream structure enhanced detector by using consecutive training samples in the training data set, so that the dual-stream structure enhanced detector can detect target positions and classes from images.
As shown in fig. 2, the dual-stream structure enhanced detector includes two parallel detectors, namely a motion detector and a characterization detector, as well as a classification layer and a regression layer.
The motion detector and the characterization detector have the same network structure and both comprise an anchor frame improvement submodule and a backbone network, wherein the backbone network consists of a feature extraction network, a first convergence layer, a second convergence layer and a feature stacking layer. However, although the network structure is the same, the detector input of the motion detector is the first input image, and the detector input of the characterization detector is the second input image.
The backbone network structure is the same for the motion detector and the characterization detector. The specific implementation of the feature extraction network, the first convergence layer, the second convergence layer and the feature stacking layer of this shared backbone is described in detail below.
The feature extraction network is designed on the basis of the VGG-16 model: the first five convolution blocks of VGG-16 are retained, and all its fully connected layers are replaced by 6 new convolution blocks. The feature extraction network therefore consists of 11 cascaded convolution blocks, where the first 5 are the first 5 convolution blocks of the VGG-16 model itself and the last 6 are new. The detailed parameters of the 6 newly designed convolution blocks are as follows (a code sketch follows the list):
1) The 6th convolution block contains only one convolution layer; its output feature map size is 38 × 38 × 512, the activation function is ReLU, and the number of anchor frames per pixel position is set to 4;
2) the 7th convolution block contains only one convolution layer; its output feature map size is 19 × 19 × 1024, the activation function is ReLU, and the number of anchor frames per pixel position is set to 6;
3) the 8th convolution block contains two convolution layers; its output feature map size is 10 × 10 × 512, the activation function is ReLU, and the number of anchor frames per pixel position is set to 6;
4) the 9th convolution block contains two convolution layers; its output feature map size is 5 × 5 × 256, the activation function is ReLU, and the number of anchor frames per pixel position is set to 6;
5) the 10th convolution block contains two convolution layers; its output feature map size is 3 × 3 × 256, the activation function is ReLU, and the number of anchor frames per pixel position is set to 4;
6) the 11th convolution block contains two convolution layers; its output feature map size is 1 × 1 × 256, the activation function is ReLU, and the number of anchor frames per pixel position is set to 4.
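A minimal PyTorch sketch of this backbone is given below. The patent specifies the output feature map sizes and ReLU activations; the kernel sizes and strides are not stated, so the standard SSD extra-layer choices that produce the listed 10 × 10, 5 × 5, 3 × 3 and 1 × 1 maps are assumed, and blocks 6 and 7 are simplified to single convolutions.

```python
import torch.nn as nn
import torchvision

def two_layer_block(cin, cmid, cout, stride, pad):
    # Two-layer block (1x1 channel reduction + 3x3 conv), each followed by ReLU.
    return nn.Sequential(
        nn.Conv2d(cin, cmid, 1), nn.ReLU(inplace=True),
        nn.Conv2d(cmid, cout, 3, stride=stride, padding=pad), nn.ReLU(inplace=True))

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(weights=None).features
        self.vgg_blocks1_5 = vgg[:30]  # first 5 VGG-16 conv blocks (up to conv5_3)
        self.block6 = nn.Sequential(   # one conv layer, 512 output channels
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True))
        self.block7 = nn.Sequential(   # one conv layer, 1024 output channels
            nn.Conv2d(512, 1024, 3, padding=1), nn.ReLU(inplace=True))
        self.block8 = two_layer_block(1024, 256, 512, stride=2, pad=1)  # 19 -> 10
        self.block9 = two_layer_block(512, 128, 256, stride=2, pad=1)   # 10 -> 5
        self.block10 = two_layer_block(256, 128, 256, stride=1, pad=0)  # 5 -> 3
        self.block11 = two_layer_block(256, 128, 256, stride=1, pad=0)  # 3 -> 1

    def forward(self, x):
        feats = {}
        x = self.vgg_blocks1_5(x)
        for name in ('block6', 'block7', 'block8', 'block9', 'block10', 'block11'):
            x = getattr(self, name)(x)
            feats[name] = x            # keep each block output for later layers
        return feats
```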
In the first convergence layer (the Cat_l1 layer), the convolution features of the 3rd layer in the 4th convolution block, of the 3rd layer in the 5th convolution block, and of the 7th convolution block are extracted from the feature extraction network; the three extracted features are each subjected to layer regularization transformation and then aggregated, and finally a full-connection operation forms the low-level edge features that serve as the output features of the first convergence layer.
In the second convergence layer (the Cat_l2 layer), the convolution features of the 7th convolution block and of the 2nd layer in the 8th convolution block are extracted from the feature extraction network; the two extracted features are each subjected to layer regularization transformation and then aggregated, and finally a full-connection operation forms the high-level semantic features that serve as the output features of the second convergence layer (both layers are sketched in code below).
In the feature stacking layer, the output features of the first convergence layer, the second convergence layer, and the 8th, 9th, 10th and 11th convolution blocks of the feature extraction network are stacked to form the stacked features that serve as the output of the motion detector or characterization detector to which the backbone network belongs.
Note that in the feature stack layer, preferably 6 output features are stacked in the channel dimension to form a stacked feature.
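A minimal sketch of the two convergence layers follows, with GroupNorm over a single group (which normalizes each sample over all channels and positions, a common LayerNorm stand-in for convolutional maps) playing the role of the layer regularization transformation, bilinear resizing assumed for bringing the differently sized maps together, and a 1 × 1 convolution standing in for the full-connection operation; the channel counts and target sizes follow the feature maps named above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvergenceLayer(nn.Module):
    """Layer-regularize each input feature map, aggregate, then fuse."""
    def __init__(self, in_channels, out_channels, size):
        super().__init__()
        self.size = size
        # GroupNorm(1, C) normalizes over (C, H, W): layer regularization.
        self.norms = nn.ModuleList(nn.GroupNorm(1, c) for c in in_channels)
        self.fuse = nn.Conv2d(sum(in_channels), out_channels, 1)

    def forward(self, feats):
        resized = [F.interpolate(norm(f), size=self.size, mode='bilinear',
                                 align_corners=False)
                   for norm, f in zip(self.norms, feats)]
        return self.fuse(torch.cat(resized, dim=1))

# Cat_l1 aggregates conv4_3, conv5_3 and conv7; Cat_l2 aggregates conv7 and conv8_2.
cat_l1 = ConvergenceLayer([512, 512, 1024], out_channels=512, size=(38, 38))
cat_l2 = ConvergenceLayer([1024, 512], out_channels=1024, size=(19, 19))
```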
After the features output by the motion detector and the features output by the characterization detector are concatenated, a layer regularization transformation is applied to the concatenated features; the transformed features are then input into the classification layer and the regression layer respectively, with the classification layer outputting the class label of the anchor frame and the regression layer outputting the position coordinates of the anchor frame.
It should be noted that the classification layer and the regression layer can be implemented by two different fully connected layers.
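A minimal sketch of this dual-stream fusion and the two heads, assuming the stacked features of each detector have been flattened per anchor into vectors of feat_dim; all dimension names are illustrative.

```python
import torch
import torch.nn as nn

class DualStreamHead(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.norm = nn.LayerNorm(2 * feat_dim)               # layer regularization
        self.cls = nn.Linear(2 * feat_dim, num_classes + 1)  # classes + background
        self.reg = nn.Linear(2 * feat_dim, 4)                # (x, y, w, h)

    def forward(self, motion_feat, char_feat):
        # Concatenate the two streams, regularize, then classify and regress.
        fused = self.norm(torch.cat([motion_feat, char_feat], dim=-1))
        return self.cls(fused), self.reg(fused)
```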
Some existing target detection methods directly perform classification and position regression based on predefined anchor frames; however, a preset anchor frame cube specifies a fixed scale and size, has difficulty handling operation behaviors whose duration varies greatly, lacks flexibility, and often yields inaccurate detection results when short, fast actions occur. The anchor frame improvement submodule added in the invention is therefore used to generate corrected anchor frames and assist the training of the dual-stream structure enhanced detector. Accordingly, the anchor frame improvement submodule is enabled only in the model training phase and can be skipped directly in the model inference phase. Its specific form can be adjusted according to the actual situation.
In order to reduce internal covariate shift between convolutional layers and make the training process faster and more stable, the detector of the present invention employs a layer regularization transformation: a layer regularization transformation with skip connections is introduced before and after the specific convergence layers (which splice feature maps of different levels to detect targets of different sizes), and the layer regularization transformation is also applied before the final classification and regression layers.
As a preferred implementation form of the embodiment of the invention, detection precision is improved through an alignment convolution operation in the anchor frame improvement submodule. Suppose (x, y, w, h) represents a candidate box, where (x, y) represents the top-left corner coordinate and (w, h) represents the width and height of the box. For the k anchor frames, the alignment convolution is performed in the following two steps:
(a) First, in the feature map $f_{vgg}$, the region within the anchor frame is divided equally into p × p small blocks based on the position coordinates of the corrected anchor frame, where p is the size of the convolution kernel. The center position coordinate of the block in row i, column j can be calculated as:
$$\left(x + \frac{(2j+1)\,w}{2p},\; y + \frac{(2i+1)\,h}{2p}\right), \qquad i, j = 0, 1, \dots, p-1$$
(b) Second, the feature value at the center of each small block is multiplied by the corresponding convolution kernel parameter, and a weighted transformation generates the feature values aligned with the corrected anchor frame, forming a 512-dimensional intermediate vector.
The 512-dimensional intermediate vector is input into two parallel fully connected layers to perform 2k confidence score predictions and 4k position coordinate (x, y, w, h) regressions, finally obtaining the corrected candidate region box; a sketch of step (a) follows.
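A minimal sketch of step (a), computing the p × p block centers inside a corrected anchor box (x, y, w, h) with top-left corner (x, y); step (b) would then sample the feature map at these centers (e.g. bilinearly) and weight the sampled values with the corresponding kernel parameters. The zero-based indexing convention matches the formula reconstructed above.

```python
import torch

def block_centers(x, y, w, h, p=3):
    """Centers of the p x p blocks that evenly partition the anchor box."""
    i = torch.arange(p, dtype=torch.float32).view(p, 1)  # row index
    j = torch.arange(p, dtype=torch.float32).view(1, p)  # column index
    cx = x + (2 * j + 1) * w / (2 * p)                   # shape (1, p)
    cy = y + (2 * i + 1) * h / (2 * p)                   # shape (p, 1)
    return cx.expand(p, p), cy.expand(p, p)              # per-block (cx, cy)
```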
In the present invention, as shown in fig. 3, the anchor frame improvement submodule can be implemented as follows:
First, the feature map f_vgg output by the fifth convolution block of the feature extraction network is obtained; then k = 9 anchor boxes are set at each anchor point of f_vgg, and each anchor box is convolved with a 3 × 3 sliding window to form a 512-dimensional vector. That is, on the third-layer feature map of the fifth convolution block of the dual-stream structure, a small network slides over the convolution layer: a 3 × 3 window fully connects to the input feature map, and each sliding window position is mapped to a 512-dimensional intermediate vector.
Then, the 512-dimensional vector is input into two parallel fully connected layers: one (the cls layer) outputs scores judging whether each anchor box belongs to the foreground or the background (2k scores), and the other (the reg layer) outputs the position coordinates of each anchor box (4k values), thereby obtaining the position coordinates of the corrected anchor boxes; a code sketch of this submodule follows.
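A minimal sketch of the submodule described above: a 3 × 3 convolution slides over f_vgg to produce the 512-dimensional intermediate vector at each location, and two parallel heads map it to 2k foreground/background scores and 4k coordinate values. As in the region proposal network this resembles, the per-location fully connected layers are realized as 1 × 1 convolutions; that equivalence, and the channel counts, are implementation assumptions.

```python
import torch.nn as nn

class AnchorRefinementSubmodule(nn.Module):
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.slide = nn.Conv2d(in_channels, 512, 3, padding=1)  # 3x3 sliding window
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(512, 2 * k, 1)  # 2k foreground/background scores
        self.reg = nn.Conv2d(512, 4 * k, 1)  # 4k values: (x, y, w, h) per anchor

    def forward(self, f_vgg):
        mid = self.relu(self.slide(f_vgg))   # 512-d intermediate vector per point
        return self.cls(mid), self.reg(mid)
```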
As a preferred implementation form of the embodiment of the present invention, the size parameter p of the convolution kernel may be set to 3.
S3, extracting a workflow video to be detected containing a complete workflow from a production operation video, dividing the workflow video to be detected into groups of K frames to form a series of video segments, computing an optical flow image for each video segment using the TVL1 algorithm as a first input image, and sampling one video frame from the video segment as a second input image; inputting the first input image and the second input image into the trained dual-stream structure enhanced detector, and outputting the anchor frame class label and the position coordinates of the detection target through the classification layer and the regression layer respectively, forming the detection result of the video segment.
And S4, based on the detection results of all video segments in the workflow video to be detected, generating the classification and temporal localization regression results of the production operation behaviors in the workflow video to be detected by using an action tube generation algorithm (a sketch of one possible tube-linking scheme follows).
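The patent does not spell out the action tube generation algorithm of S4, so the following greedy linking sketch is only an assumed baseline: per-frame detections of one class are chained whenever their spatial IoU with the tail of an open tube is high enough. All thresholds and the data layout are illustrative.

```python
def iou(a, b):
    # a, b: (x, y, w, h) boxes with top-left corner (x, y).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2 = min(a[0] + a[2], b[0] + b[2])
    iy2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter + 1e-9)

def link_tubes(frame_dets, iou_thr=0.5):
    """frame_dets: per-frame lists of (box, score) for a single class."""
    tubes = []
    for t, dets in enumerate(frame_dets):
        for box, score in dets:
            # Extend the best tube that ended on the previous frame, if any.
            open_tubes = [tb for tb in tubes if tb['end'] == t - 1
                          and iou(tb['boxes'][-1], box) >= iou_thr]
            if open_tubes:
                best = max(open_tubes, key=lambda tb: iou(tb['boxes'][-1], box))
                best['boxes'].append(box)
                best['scores'].append(score)
                best['end'] = t
            else:
                tubes.append({'boxes': [box], 'scores': [score], 'end': t})
    return tubes
```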
It should be noted that the production operation video is sourced in real time from the video monitoring equipment in the factory production scene, so that the whole production operation process is detected in real time by the dual-stream structure enhanced detector, realizing detection of workers' production operation behaviors. Of course, the production operation video may also come from an offline video; the invention is not limited in this respect.
The training of the dual-stream structure enhanced detector can follow model training in the prior art so as to meet the final prediction accuracy. The training data set can be divided into a training set and a test set according to actual needs. As a preferred implementation form of the embodiment of the invention, 90% of the samples in the training data set are selected as the training set and the rest as the verification set; the dual-stream structure enhanced detector is trained and tested, and continuously tuned and optimized according to the resulting metrics.
When the dual-stream structure enhanced detector is trained, the total loss function must be set reasonably and must account for three losses: the first part is the final classification loss of the model, the second part is the final position coordinate regression loss of the model, and the third part is the auxiliary loss introduced by the anchor frame improvement submodule.
As a preferred implementation form of the embodiment of the present invention, the total loss function L of the dual-stream structure enhanced detector during training may be set as:

$$L = \frac{1}{N}\left(L_{conf} + L_{reg}\right) + Loss_{ARS}$$

wherein: $N$ represents the total number of positive samples, and $i$ and $j$ are respectively the category index and the counting index in the training sample; $x_{ij}$ is a binary variable that is 1 when the anchor frame sample is a positive sample and 0 otherwise; $L_{conf}$ represents the classification loss function, formulated as:

$$L_{conf} = -\sum_{i \in \Phi} x_{ij}\, \log \hat{c}_i^{\,y} - \sum_{i \in \bar{\Phi}} \log \hat{c}_i^{\,0}$$

wherein: $\Phi$ and $\bar{\Phi}$ respectively represent the anchor frame cube positive sample set and negative sample set; $\hat{c}_i^{\,y}$ represents the confidence score that the predicted anchor frame cube belongs to label $y$; $\hat{c}_i^{\,0}$ represents the confidence score that the predicted anchor frame cube belongs to the background; $L_{reg}$ represents the regression loss function, formulated as:

$$L_{reg} = \frac{1}{T} \sum_{t=1}^{T} \sum_{q \in \{x, y, w, h\}} \mathrm{SmoothL1}\left(\hat{l}_t^{\,q} - g_t^{\,q}\right)$$

wherein: $T$ represents the total frame number of the workflow video, $(x, y)$ is the center of each anchor frame in the action micro-tube, and $w$ and $h$ are respectively the width and height of the anchor frame; $\hat{l}_t^{\,q}$ represents the $q$ coordinate of the anchor frame on frame $f_t$ after regression; $g_t^{\,q}$ represents the annotated ground-truth anchor frame coordinate; $Loss_{ARS}$, the loss function of the anchor frame improvement submodule, is:

$$Loss_{ARS} = \frac{1}{N_{cls}} \sum_{i} L_{cls}\left(p_i, p_i^{*}\right) + \lambda \frac{1}{N_{reg}} \sum_{i} p_i^{*}\, L_{reg}\left(o_i, o_i^{*}\right)$$

wherein: $i$ denotes the index of the anchor frame in each training sample batch; $p_i$ indicates the probability that anchor frame $i$ contains an action target; $p_i^{*}$ indicates the binary label, with $p_i^{*} = 1$ when the anchor frame is a positive sample and $p_i^{*} = 0$ otherwise; $o_i$ and $o_i^{*}$ are respectively the predicted coordinate offset of the anchor frame and the corresponding target offset; $\lambda$ is a balance coefficient between the two terms; $L_{cls}$ represents the cross entropy loss function; $L_{reg}$ represents the SmoothL1 loss function; $N_{cls}$ represents the total number of classification samples and $N_{reg}$ the total number of regression samples; a sample participates in position regression if and only if $p_i^{*} = 1$.
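A minimal sketch of assembling the three loss parts during training, matching the reconstructed formulas above; the weighting lam between the two ARS terms and the exact tensor layout are assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(cls_logits, labels, pos_mask, pred_boxes, gt_boxes,
               ars_logits, ars_labels, ars_offsets, ars_targets, lam=1.0):
    """cls_logits: (A, C+1); labels: (A,); pos_mask: (A,) bool; boxes: (A, 4)."""
    n_pos = pos_mask.sum().clamp(min=1).float()
    # L_conf: cross entropy over all anchors (positives plus background negatives).
    l_conf = F.cross_entropy(cls_logits, labels, reduction='sum') / n_pos
    # L_reg: SmoothL1 over (x, y, w, h) of the positive anchors in the micro-tube.
    l_reg = F.smooth_l1_loss(pred_boxes[pos_mask], gt_boxes[pos_mask],
                             reduction='sum') / n_pos
    # Loss_ARS: auxiliary loss of the anchor frame improvement submodule; only
    # samples with p_i* = 1 participate in the offset regression term.
    ars_pos = ars_labels > 0
    l_cls = F.binary_cross_entropy_with_logits(ars_logits, ars_labels.float())
    if ars_pos.any():
        l_off = F.smooth_l1_loss(ars_offsets[ars_pos], ars_targets[ars_pos])
    else:
        l_off = ars_offsets.sum() * 0.0   # no positive ARS samples in this batch
    return l_conf + l_reg + l_cls + lam * l_off
```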
The dual-stream structure enhanced detector for solving the workflow recognition problem mainly comprises the following key functional components: a dual-stream feature extraction module, an anchor frame improvement submodule, and a multi-level feature map aggregation strategy combined with layer regularization transformation.
The dual-stream feature extraction module mainly extracts motion features at two levels, a spatial stream (single-frame images) and an optical flow stream (optical flow images); a characterization detector and a motion detector are designed respectively and integrated in one framework to detect moving targets appearing on video frames. The design of this module draws on the SSD model and is responsible for detecting moving targets of different scales on specific feature maps. Meanwhile, it inherits the advantages of both one-stage and two-stage detection frameworks: it can obtain higher accuracy than two-stage detection methods without constructing very deep convolution layers, while keeping the high efficiency of one-stage methods.
The anchor frame improvement submodule addresses the large duration variation of different action behaviors in production operations: when candidate proposal segments of different scales are predicted from the same feature representation, a misalignment exists between the temporal range of the feature and the span of the anchor frame, so the relevant information cannot be captured. The anchor frame improvement submodule is therefore able to correct the preset series of fixed anchor frame cubes of different sizes and proportions and output more accurate target detection boxes for the subsequent linking of production action tubes.
The multi-level feature map aggregation strategy and the layer regularization transformation are respectively used to avoid losing the target subject during detection and to reduce internal covariate shift between convolution layers, making the training process faster and more stable. Because directly stacking multi-level feature maps cannot fully exploit the complementarity between low-level edge information and high-level semantic information, the multi-level feature map aggregation strategy enables better detection of objects with large scale changes and is particularly effective for small-object detection. The layer regularization transformation sharply reduces the covariate shift inside the convolutional layers, so the detector can be trained at a higher learning rate without excessive attention to initial parameter settings.
The dual-stream structure enhanced detector proposed by the invention solves several problems of workflow detectors used in existing complex manufacturing environments. Improving detector performance is often neglected; when rapid motion occurs in a complex manufacturing environment, a weak detector often loses the detection target over consecutive frames, which directly hinders subsequent action tube generation and finally leads to unsatisfactory detection results. In addition, on video clips a general-purpose detector cannot model the change process of a motion, because it defines the region of interest with only a fixed 3D anchor cube and cannot adapt well to the rapid changes of a moving target's position and size over time.
The above-described embodiments are merely preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims (10)

1. A workflow detection method based on a dual-stream structure enhanced detector, characterized by comprising the following steps:
S1, aiming at each workflow video in the workflow video data set, dividing the workflow video into groups of K frames to form a series of video segments; for each video segment, computing an optical flow image using the TVL1 algorithm as a first input image, and sampling one video frame from the video segment as a second input image; forming a training sample from a group consisting of the first input image, the second input image, and the positions and class labels of manually annotated targets in the images; forming a training data set from a series of training samples;
S2, training a dual-stream structure enhanced detector by using consecutive training samples in the training data set, so that the detector can detect target positions and classes from images;
the dual-stream structure enhanced detector comprises two parallel detectors, namely a motion detector and a characterization detector, as well as a classification layer and a regression layer;
the motion detector and the characterization detector have the same network structure, each comprising an anchor frame improvement submodule and a backbone network consisting of a feature extraction network, a first convergence layer, a second convergence layer and a feature stacking layer, but their detector inputs differ: the detector input of the motion detector is the first input image, and the detector input of the characterization detector is the second input image;
the anchor frame improvement submodule is used for generating corrected anchor frames and assisting the training of the dual-stream structure enhanced detector;
in the backbone network, the feature extraction network is composed of 11 cascaded convolution blocks, wherein the first 5 convolution blocks are the first 5 convolution blocks of the VGG-16 model and the last 6 convolution blocks are new convolution blocks; the 6th convolution block contains only one convolution layer, its output feature map size is 38 × 38 × 512, the activation function is ReLU, and the number of anchor frames per pixel position is set to 4; the 7th convolution block contains only one convolution layer, its output feature map size is 19 × 19 × 1024, the activation function is ReLU, and the number of anchor frames per pixel position is set to 6; the 8th convolution block contains two convolution layers, its output feature map size is 10 × 10 × 512, the activation function is ReLU, and the number of anchor frames per pixel position is set to 6; the 9th convolution block contains two convolution layers, its output feature map size is 5 × 5 × 256, the activation function is ReLU, and the number of anchor frames per pixel position is set to 6; the 10th convolution block contains two convolution layers, its output feature map size is 3 × 3 × 256, the activation function is ReLU, and the number of anchor frames per pixel position is set to 4; the 11th convolution block contains two convolution layers, its output feature map size is 1 × 1 × 256, the activation function is ReLU, and the number of anchor frames per pixel position is set to 4; in the first convergence layer, the convolution features of the 3rd layer in the 4th convolution block, of the 3rd layer in the 5th convolution block, and of the 7th convolution block are extracted from the feature extraction network, each subjected to layer regularization transformation and then aggregated, and finally a full-connection operation forms the low-level edge features serving as the output features of the first convergence layer; in the second convergence layer, the convolution features of the 7th convolution block and of the 2nd layer in the 8th convolution block are extracted from the feature extraction network, each subjected to layer regularization transformation and then aggregated, and finally a full-connection operation forms the high-level semantic features serving as the output features of the second convergence layer; in the feature stacking layer, the output features of the first convergence layer, the second convergence layer, and the 8th, 9th, 10th and 11th convolution blocks of the feature extraction network are stacked to form the stacked features serving as the output of the motion detector or characterization detector to which the backbone network belongs;
the features output by the motion detector and the features output by the characterization detector are concatenated and then subjected to layer regularization transformation; the transformed features are input into the classification layer and the regression layer respectively, the classification layer outputting the class label of the anchor frame and the regression layer outputting the position coordinates of the anchor frame;
S3, extracting a workflow video to be detected containing a complete workflow from a production operation video, dividing the workflow video to be detected into groups of K frames to form a series of video segments, computing an optical flow image for each video segment using the TVL1 algorithm as a first input image, and sampling one video frame from the video segment as a second input image; inputting the first input image and the second input image into the trained dual-stream structure enhanced detector, and outputting the anchor frame class label and the position coordinates of the detection target through the classification layer and the regression layer respectively, forming the detection result of the video segment;
and S4, based on the detection results of all video segments in the workflow video to be detected, generating the classification and temporal localization regression results of the production operation behaviors in the workflow video to be detected by using an action tube generation algorithm.
2. The workflow detection method based on the dual-stream structure enhanced detector as claimed in claim 1, wherein, in the anchor frame improvement submodule, the feature map f_vgg output by the fifth convolution block of the feature extraction network is first obtained; then k = 9 anchor frames are set at each position point of the feature map f_vgg, and each anchor frame is convolved with a 3 × 3 window to form a 512-dimensional vector; the 512-dimensional vector is input into two parallel fully connected layers, one of which outputs a score judging whether the anchor frame belongs to the foreground or the background, and the other outputs the position coordinates of the anchor frame, thereby obtaining the corrected anchor frame.
3. The dual-stream structure-enhanced detector-based workflow detection method of claim 2, wherein the size parameter p of the convolution kernel is 3.
4. The workflow detection method based on the dual-stream structure enhanced detector as claimed in claim 1, wherein the total loss function of the dual-stream structure enhanced detector is set as:

$$L = \frac{1}{N}\left(L_{conf} + L_{reg}\right) + Loss_{ARS}$$

wherein: $N$ represents the total number of positive samples, and $i$ and $j$ are respectively the category index and the counting index in the training sample; $x_{ij}$ is a binary variable that is 1 when the anchor frame sample is a positive sample and 0 otherwise; $L_{conf}$ represents the classification loss function, formulated as:

$$L_{conf} = -\sum_{i \in \Phi} x_{ij}\, \log \hat{c}_i^{\,y} - \sum_{i \in \bar{\Phi}} \log \hat{c}_i^{\,0}$$

wherein: $\Phi$ and $\bar{\Phi}$ respectively represent the anchor frame cube positive sample set and negative sample set; $\hat{c}_i^{\,y}$ represents the confidence score that the predicted anchor frame cube belongs to label $y$; $\hat{c}_i^{\,0}$ represents the confidence score that the predicted anchor frame cube belongs to the background; $L_{reg}$ represents the regression loss function, formulated as:

$$L_{reg} = \frac{1}{T} \sum_{t=1}^{T} \sum_{q \in \{x, y, w, h\}} \mathrm{SmoothL1}\left(\hat{l}_t^{\,q} - g_t^{\,q}\right)$$

wherein: $T$ represents the total frame number of the workflow video, $(x, y)$ is the center of each anchor frame in the action micro-tube, and $w$ and $h$ are respectively the width and height of the anchor frame; $\hat{l}_t^{\,q}$ represents the $q$ coordinate of the anchor frame on frame $f_t$ after regression; $g_t^{\,q}$ represents the annotated ground-truth anchor frame coordinate; $Loss_{ARS}$, the loss function of the anchor frame improvement submodule, is:

$$Loss_{ARS} = \frac{1}{N_{cls}} \sum_{i} L_{cls}\left(p_i, p_i^{*}\right) + \lambda \frac{1}{N_{reg}} \sum_{i} p_i^{*}\, L_{reg}\left(o_i, o_i^{*}\right)$$

wherein: $i$ denotes the index of the anchor frame in each training sample batch; $p_i$ indicates the probability that anchor frame $i$ contains an action target; $p_i^{*}$ indicates the binary label, with $p_i^{*} = 1$ when the anchor frame is a positive sample and $p_i^{*} = 0$ otherwise; $o_i$ and $o_i^{*}$ are respectively the predicted coordinate offset of the anchor frame and the corresponding target offset; $\lambda$ is a balance coefficient between the two terms; $L_{cls}$ represents the cross entropy loss function; $L_{reg}$ represents the SmoothL1 loss function; $N_{cls}$ represents the total number of classification samples and $N_{reg}$ the total number of regression samples; a sample participates in position regression if and only if $p_i^{*} = 1$.
5. The workflow detection method based on the dual-stream structure enhanced detector as claimed in claim 1, wherein 90% of the samples in the training data set are selected as the training set and the rest as the verification set; the dual-stream structure enhanced detector is trained and tested, and is continuously tuned and optimized according to the resulting metrics.
6. The workflow detection method based on the dual-stream structure enhanced detector as claimed in claim 1, wherein the production operation video is sourced in real time from video monitoring equipment in a factory production scene, and the whole production operation process is detected in real time by the dual-stream structure enhanced detector, realizing detection of workers' production operation behaviors.
7. The workflow detection method based on the dual-stream structure enhanced detector as recited in claim 1, wherein the position coordinates of the anchor frame are expressed as (x, y, w, h), wherein (x, y) represents the upper left corner coordinates of the anchor frame, and (w, h) represents the width and height of the anchor frame.
8. The dual-stream structure enhanced detector-based workflow detection method of claim 1, wherein, in the feature stacking layer, the 6 output features are stacked in the channel dimension to form the stacked features.
9. The dual-stream architecture enhanced detector-based workflow detection method of claim 1, wherein the anchor frame improvement sub-module is enabled only in a model training phase and is skipped in a model reasoning phase.
10. The dual-stream structure enhanced detector-based workflow detection method of claim 1, wherein each of said video segments comprises K = 8 video frames.
CN202210574109.0A 2022-05-24 2022-05-24 Workflow detection method based on dual-stream structure enhanced detector Pending CN114943452A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210574109.0A CN114943452A (en) Workflow detection method based on dual-stream structure enhanced detector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210574109.0A CN114943452A (en) Workflow detection method based on dual-stream structure enhanced detector

Publications (1)

Publication Number Publication Date
CN114943452A true CN114943452A (en) 2022-08-26

Family

ID=82909785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210574109.0A Pending CN114943452A (en) Workflow detection method based on dual-stream structure enhanced detector

Country Status (1)

Country Link
CN (1) CN114943452A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543513A (en) * 2018-10-11 2019-03-29 平安科技(深圳)有限公司 Method, apparatus, equipment and the storage medium that intelligent monitoring is handled in real time
CN110909658A (en) * 2019-11-19 2020-03-24 北京工商大学 Method for recognizing human body behaviors in video based on double-current convolutional network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543513A (en) * 2018-10-11 2019-03-29 平安科技(深圳)有限公司 Method, apparatus, equipment and the storage medium that intelligent monitoring is handled in real time
CN110909658A (en) * 2019-11-19 2020-03-24 北京工商大学 Method for recognizing human body behaviors in video based on double-current convolutional network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MIN ZHANG et al.: "Action detection with two-stream", The Visual Computer, vol. 39, 5 February 2022 (2022-02-05), pages 1193-1204 *

Similar Documents

Publication Publication Date Title
Liu et al. Exploiting unlabeled data in cnns by self-supervised learning to rank
CN112884064B (en) Target detection and identification method based on neural network
CN115171165A (en) Pedestrian re-identification method and device with global features and step-type local features fused
CN106980895A (en) Convolutional neural networks Forecasting Methodology based on rotary area
CN110210335B (en) Training method, system and device for pedestrian re-recognition learning model
Bhattacharyya et al. Deformable PV-RCNN: Improving 3D object detection with learned deformations
CN109800712B (en) Vehicle detection counting method and device based on deep convolutional neural network
Li et al. Coda: Counting objects via scale-aware adversarial density adaption
CN114821390B (en) Method and system for tracking twin network target based on attention and relation detection
Cepni et al. Vehicle detection using different deep learning algorithms from image sequence
CN114913150A (en) Intelligent identification method for concrete dam defect time sequence image
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN116503399B (en) Insulator pollution flashover detection method based on YOLO-AFPS
CN114155474A (en) Damage identification technology based on video semantic segmentation algorithm
Avola et al. A shape comparison reinforcement method based on feature extractors and f1-score
Yang et al. BANDT: A border-aware network with deformable transformers for visual tracking
Niu et al. Boundary-aware RGBD salient object detection with cross-modal feature sampling
CN110070023B (en) Self-supervision learning method and device based on motion sequential regression
Gopal et al. Tiny object detection: Comparative study using single stage CNN object detectors
Liu et al. Find small objects in UAV images by feature mining and attention
CN114492755A (en) Target detection model compression method based on knowledge distillation
Kalva et al. Smart Traffic monitoring system using YOLO and deep learning techniques
CN116977859A (en) Weak supervision target detection method based on multi-scale image cutting and instance difficulty
CN117237751A (en) Training method, recognition method, system and equipment for grabbing detection model
Liu et al. Siamese network with bidirectional feature pyramid for small target tracking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination