CN114943452A - Workflow detection method based on dual-stream structure enhanced detector - Google Patents

Workflow detection method based on dual-stream structure enhanced detector

Info

Publication number
CN114943452A
CN114943452A (application CN202210574109.0A)
Authority
CN
China
Prior art keywords
detector
convolution
layer
anchor frame
anchor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210574109.0A
Other languages
Chinese (zh)
Inventor
Hu Haiyang
Zhang Min
Li Zhongjin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University Shangyu Science and Engineering Research Institute Co Ltd
Original Assignee
Hangzhou Dianzi University Shangyu Science and Engineering Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University Shangyu Science and Engineering Research Institute Co Ltd filed Critical Hangzhou Dianzi University Shangyu Science and Engineering Research Institute Co Ltd
Priority to CN202210574109.0A priority Critical patent/CN114943452A/en
Publication of CN114943452A publication Critical patent/CN114943452A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00: Administration; Management
    • G06Q 10/06: Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q 10/063: Operations research, analysis or management
    • G06Q 10/0633: Workflow analysis
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Educational Administration (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a workflow detection method based on a dual-stream structure enhanced detector. The invention combines a characterization detector and a motion detector in a dual-stream configuration to predict the moving targets appearing on a frame. Meanwhile, an anchor frame improvement submodule with a feature alignment property is introduced into the detector, and an adaptive anchor frame cube is output according to the candidate frames detected on each frame. In order to improve the capture of moving targets, a hierarchical aggregation strategy is applied in the detector to improve the discriminability of the model's intermediate feature maps. In addition, layer regularization is used to reduce internal covariate shift between the detector's internal layers, making the whole training process more efficient and stable. Finally, based on the extracted salient features, a spatio-temporal action tube generation branch completes the classification and localization regression of production operation behaviors. The invention can be deployed in a factory production scene to detect the whole production operation process in real time, realizing detection of workers' production operation behaviors.

Description

Workflow detection method based on dual-stream structure enhanced detector
Technical Field
The invention belongs to the technical field of workflow detection, and particularly relates to a workflow detection method based on a dual-stream structure enhanced detector.
Background
At present, the main measures for intelligent factory retrofitting include installing digital CPS systems, laying out sensor equipment, and erecting multi-view production monitoring cameras to monitor and record the whole production process in real time. However, monitoring operators focus mainly on production safety management rather than on analysis and improvement of the production process and, limited by human energy and physical factors, cannot observe and supervise the entire production flow. In addition, most manufacturing enterprises focus on profit and do not employ professional personnel to effectively analyze industrial big data such as CPS (cyber-physical system) records, sensor data and production logs, so massive industrial big data remain insufficiently mined and used.
The use of computer vision technology in production has received increasing attention from researchers over the past decade. Although many advanced detection models have been developed in laboratories, complex manufacturing environments involve factors such as fast motion, short action duration, occlusion and viewpoint change. Behavior recognition in factory production environments therefore still leaves a large space for exploration. Most detection models perform detection at the frame level or video segment level, and the frame-level or segment-level candidate boxes are then tracked or linked along the time sequence to form a spatio-temporal action tube. Although these object-detection-based methods achieve some effect, they do not fully utilize the motion information between consecutive frames but process video frames as a series of independent RGB pictures, and detection errors often occur when they are applied to an industrial production environment.
Disclosure of Invention
The invention aims to solve the problem that improving the detector itself is often neglected in existing workflow recognition models, and provides a workflow detection method based on a dual-stream structure enhanced detector to realize accurate detection of operation behaviors.
The invention adopts the following specific technical scheme:
A workflow detection method based on a dual-stream structure enhanced detector comprises the following steps:
S1, aiming at each workflow video in the workflow video data set, dividing the workflow video into groups of K frames to form a series of video segments; for each video segment, computing an optical flow image using the TVL1 algorithm as a first input image, and sampling one video frame from the video segment as a second input image; forming a training sample from a group consisting of the first input image, the second input image, and the positions and class labels of manually annotated targets in the images; forming a training data set from a series of training samples;
S2, training a dual-stream structure enhanced detector by using consecutive training samples in the training data set, so that the detector can detect target positions and classes from images;
the dual-stream structure enhanced detector comprises two parallel detectors, namely a motion detector and a characterization detector, as well as a classification layer and a regression layer;
the motion detector and the characterization detector have the same network structure, each comprising an anchor frame improvement submodule and a backbone network consisting of a feature extraction network, a first convergence layer, a second convergence layer and a feature stacking layer, but their detector inputs differ: the detector input of the motion detector is the first input image, and the detector input of the characterization detector is the second input image;
the anchor frame improvement submodule is used for generating corrected anchor frames and assisting the training of the dual-stream structure enhanced detector;
in the backbone network, the feature extraction network is composed of 11 cascaded convolution blocks, wherein the first 5 convolution blocks are the first 5 convolution blocks of the VGG-16 model and the last 6 convolution blocks are new convolution blocks; the 6th convolution block contains only one convolution layer, its output feature map size is 38 × 38 × 512, the activation function is ReLU, and the number of anchor frames per pixel position is set to 4; the 7th convolution block contains only one convolution layer, its output feature map size is 19 × 19 × 1024, the activation function is ReLU, and the number of anchor frames per pixel position is set to 6; the 8th convolution block contains two convolution layers, its output feature map size is 10 × 10 × 512, the activation function is ReLU, and the number of anchor frames per pixel position is set to 6; the 9th convolution block contains two convolution layers, its output feature map size is 5 × 5 × 256, the activation function is ReLU, and the number of anchor frames per pixel position is set to 6; the 10th convolution block contains two convolution layers, its output feature map size is 3 × 3 × 256, the activation function is ReLU, and the number of anchor frames per pixel position is set to 4; the 11th convolution block contains two convolution layers, its output feature map size is 1 × 1 × 256, the activation function is ReLU, and the number of anchor frames per pixel position is set to 4; in the first convergence layer, the convolution features of the 3rd layer in the 4th convolution block, of the 3rd layer in the 5th convolution block, and of the 7th convolution block are extracted from the feature extraction network, each subjected to layer regularization transformation and then aggregated, and finally a full-connection operation forms the low-level edge features serving as the output features of the first convergence layer; in the second convergence layer, the convolution features of the 7th convolution block and of the 2nd layer in the 8th convolution block are extracted from the feature extraction network, each subjected to layer regularization transformation and then aggregated, and finally a full-connection operation forms the high-level semantic features serving as the output features of the second convergence layer; in the feature stacking layer, the output features of the first convergence layer, the second convergence layer, and the 8th, 9th, 10th and 11th convolution blocks of the feature extraction network are stacked to form the stacked features serving as the output of the motion detector or characterization detector to which the backbone network belongs;
the features output by the motion detector and the features output by the characterization detector are concatenated and then subjected to layer regularization transformation; the transformed features are input into the classification layer and the regression layer respectively, the classification layer outputting the class label of the anchor frame and the regression layer outputting the position coordinates of the anchor frame;
S3, extracting a workflow video to be detected containing a complete workflow from a production operation video, dividing the workflow video to be detected into groups of K frames to form a series of video segments, computing an optical flow image for each video segment using the TVL1 algorithm as a first input image, and sampling one video frame from the video segment as a second input image; inputting the first input image and the second input image into the trained dual-stream structure enhanced detector, and outputting the anchor frame class label and the position coordinates of the detection target through the classification layer and the regression layer respectively, forming the detection result of the video segment;
and S4, based on the detection results of all video segments in the workflow video to be detected, generating the classification and temporal localization regression results of the production operation behaviors in the workflow video to be detected by using an action tube generation algorithm.
Preferably, in the anchor frame improvement submodule, the feature map f_vgg output by the fifth convolution block of the feature extraction network is first obtained; then k = 9 anchor frames are set at each position point of the feature map f_vgg, and each anchor frame is convolved with a 3 × 3 window to form a 512-dimensional vector; the 512-dimensional vector is input into two parallel fully connected layers, one of which outputs a score judging whether the anchor frame belongs to the foreground or the background, and the other outputs the position coordinates of the anchor frame, thereby obtaining the corrected anchor frame.
Preferably, the size parameter p of the convolution kernel is 3.
Preferably, when the dual-stream structure enhanced detector is trained, the total loss function is set as:

$$L = \frac{1}{N}\left(L_{conf} + L_{reg}\right) + Loss_{ARS}$$

wherein: $N$ represents the total number of positive samples, and $i$ and $j$ are respectively the category index and the counting index in the training sample; $x_{ij}$ is a binary variable that is 1 when the anchor frame sample is a positive sample and 0 otherwise; $L_{conf}$ represents the classification loss function, formulated as:

$$L_{conf} = -\sum_{i \in \Phi} x_{ij}\, \log \hat{c}_i^{\,y} - \sum_{i \in \bar{\Phi}} \log \hat{c}_i^{\,0}$$

wherein: $\Phi$ and $\bar{\Phi}$ respectively represent the anchor frame cube positive sample set and negative sample set; $\hat{c}_i^{\,y}$ represents the confidence score that the predicted anchor frame cube belongs to label $y$; $\hat{c}_i^{\,0}$ represents the confidence score that the predicted anchor frame cube belongs to the background; $L_{reg}$ represents the regression loss function, formulated as:

$$L_{reg} = \frac{1}{T} \sum_{t=1}^{T} \sum_{q \in \{x, y, w, h\}} \mathrm{SmoothL1}\left(\hat{l}_t^{\,q} - g_t^{\,q}\right)$$

wherein: $T$ represents the total frame number of the workflow video, $(x, y)$ is the center of each anchor frame in the action micro-tube, and $w$ and $h$ are respectively the width and height of the anchor frame; $\hat{l}_t^{\,q}$ represents the $q$ coordinate of the anchor frame on frame $f_t$ after regression; $g_t^{\,q}$ represents the annotated ground-truth anchor frame coordinate; $Loss_{ARS}$, the loss function of the anchor frame improvement submodule, is:

$$Loss_{ARS} = \frac{1}{N_{cls}} \sum_{i} L_{cls}\left(p_i, p_i^{*}\right) + \lambda \frac{1}{N_{reg}} \sum_{i} p_i^{*}\, L_{reg}\left(o_i, o_i^{*}\right)$$

wherein: $i$ denotes the index of the anchor frame in each training sample batch; $p_i$ indicates the probability that anchor frame $i$ contains an action target; $p_i^{*}$ indicates the binary label, with $p_i^{*} = 1$ when the anchor frame is a positive sample and $p_i^{*} = 0$ otherwise; $o_i$ and $o_i^{*}$ are respectively the predicted coordinate offset of the anchor frame and the corresponding target offset; $\lambda$ is a balance coefficient between the two terms; $L_{cls}$ represents the cross entropy loss function; $L_{reg}$ represents the SmoothL1 loss function; $N_{cls}$ represents the total number of classification samples and $N_{reg}$ the total number of regression samples; a sample participates in position regression if and only if $p_i^{*} = 1$.
Preferably, 90% of the samples in the training data set are selected as the training set and the rest as the verification set; the dual-stream structure enhanced detector is trained and tested, and is continuously tuned and optimized according to the resulting metrics.
Preferably, the production operation video is sourced in real time from video monitoring equipment in a factory production scene, and the whole production operation process is detected in real time by the dual-stream structure enhanced detector, realizing detection of workers' production operation behaviors.
Preferably, the position coordinates of the anchor frame are expressed as (x, y, w, h), where (x, y) represents the coordinates of the upper left corner of the anchor frame, and (w, h) represents the width and height of the anchor frame.
Preferably, in the feature stack layer, 6 output features are stacked in the channel dimension to form a stacked feature.
Preferably, the anchor frame improvement submodule is enabled only in the model training phase and is skipped in the model inference phase.
Preferably, each of the video segments includes 8 frames.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a workflow detection method based on a double-flow structure enhanced detector, which combines a characterization detector and a motion detector in a double-flow structure mode to predict a motion target appearing on a frame. Meanwhile, in order to improve the preset 3D anchor frames with different scales, an anchor frame improvement submodule with a characteristic alignment characteristic is introduced into the detector, and an adaptive cubic anchor frame is output according to the candidate frames detected on each frame. In order to improve the capturing capability of the moving target, a hierarchical aggregation strategy is applied in the detector to improve the discrimination of the characteristic diagram of the middle layer of the model. In addition, the layer regularization is used for reducing the internal covariate offset phenomenon between the internal layers of the detector, so that the whole training process is more efficient and stable. And finally, based on the extracted significant features, generating branches by utilizing an air-time domain action pipeline to finish classification and positioning regression of production operation behaviors. The invention can be deployed in a factory production scene, detects the whole production operation process in real time and realizes the detection of the production operation behavior of workers.
Drawings
Fig. 1 is a flow chart of a workflow detection method based on a dual-stream structure enhanced detector.
Fig. 2 is a model structure diagram of a dual-stream structure enhanced detector.
Fig. 3 is a schematic diagram of an anchor frame improvement submodule.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. The technical characteristics in the embodiments of the present invention can be combined correspondingly without mutual conflict.
In the description of the present invention, it is to be understood that the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or to implicitly indicate the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature.
For convenience of description, some concept definitions and notation of the present invention are explained first:
conv4_3: the third convolution layer in the fourth convolution block of the detector;
conv5_3: the third convolution layer in the fifth convolution block of the detector;
conv7: the seventh convolution block of the detector;
conv8_2: the second convolution layer in the eighth convolution block of the detector;
Cat_l1, Cat_l2: the first and second convergence layers generated by aggregation, respectively;
$\{m_c^i\}_{i=1}^{n}$: the $n$ action micro-tubes of class $c$;
$ov(m_c^i, \hat{m}_c^j) > \tau$: the temporal overlap between action micro-tube $m_c^i$ and candidate micro-tube $\hat{m}_c^j$ is greater than $\tau$;
nms: the non-maximum suppression algorithm;
(x, y, w, h): a candidate box, i.e., an anchor box, where (x, y) is the top-left corner coordinate and (w, h) the width and height of the box. An anchor box is the bounding rectangle of an object in an image, and the anchor boxes on a set of consecutive frames form an anchor box cube.
The invention researches operation behavior detection technology for production flows in complex manufacturing environments. Since various operating machines, transport vehicles and other auxiliary equipment and tools are often present in a production workshop, they easily occlude each other. In addition, the operations of different procedures are similar to one another, while the same procedure performed by different workers can differ considerably. Factors such as illumination change and viewpoint change further make operation behavior recognition in a manufacturing environment challenging. On the basis of existing work, the invention proposes a dual-stream structure enhanced detector that studies and analyzes production actions in depth from several key problems such as CPS workflow recognition and analysis, action recognition and target detection, solving the problem of detecting workers' operation behaviors. The dual-stream structure enhanced detector combines a characterization detector and a motion detector in a dual-stream configuration to predict the moving targets present on a frame. Meanwhile, in order to improve the preset 3D anchor frames of different scales, an anchor frame improvement submodule with a feature alignment property is introduced into the detector, and an adaptive anchor frame cube is output according to the anchor frames detected on each frame. In order to improve the capture of moving targets, a hierarchical aggregation strategy is applied in the detector to improve the discriminability of the model's intermediate feature maps. In addition, layer regularization is used to reduce internal covariate shift between the detector's internal layers, making the whole training process more efficient and stable. Finally, based on the extracted salient features, a spatio-temporal action tube generation branch completes the classification and localization regression of production operation behaviors.
As shown in fig. 1, in a preferred embodiment of the present invention, a workflow detection method based on a dual-stream structure enhanced detector is provided, which includes:
S1, aiming at each workflow video in the workflow video data set, dividing the workflow video into groups of K frames to form a series of video segments; for each video segment, computing an optical flow image using the TVL1 algorithm as a first input image, and sampling one video frame from the video segment as a second input image; forming a training sample from a group consisting of the first input image, the second input image, and the positions and class labels of manually annotated targets in the images. A training data set is constructed from a series of training samples.
It should be noted that the number of video frames included in each video segment may be optimized according to practical application, and K is preferably 8.
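As an illustration of S1, the following is a minimal sketch of preparing the two detector inputs for one video segment, assuming opencv-contrib-python (which provides the TV-L1 optical flow implementation) and a list of K = 8 consecutive BGR frames; the function name and the choice of stacking the per-pair flow fields on the channel axis are illustrative assumptions.

```python
import cv2
import numpy as np

def make_segment_inputs(frames, sample_index=0):
    """Build the first (optical flow) and second (RGB frame) detector inputs."""
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()   # TVL1 algorithm
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    # One (H, W, 2) flow field per consecutive frame pair, stacked on channels.
    flows = [tvl1.calc(grays[t], grays[t + 1], None)
             for t in range(len(grays) - 1)]
    first_input = np.concatenate(flows, axis=-1)      # optical flow image
    second_input = frames[sample_index]               # one sampled video frame
    return first_input, second_input
```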
And S2, training the dual-stream structure enhanced detector by using consecutive training samples in the training data set, so that the dual-stream structure enhanced detector can detect target positions and classes from images.
As shown in fig. 2, the dual-stream structure enhanced detector includes two parallel detectors, namely a motion detector and a characterization detector, as well as a classification layer and a regression layer.
The motion detector and the characterization detector have the same network structure and both comprise an anchor frame improvement submodule and a backbone network, wherein the backbone network consists of a feature extraction network, a first convergence layer, a second convergence layer and a feature stacking layer. However, although the network structure is the same, the detector input of the motion detector is the first input image, and the detector input of the characterization detector is the second input image.
The backbone network structure is the same for the motion detector and the characterization detector. The specific implementation of the feature extraction network, the first convergence layer, the second convergence layer and the feature stacking layer of this shared backbone is described in detail below.
The feature extraction network is designed on the basis of the VGG-16 model: the first five convolution blocks of VGG-16 are retained, and all its fully connected layers are replaced by 6 new convolution blocks. The feature extraction network therefore consists of 11 cascaded convolution blocks, where the first 5 are the first 5 convolution blocks of the VGG-16 model itself and the last 6 are new. The detailed parameters of the 6 newly designed convolution blocks are as follows (a code sketch follows the list):
1) The 6th convolution block contains only one convolution layer; its output feature map size is 38 × 38 × 512, the activation function is ReLU, and the number of anchor frames per pixel position is set to 4;
2) the 7th convolution block contains only one convolution layer; its output feature map size is 19 × 19 × 1024, the activation function is ReLU, and the number of anchor frames per pixel position is set to 6;
3) the 8th convolution block contains two convolution layers; its output feature map size is 10 × 10 × 512, the activation function is ReLU, and the number of anchor frames per pixel position is set to 6;
4) the 9th convolution block contains two convolution layers; its output feature map size is 5 × 5 × 256, the activation function is ReLU, and the number of anchor frames per pixel position is set to 6;
5) the 10th convolution block contains two convolution layers; its output feature map size is 3 × 3 × 256, the activation function is ReLU, and the number of anchor frames per pixel position is set to 4;
6) the 11th convolution block contains two convolution layers; its output feature map size is 1 × 1 × 256, the activation function is ReLU, and the number of anchor frames per pixel position is set to 4.
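A minimal PyTorch sketch of this backbone is given below. The patent specifies the output feature map sizes and ReLU activations; the kernel sizes and strides are not stated, so the standard SSD extra-layer choices that produce the listed 10 × 10, 5 × 5, 3 × 3 and 1 × 1 maps are assumed, and blocks 6 and 7 are simplified to single convolutions.

```python
import torch.nn as nn
import torchvision

def two_layer_block(cin, cmid, cout, stride, pad):
    # Two-layer block (1x1 channel reduction + 3x3 conv), each followed by ReLU.
    return nn.Sequential(
        nn.Conv2d(cin, cmid, 1), nn.ReLU(inplace=True),
        nn.Conv2d(cmid, cout, 3, stride=stride, padding=pad), nn.ReLU(inplace=True))

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(weights=None).features
        self.vgg_blocks1_5 = vgg[:30]  # first 5 VGG-16 conv blocks (up to conv5_3)
        self.block6 = nn.Sequential(   # one conv layer, 512 output channels
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True))
        self.block7 = nn.Sequential(   # one conv layer, 1024 output channels
            nn.Conv2d(512, 1024, 3, padding=1), nn.ReLU(inplace=True))
        self.block8 = two_layer_block(1024, 256, 512, stride=2, pad=1)  # 19 -> 10
        self.block9 = two_layer_block(512, 128, 256, stride=2, pad=1)   # 10 -> 5
        self.block10 = two_layer_block(256, 128, 256, stride=1, pad=0)  # 5 -> 3
        self.block11 = two_layer_block(256, 128, 256, stride=1, pad=0)  # 3 -> 1

    def forward(self, x):
        feats = {}
        x = self.vgg_blocks1_5(x)
        for name in ('block6', 'block7', 'block8', 'block9', 'block10', 'block11'):
            x = getattr(self, name)(x)
            feats[name] = x            # keep each block output for later layers
        return feats
```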
In the first convergence layer (the Cat_l1 layer), the convolution features of the 3rd layer in the 4th convolution block, of the 3rd layer in the 5th convolution block, and of the 7th convolution block are extracted from the feature extraction network; the three extracted features are each subjected to layer regularization transformation and then aggregated, and finally a full-connection operation forms the low-level edge features that serve as the output features of the first convergence layer.
In the second convergence layer (the Cat_l2 layer), the convolution features of the 7th convolution block and of the 2nd layer in the 8th convolution block are extracted from the feature extraction network; the two extracted features are each subjected to layer regularization transformation and then aggregated, and finally a full-connection operation forms the high-level semantic features that serve as the output features of the second convergence layer (both layers are sketched in code below).
In the feature stacking layer, the output features of the first convergence layer, the second convergence layer, and the 8th, 9th, 10th and 11th convolution blocks of the feature extraction network are stacked to form the stacked features that serve as the output of the motion detector or characterization detector to which the backbone network belongs.
Note that in the feature stack layer, preferably 6 output features are stacked in the channel dimension to form a stacked feature.
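A minimal sketch of the two convergence layers follows, with GroupNorm over a single group (which normalizes each sample over all channels and positions, a common LayerNorm stand-in for convolutional maps) playing the role of the layer regularization transformation, bilinear resizing assumed for bringing the differently sized maps together, and a 1 × 1 convolution standing in for the full-connection operation; the channel counts and target sizes follow the feature maps named above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvergenceLayer(nn.Module):
    """Layer-regularize each input feature map, aggregate, then fuse."""
    def __init__(self, in_channels, out_channels, size):
        super().__init__()
        self.size = size
        # GroupNorm(1, C) normalizes over (C, H, W): layer regularization.
        self.norms = nn.ModuleList(nn.GroupNorm(1, c) for c in in_channels)
        self.fuse = nn.Conv2d(sum(in_channels), out_channels, 1)

    def forward(self, feats):
        resized = [F.interpolate(norm(f), size=self.size, mode='bilinear',
                                 align_corners=False)
                   for norm, f in zip(self.norms, feats)]
        return self.fuse(torch.cat(resized, dim=1))

# Cat_l1 aggregates conv4_3, conv5_3 and conv7; Cat_l2 aggregates conv7 and conv8_2.
cat_l1 = ConvergenceLayer([512, 512, 1024], out_channels=512, size=(38, 38))
cat_l2 = ConvergenceLayer([1024, 512], out_channels=1024, size=(19, 19))
```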
After the features output by the motion detector and the features output by the characterization detector are concatenated, a layer regularization transformation is applied to the concatenated features; the transformed features are then input into the classification layer and the regression layer respectively, with the classification layer outputting the class label of the anchor frame and the regression layer outputting the position coordinates of the anchor frame.
It should be noted that the classification layer and the regression layer can be implemented by two different fully connected layers.
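A minimal sketch of this dual-stream fusion and the two heads, assuming the stacked features of each detector have been flattened per anchor into vectors of feat_dim; all dimension names are illustrative.

```python
import torch
import torch.nn as nn

class DualStreamHead(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.norm = nn.LayerNorm(2 * feat_dim)               # layer regularization
        self.cls = nn.Linear(2 * feat_dim, num_classes + 1)  # classes + background
        self.reg = nn.Linear(2 * feat_dim, 4)                # (x, y, w, h)

    def forward(self, motion_feat, char_feat):
        # Concatenate the two streams, regularize, then classify and regress.
        fused = self.norm(torch.cat([motion_feat, char_feat], dim=-1))
        return self.cls(fused), self.reg(fused)
```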
Some existing target detection methods directly perform classification and position regression based on predefined anchor frames; however, a preset anchor frame cube specifies a fixed scale and size, has difficulty handling operation behaviors whose duration varies greatly, lacks flexibility, and often yields inaccurate detection results when short, fast actions occur. The anchor frame improvement submodule added in the invention is therefore used to generate corrected anchor frames and assist the training of the dual-stream structure enhanced detector. Accordingly, the anchor frame improvement submodule is enabled only in the model training phase and can be skipped directly in the model inference phase. Its specific form can be adjusted according to the actual situation.
In order to reduce internal covariate shift between convolutional layers and make the training process faster and more stable, the detector of the present invention employs a layer regularization transformation: a layer regularization transformation with skip connections is introduced before and after the specific convergence layers (which splice feature maps of different levels to detect targets of different sizes), and the layer regularization transformation is also applied before the final classification and regression layers.
As a preferred implementation form of the embodiment of the invention, detection precision is improved through an alignment convolution operation in the anchor frame improvement submodule. Suppose (x, y, w, h) represents a candidate box, where (x, y) represents the top-left corner coordinate and (w, h) represents the width and height of the box. For the k anchor frames, the alignment convolution is performed in the following two steps:
(a) First, in the feature map $f_{vgg}$, the region within the anchor frame is divided equally into p × p small blocks based on the position coordinates of the corrected anchor frame, where p is the size of the convolution kernel. The center position coordinate of the block in row i, column j can be calculated as:
$$\left(x + \frac{(2j+1)\,w}{2p},\; y + \frac{(2i+1)\,h}{2p}\right), \qquad i, j = 0, 1, \dots, p-1$$
(b) Second, the feature value at the center of each small block is multiplied by the corresponding convolution kernel parameter, and a weighted transformation generates the feature values aligned with the corrected anchor frame, forming a 512-dimensional intermediate vector.
The 512-dimensional intermediate vector is input into two parallel fully connected layers to perform 2k confidence score predictions and 4k position coordinate (x, y, w, h) regressions, finally obtaining the corrected candidate region box; a sketch of step (a) follows.
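A minimal sketch of step (a), computing the p × p block centers inside a corrected anchor box (x, y, w, h) with top-left corner (x, y); step (b) would then sample the feature map at these centers (e.g. bilinearly) and weight the sampled values with the corresponding kernel parameters. The zero-based indexing convention matches the formula reconstructed above.

```python
import torch

def block_centers(x, y, w, h, p=3):
    """Centers of the p x p blocks that evenly partition the anchor box."""
    i = torch.arange(p, dtype=torch.float32).view(p, 1)  # row index
    j = torch.arange(p, dtype=torch.float32).view(1, p)  # column index
    cx = x + (2 * j + 1) * w / (2 * p)                   # shape (1, p)
    cy = y + (2 * i + 1) * h / (2 * p)                   # shape (p, 1)
    return cx.expand(p, p), cy.expand(p, p)              # per-block (cx, cy)
```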
In the present invention, as shown in fig. 3, the anchor frame improvement submodule can be implemented as follows:
First, the feature map f_vgg output by the fifth convolution block of the feature extraction network is obtained; then k = 9 anchor boxes are set at each anchor point of f_vgg, and each anchor box is convolved with a 3 × 3 sliding window to form a 512-dimensional vector. That is, on the third-layer feature map of the fifth convolution block of the dual-stream structure, a small network slides over the convolution layer: a 3 × 3 window fully connects to the input feature map, and each sliding window position is mapped to a 512-dimensional intermediate vector.
Then, the 512-dimensional vector is input into two parallel fully connected layers: one (the cls layer) outputs scores judging whether each anchor box belongs to the foreground or the background (2k scores), and the other (the reg layer) outputs the position coordinates of each anchor box (4k values), thereby obtaining the position coordinates of the corrected anchor boxes; a code sketch of this submodule follows.
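A minimal sketch of the submodule described above: a 3 × 3 convolution slides over f_vgg to produce the 512-dimensional intermediate vector at each location, and two parallel heads map it to 2k foreground/background scores and 4k coordinate values. As in the region proposal network this resembles, the per-location fully connected layers are realized as 1 × 1 convolutions; that equivalence, and the channel counts, are implementation assumptions.

```python
import torch.nn as nn

class AnchorRefinementSubmodule(nn.Module):
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.slide = nn.Conv2d(in_channels, 512, 3, padding=1)  # 3x3 sliding window
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(512, 2 * k, 1)  # 2k foreground/background scores
        self.reg = nn.Conv2d(512, 4 * k, 1)  # 4k values: (x, y, w, h) per anchor

    def forward(self, f_vgg):
        mid = self.relu(self.slide(f_vgg))   # 512-d intermediate vector per point
        return self.cls(mid), self.reg(mid)
```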
As a preferred implementation form of the embodiment of the present invention, the size parameter p of the convolution kernel may be set to 3.
S3, extracting a workflow video to be detected containing a complete workflow from a production operation video, dividing the workflow video to be detected into groups of K frames to form a series of video segments, computing an optical flow image for each video segment using the TVL1 algorithm as a first input image, and sampling one video frame from the video segment as a second input image; inputting the first input image and the second input image into the trained dual-stream structure enhanced detector, and outputting the anchor frame class label and the position coordinates of the detection target through the classification layer and the regression layer respectively, forming the detection result of the video segment.
And S4, based on the detection results of all video segments in the workflow video to be detected, generating the classification and temporal localization regression results of the production operation behaviors in the workflow video to be detected by using an action tube generation algorithm (a sketch of one possible tube-linking scheme follows).
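The patent does not spell out the action tube generation algorithm of S4, so the following greedy linking sketch is only an assumed baseline: per-frame detections of one class are chained whenever their spatial IoU with the tail of an open tube is high enough. All thresholds and the data layout are illustrative.

```python
def iou(a, b):
    # a, b: (x, y, w, h) boxes with top-left corner (x, y).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2 = min(a[0] + a[2], b[0] + b[2])
    iy2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter + 1e-9)

def link_tubes(frame_dets, iou_thr=0.5):
    """frame_dets: per-frame lists of (box, score) for a single class."""
    tubes = []
    for t, dets in enumerate(frame_dets):
        for box, score in dets:
            # Extend the best tube that ended on the previous frame, if any.
            open_tubes = [tb for tb in tubes if tb['end'] == t - 1
                          and iou(tb['boxes'][-1], box) >= iou_thr]
            if open_tubes:
                best = max(open_tubes, key=lambda tb: iou(tb['boxes'][-1], box))
                best['boxes'].append(box)
                best['scores'].append(score)
                best['end'] = t
            else:
                tubes.append({'boxes': [box], 'scores': [score], 'end': t})
    return tubes
```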
It should be noted that the production operation video is sourced in real time from the video monitoring equipment in the factory production scene, so that the whole production operation process is detected in real time by the dual-stream structure enhanced detector, realizing detection of workers' production operation behaviors. Of course, the production operation video may also come from an offline video; the invention is not limited in this respect.
The training of the dual-stream structure enhanced detector can follow model training in the prior art so as to meet the final prediction accuracy. The training data set can be divided into a training set and a test set according to actual needs. As a preferred implementation form of the embodiment of the invention, 90% of the samples in the training data set are selected as the training set and the rest as the verification set; the dual-stream structure enhanced detector is trained and tested, and continuously tuned and optimized according to the resulting metrics.
When the dual-stream structure enhanced detector is trained, the total loss function must be set reasonably and must account for three losses: the first part is the final classification loss of the model, the second part is the final position coordinate regression loss of the model, and the third part is the auxiliary loss introduced by the anchor frame improvement submodule.
As a preferred implementation form of the embodiment of the present invention, the total loss function L of the dual-stream structure enhanced detector during training may be set as:

$$L = \frac{1}{N}\left(L_{conf} + L_{reg}\right) + Loss_{ARS}$$

wherein: $N$ represents the total number of positive samples, and $i$ and $j$ are respectively the category index and the counting index in the training sample; $x_{ij}$ is a binary variable that is 1 when the anchor frame sample is a positive sample and 0 otherwise; $L_{conf}$ represents the classification loss function, formulated as:

$$L_{conf} = -\sum_{i \in \Phi} x_{ij}\, \log \hat{c}_i^{\,y} - \sum_{i \in \bar{\Phi}} \log \hat{c}_i^{\,0}$$

wherein: $\Phi$ and $\bar{\Phi}$ respectively represent the anchor frame cube positive sample set and negative sample set; $\hat{c}_i^{\,y}$ represents the confidence score that the predicted anchor frame cube belongs to label $y$; $\hat{c}_i^{\,0}$ represents the confidence score that the predicted anchor frame cube belongs to the background; $L_{reg}$ represents the regression loss function, formulated as:

$$L_{reg} = \frac{1}{T} \sum_{t=1}^{T} \sum_{q \in \{x, y, w, h\}} \mathrm{SmoothL1}\left(\hat{l}_t^{\,q} - g_t^{\,q}\right)$$

wherein: $T$ represents the total frame number of the workflow video, $(x, y)$ is the center of each anchor frame in the action micro-tube, and $w$ and $h$ are respectively the width and height of the anchor frame; $\hat{l}_t^{\,q}$ represents the $q$ coordinate of the anchor frame on frame $f_t$ after regression; $g_t^{\,q}$ represents the annotated ground-truth anchor frame coordinate; $Loss_{ARS}$, the loss function of the anchor frame improvement submodule, is:

$$Loss_{ARS} = \frac{1}{N_{cls}} \sum_{i} L_{cls}\left(p_i, p_i^{*}\right) + \lambda \frac{1}{N_{reg}} \sum_{i} p_i^{*}\, L_{reg}\left(o_i, o_i^{*}\right)$$

wherein: $i$ denotes the index of the anchor frame in each training sample batch; $p_i$ indicates the probability that anchor frame $i$ contains an action target; $p_i^{*}$ indicates the binary label, with $p_i^{*} = 1$ when the anchor frame is a positive sample and $p_i^{*} = 0$ otherwise; $o_i$ and $o_i^{*}$ are respectively the predicted coordinate offset of the anchor frame and the corresponding target offset; $\lambda$ is a balance coefficient between the two terms; $L_{cls}$ represents the cross entropy loss function; $L_{reg}$ represents the SmoothL1 loss function; $N_{cls}$ represents the total number of classification samples and $N_{reg}$ the total number of regression samples; a sample participates in position regression if and only if $p_i^{*} = 1$.
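A minimal sketch of assembling the three loss parts during training, matching the reconstructed formulas above; the weighting lam between the two ARS terms and the exact tensor layout are assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(cls_logits, labels, pos_mask, pred_boxes, gt_boxes,
               ars_logits, ars_labels, ars_offsets, ars_targets, lam=1.0):
    """cls_logits: (A, C+1); labels: (A,); pos_mask: (A,) bool; boxes: (A, 4)."""
    n_pos = pos_mask.sum().clamp(min=1).float()
    # L_conf: cross entropy over all anchors (positives plus background negatives).
    l_conf = F.cross_entropy(cls_logits, labels, reduction='sum') / n_pos
    # L_reg: SmoothL1 over (x, y, w, h) of the positive anchors in the micro-tube.
    l_reg = F.smooth_l1_loss(pred_boxes[pos_mask], gt_boxes[pos_mask],
                             reduction='sum') / n_pos
    # Loss_ARS: auxiliary loss of the anchor frame improvement submodule; only
    # samples with p_i* = 1 participate in the offset regression term.
    ars_pos = ars_labels > 0
    l_cls = F.binary_cross_entropy_with_logits(ars_logits, ars_labels.float())
    if ars_pos.any():
        l_off = F.smooth_l1_loss(ars_offsets[ars_pos], ars_targets[ars_pos])
    else:
        l_off = ars_offsets.sum() * 0.0   # no positive ARS samples in this batch
    return l_conf + l_reg + l_cls + lam * l_off
```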
The dual-stream structure enhanced detector for solving the workflow recognition problem mainly comprises the following key functional components: a dual-stream feature extraction module, an anchor frame improvement submodule, and a multi-level feature map aggregation strategy combined with layer regularization transformation.
The dual-stream feature extraction module mainly extracts motion features at two levels, a spatial stream (single-frame images) and an optical flow stream (optical flow images); a characterization detector and a motion detector are designed respectively and integrated in one framework to detect moving targets appearing on video frames. The design of this module draws on the SSD model and is responsible for detecting moving targets of different scales on specific feature maps. Meanwhile, it inherits the advantages of both one-stage and two-stage detection frameworks: it can obtain higher accuracy than two-stage detection methods without constructing very deep convolution layers, while keeping the high efficiency of one-stage methods.
The anchor frame improvement submodule addresses the large duration variation of different action behaviors in production operations: when candidate proposal segments of different scales are predicted from the same feature representation, a misalignment exists between the temporal range of the feature and the span of the anchor frame, so the relevant information cannot be captured. The anchor frame improvement submodule is therefore able to correct the preset series of fixed anchor frame cubes of different sizes and proportions and output more accurate target detection boxes for the subsequent linking of production action tubes.
The multi-level feature map aggregation strategy and the layer regularization transformation are respectively used to avoid losing the target subject during detection and to reduce internal covariate shift between convolution layers, making the training process faster and more stable. Because directly stacking multi-level feature maps cannot fully exploit the complementarity between low-level edge information and high-level semantic information, the multi-level feature map aggregation strategy enables better detection of objects with large scale changes and is particularly effective for small-object detection. The layer regularization transformation sharply reduces the covariate shift inside the convolutional layers, so the detector can be trained at a higher learning rate without excessive attention to initial parameter settings.
The dual-stream structure enhanced detector proposed by the invention solves several problems of workflow detectors used in existing complex manufacturing environments. Improving detector performance is often neglected; when rapid motion occurs in a complex manufacturing environment, a weak detector often loses the detection target over consecutive frames, which directly hinders subsequent action tube generation and finally leads to unsatisfactory detection results. In addition, on video clips a general-purpose detector cannot model the change process of a motion, because it defines the region of interest with only a fixed 3D anchor cube and cannot adapt well to the rapid changes of a moving target's position and size over time.
The above-described embodiments are merely preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims (10)

1. A workflow detection method based on a dual-stream structure enhanced detector, characterized by comprising the following steps:
S1, aiming at each workflow video in the workflow video data set, dividing the workflow video into groups of K frames to form a series of video segments; for each video segment, computing an optical flow image using the TVL1 algorithm as a first input image, and sampling one video frame from the video segment as a second input image; forming a training sample from a group consisting of the first input image, the second input image, and the positions and class labels of manually annotated targets in the images; forming a training data set from a series of training samples;
S2, training a dual-stream structure enhanced detector by using consecutive training samples in the training data set, so that the detector can detect target positions and classes from images;
the dual-stream structure enhanced detector comprises two parallel detectors, namely a motion detector and a characterization detector, as well as a classification layer and a regression layer;
the motion detector and the characterization detector have the same network structure, each comprising an anchor frame improvement submodule and a backbone network consisting of a feature extraction network, a first convergence layer, a second convergence layer and a feature stacking layer, but their detector inputs differ: the detector input of the motion detector is the first input image, and the detector input of the characterization detector is the second input image;
the anchor frame improvement submodule is used for generating corrected anchor frames and assisting the training of the dual-stream structure enhanced detector;
in the backbone network, the feature extraction network is composed of 11 cascaded convolution blocks, wherein the first 5 convolution blocks are the first 5 convolution blocks of the VGG-16 model and the last 6 convolution blocks are new convolution blocks; the 6th convolution block contains only one convolution layer, its output feature map size is 38 × 38 × 512, the activation function is ReLU, and the number of anchor frames per pixel position is set to 4; the 7th convolution block contains only one convolution layer, its output feature map size is 19 × 19 × 1024, the activation function is ReLU, and the number of anchor frames per pixel position is set to 6; the 8th convolution block contains two convolution layers, its output feature map size is 10 × 10 × 512, the activation function is ReLU, and the number of anchor frames per pixel position is set to 6; the 9th convolution block contains two convolution layers, its output feature map size is 5 × 5 × 256, the activation function is ReLU, and the number of anchor frames per pixel position is set to 6; the 10th convolution block contains two convolution layers, its output feature map size is 3 × 3 × 256, the activation function is ReLU, and the number of anchor frames per pixel position is set to 4; the 11th convolution block contains two convolution layers, its output feature map size is 1 × 1 × 256, the activation function is ReLU, and the number of anchor frames per pixel position is set to 4; in the first convergence layer, the convolution features of the 3rd layer in the 4th convolution block, of the 3rd layer in the 5th convolution block, and of the 7th convolution block are extracted from the feature extraction network, each subjected to layer regularization transformation and then aggregated, and finally a full-connection operation forms the low-level edge features serving as the output features of the first convergence layer; in the second convergence layer, the convolution features of the 7th convolution block and of the 2nd layer in the 8th convolution block are extracted from the feature extraction network, each subjected to layer regularization transformation and then aggregated, and finally a full-connection operation forms the high-level semantic features serving as the output features of the second convergence layer; in the feature stacking layer, the output features of the first convergence layer, the second convergence layer, and the 8th, 9th, 10th and 11th convolution blocks of the feature extraction network are stacked to form the stacked features serving as the output of the motion detector or characterization detector to which the backbone network belongs;
the features output by the motion detector and the features output by the characterization detector are concatenated and then subjected to layer regularization transformation; the transformed features are input into the classification layer and the regression layer respectively, the classification layer outputting the class label of the anchor frame and the regression layer outputting the position coordinates of the anchor frame;
S3, extracting a workflow video to be detected containing a complete workflow from a production operation video, dividing the workflow video to be detected into groups of K frames to form a series of video segments, computing an optical flow image for each video segment using the TVL1 algorithm as a first input image, and sampling one video frame from the video segment as a second input image; inputting the first input image and the second input image into the trained dual-stream structure enhanced detector, and outputting the anchor frame class label and the position coordinates of the detection target through the classification layer and the regression layer respectively, forming the detection result of the video segment;
and S4, based on the detection results of all video segments in the workflow video to be detected, generating the classification and temporal localization regression results of the production operation behaviors in the workflow video to be detected by using an action tube generation algorithm.
2. The workflow detection method based on the dual-stream structure enhanced detector as claimed in claim 1, wherein, in the anchor frame improvement submodule, the feature map f_vgg output by the fifth convolution block of the feature extraction network is first obtained; then k = 9 anchor frames are set at each position point of the feature map f_vgg, and each anchor frame is convolved with a 3 × 3 window to form a 512-dimensional vector; the 512-dimensional vector is input into two parallel fully connected layers, one of which outputs a score judging whether the anchor frame belongs to the foreground or the background, and the other outputs the position coordinates of the anchor frame, thereby obtaining the corrected anchor frame.
3. The dual-stream structure-enhanced detector-based workflow detection method of claim 2, wherein the size parameter p of the convolution kernel is 3.
4. The workflow detection method based on the dual-stream structure enhanced detector as claimed in claim 1, wherein the total loss function of the dual-stream structure enhanced detector is set as:

$$L = \frac{1}{N}\left(L_{conf} + L_{reg}\right) + Loss_{ARS}$$

wherein: $N$ represents the total number of positive samples, and $i$ and $j$ are respectively the category index and the counting index in the training sample; $x_{ij}$ is a binary variable that is 1 when the anchor frame sample is a positive sample and 0 otherwise; $L_{conf}$ represents the classification loss function, formulated as:

$$L_{conf} = -\sum_{i \in \Phi} x_{ij}\, \log \hat{c}_i^{\,y} - \sum_{i \in \bar{\Phi}} \log \hat{c}_i^{\,0}$$

wherein: $\Phi$ and $\bar{\Phi}$ respectively represent the anchor frame cube positive sample set and negative sample set; $\hat{c}_i^{\,y}$ represents the confidence score that the predicted anchor frame cube belongs to label $y$; $\hat{c}_i^{\,0}$ represents the confidence score that the predicted anchor frame cube belongs to the background; $L_{reg}$ represents the regression loss function, formulated as:

$$L_{reg} = \frac{1}{T} \sum_{t=1}^{T} \sum_{q \in \{x, y, w, h\}} \mathrm{SmoothL1}\left(\hat{l}_t^{\,q} - g_t^{\,q}\right)$$

wherein: $T$ represents the total frame number of the workflow video, $(x, y)$ is the center of each anchor frame in the action micro-tube, and $w$ and $h$ are respectively the width and height of the anchor frame; $\hat{l}_t^{\,q}$ represents the $q$ coordinate of the anchor frame on frame $f_t$ after regression; $g_t^{\,q}$ represents the annotated ground-truth anchor frame coordinate; $Loss_{ARS}$, the loss function of the anchor frame improvement submodule, is:

$$Loss_{ARS} = \frac{1}{N_{cls}} \sum_{i} L_{cls}\left(p_i, p_i^{*}\right) + \lambda \frac{1}{N_{reg}} \sum_{i} p_i^{*}\, L_{reg}\left(o_i, o_i^{*}\right)$$

wherein: $i$ denotes the index of the anchor frame in each training sample batch; $p_i$ indicates the probability that anchor frame $i$ contains an action target; $p_i^{*}$ indicates the binary label, with $p_i^{*} = 1$ when the anchor frame is a positive sample and $p_i^{*} = 0$ otherwise; $o_i$ and $o_i^{*}$ are respectively the predicted coordinate offset of the anchor frame and the corresponding target offset; $\lambda$ is a balance coefficient between the two terms; $L_{cls}$ represents the cross entropy loss function; $L_{reg}$ represents the SmoothL1 loss function; $N_{cls}$ represents the total number of classification samples and $N_{reg}$ the total number of regression samples; a sample participates in position regression if and only if $p_i^{*} = 1$.
5. The workflow detection method based on the dual-stream structure enhanced detector as claimed in claim 1, wherein 90% of the samples in the training data set are selected as the training set and the rest as the verification set; the dual-stream structure enhanced detector is trained and tested, and is continuously tuned and optimized according to the resulting metrics.
6. The workflow detection method based on the dual-stream structure enhanced detector as claimed in claim 1, wherein the production operation video is sourced in real time from video monitoring equipment in a factory production scene, and the whole production operation process is detected in real time by the dual-stream structure enhanced detector, realizing detection of workers' production operation behaviors.
7. The workflow detection method based on the dual-stream structure enhanced detector as recited in claim 1, wherein the position coordinates of the anchor frame are expressed as (x, y, w, h), wherein (x, y) represents the upper left corner coordinates of the anchor frame, and (w, h) represents the width and height of the anchor frame.
8. The dual-stream structure enhanced detector-based workflow detection method of claim 1, wherein, in the feature stacking layer, the 6 output features are stacked in the channel dimension to form the stacked features.
9. The dual-stream architecture enhanced detector-based workflow detection method of claim 1, wherein the anchor frame improvement sub-module is enabled only in a model training phase and is skipped in a model reasoning phase.
10. The dual-stream structure enhanced detector-based workflow detection method of claim 1, wherein each of said video segments comprises K = 8 video frames.
CN202210574109.0A 2022-05-24 2022-05-24 Workflow detection method based on dual-stream structure enhanced detector Pending CN114943452A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210574109.0A CN114943452A (en) Workflow detection method based on dual-stream structure enhanced detector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210574109.0A CN114943452A (en) Workflow detection method based on dual-stream structure enhanced detector

Publications (1)

Publication Number Publication Date
CN114943452A true CN114943452A (en) 2022-08-26

Family

ID=82909785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210574109.0A Pending CN114943452A (en) Workflow detection method based on dual-stream structure enhanced detector

Country Status (1)

Country Link
CN (1) CN114943452A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543513A (en) * 2018-10-11 2019-03-29 平安科技(深圳)有限公司 Method, apparatus, equipment and the storage medium that intelligent monitoring is handled in real time
CN110909658A (en) * 2019-11-19 2020-03-24 北京工商大学 Method for recognizing human body behaviors in video based on double-current convolutional network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543513A (en) * 2018-10-11 2019-03-29 平安科技(深圳)有限公司 Method, apparatus, equipment and the storage medium that intelligent monitoring is handled in real time
CN110909658A (en) * 2019-11-19 2020-03-24 北京工商大学 Method for recognizing human body behaviors in video based on double-current convolutional network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MIN ZHANG et al.: "Action detection with two-stream", The Visual Computer, vol. 39, 5 February 2022 (2022-02-05), pages 1193-1204 *

Similar Documents

Publication Publication Date Title
Liu et al. Exploiting unlabeled data in cnns by self-supervised learning to rank
CN112884064B (en) Target detection and identification method based on neural network
CN115171165A (en) Pedestrian re-identification method and device with global features and step-type local features fused
CN106980895A (en) Convolutional neural networks Forecasting Methodology based on rotary area
CN110210335B (en) Training method, system and device for pedestrian re-recognition learning model
Bhattacharyya et al. Deformable PV-RCNN: Improving 3D object detection with learned deformations
CN109800712B (en) Vehicle detection counting method and device based on deep convolutional neural network
Li et al. Coda: Counting objects via scale-aware adversarial density adaption
CN114821390B (en) Method and system for tracking twin network target based on attention and relation detection
Cepni et al. Vehicle detection using different deep learning algorithms from image sequence
CN114913150A (en) Intelligent identification method for concrete dam defect time sequence image
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN116503399B (en) Insulator pollution flashover detection method based on YOLO-AFPS
CN114155474A (en) Damage identification technology based on video semantic segmentation algorithm
Avola et al. A shape comparison reinforcement method based on feature extractors and f1-score
Yang et al. BANDT: A border-aware network with deformable transformers for visual tracking
Niu et al. Boundary-aware RGBD salient object detection with cross-modal feature sampling
CN110070023B (en) Self-supervision learning method and device based on motion sequential regression
Gopal et al. Tiny object detection: Comparative study using single stage CNN object detectors
Liu et al. Find small objects in UAV images by feature mining and attention
CN114492755A (en) Target detection model compression method based on knowledge distillation
Kalva et al. Smart Traffic monitoring system using YOLO and deep learning techniques
CN116977859A (en) Weak supervision target detection method based on multi-scale image cutting and instance difficulty
CN117237751A (en) Training method, recognition method, system and equipment for grabbing detection model
Liu et al. Siamese network with bidirectional feature pyramid for small target tracking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination