CN111104855B - Workflow identification method based on time sequence behavior detection

Info

Publication number: CN111104855B
Authority: CN (China)
Prior art keywords: candidate, network, time, time sequence, video
Legal status: Active (granted)
Application number: CN201911097168.8A
Other languages: Chinese (zh)
Other versions: CN111104855A
Inventors: 胡海洋, 王庆文, 李忠金, 陈洁, 俞佳成, 张力, 余嘉伟, 周美玲, 陈振辉
Current Assignee: Hangzhou Dianzi University
Original Assignee: Hangzhou Dianzi University
Priority date / filing date: 2019-11-11
Application filed by Hangzhou Dianzi University
Priority to CN201911097168.8A
Publication of CN111104855A: 2020-05-05
Application granted; publication of CN111104855B: 2023-09-12

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Abstract

The invention discloses a workflow identification method based on time sequence behavior detection. The invention provides a sparse sampling method for time sequence video, which reduces useless data and accelerates the framework as a whole. Meanwhile, to improve both recognition speed and recognition accuracy, a three-dimensional residual network is used to extract features, ensuring fast and efficient spatio-temporal feature extraction. To avoid missing candidate segments in the time sequence candidate sub-network, the invention replaces NMS with Soft-NMS, thereby preserving the recall rate of the detection results. Through these strategies, the framework provided by the invention is better suited to workflow identification in a complex factory production environment. The method solves the temporal localization of actions in video, makes effective use of the large volume of intelligent surveillance video generated in the factory environment, detects the categories of activities in the video and the time slices in which they occur through a neural network, models the workflow, and thereby further optimizes the whole production flow.

Description

Workflow identification method based on time sequence behavior detection
Technical Field
The invention belongs to computer vision and relates to the application of deep learning to the recognition of factory production operation behaviors; it is used to recognize the type of a production operation and the time slice in which it occurs. In current industrial production, intelligent surveillance generates vast amounts of valuable video data every day. To make full use of these data, a workflow identification method is designed to automatically extract features from large volumes of video and identify both the type of each industrial production operation and the time slice in which it takes place.
Background
With the development of information technology and manufacturing technology, intelligent manufacturing has become an important trend in the field of industrial production, and workflow identification, as a major technical direction of intelligent manufacturing, is evolving rapidly. A workflow is generally regarded as a sequence of independent activities. Traditional workflow identification mainly relies on process mining, i.e., extracting and analysing records of business execution from the system logs generated by a business process information system, so that business processes or production decisions can be adjusted in time.
Thanks to the development of computer vision technology, current workflow identification mainly films the various production activities on a production line with cameras in the workshop, then processes and analyses the video to achieve rapid detection of the industrial process. A factory workshop, however, exhibits obvious lighting changes and heavy occlusion between moving objects, so the workflow identification scene is special compared with common scenes, and traditional recognition methods that rely on target object detection are difficult to apply. Moreover, because surveillance video is real-time video, workflow identification imposes a real-time requirement on recognition speed.
Meanwhile, as factory production places higher demands on workflow identification, different tasks in a workflow often have different execution times, and there is no clear boundary between the start and the end of a task; workflow identification based purely on behavior recognition cannot localize activities in time. Therefore, the present invention shifts the emphasis of workflow identification from behavior recognition to time sequence behavior detection. Unlike workflow identification based on behavior recognition, a workflow identification method based on time sequence behavior detection also locates each activity in time, i.e., it determines the start time and end time of the activity. The key points of this task are the following two: 1. Many methods adopt a framework that classifies candidate segments; for such methods it is important that the candidate segments have high quality, i.e., the number of candidate segments should be reduced while ensuring that the recognition result remains correct. 2. The category information of each time sequence segment, i.e., the category of the behavior, must be obtained accurately.
However, production operation behavior recognition has its own complexity and specificity. Deep learning approaches have achieved tremendous success in image processing, and many classification architectures based on convolutional neural networks have been designed to handle workflow recognition in unprocessed long videos. The invention designs a workflow method based on time sequence behavior detection to detect the category of actions in unprocessed long factory videos and the time slices in which those actions occur.
Disclosure of Invention
The invention discloses a workflow identification method based on time sequence behavior detection. Compared with general video scenes, workflow identification is complex and special because the manufacturing environment in which it takes place involves frequent background lighting changes, severe occlusion between objects, various noise interferences, and workers performing activities for long durations. Because of the complex factory environment, a worker carrying out one production activity for a long period may generate a large number of useless video frames. Aiming at this phenomenon, the invention provides a sparse sampling method for time sequence video, which reduces useless data and accelerates the framework as a whole. Meanwhile, to improve both recognition speed and recognition accuracy, a three-dimensional residual network is used to extract features, ensuring fast and efficient spatio-temporal feature extraction. To avoid missing candidate segments in the time sequence candidate sub-network, the invention replaces NMS with Soft-NMS, thereby preserving the recall rate of the detection results. Through these strategies, the framework provided by the invention is better suited to workflow identification in a complex factory production environment.
The method comprises the following specific steps:
Step (1): process the video to be processed with a sparse sampling strategy, in which consecutive frames of the video are divided into segments and one frame is randomly sampled within each segment, thereby avoiding video redundancy.
Step (2): extract features with a three-dimensional residual network, which mainly serves to reduce training time and model size.
Step (3): obtain candidate activity segments with an anchor mechanism, forming anchor segments.
Step (4): judge through a classification network whether the candidate anchor segments contain actions, and determine the boundaries of the anchor segments through a boundary regression network, thereby obtaining candidate list I.
Step (5): remove the highly overlapping, low-confidence activity segments from candidate list I with the Soft-NMS method to obtain the final candidate list II.
Step (6): through maximum pooling, convert candidate features of arbitrary length into feature I with the fixed dimension 512 × 1 × 4.
Step (7): feed feature I with the fixed dimension into two fully connected branches simultaneously; two consecutive fully connected layers are followed by a softmax classifier that judges the activity category, and the other two consecutive fully connected layers are followed by a regression layer that refines the time period in which the candidate activity occurs.
Step (8): model the workflow according to the obtained action categories and the generated activity segments, thereby further optimizing the whole production flow.
The invention has the following beneficial effects:
The main innovations of the workflow identification method based on time sequence behavior detection provided by the invention are: 1) the input video is processed by a sparse sampling method; 2) features of the input video are extracted by a three-dimensional residual neural network; 3) highly overlapping, low-confidence candidate segments are processed by a Soft-NMS method.
To avoid the redundant frames produced when a certain production activity is performed for a long time, the input video is processed by the sparse sampling method provided by the invention. The three-dimensional residual neural network serves to reduce training time and shrink the model. To suppress highly overlapping, low-confidence candidate segments, the invention uses the Soft-NMS method to improve the quality of the candidate segments.
The method solves the temporal localization of actions in video, makes effective use of the large volume of intelligent surveillance video generated in the factory environment, detects the categories of activities in the video and the time slices in which they occur through a neural network, models the workflow, and thereby further optimizes the whole production flow.
Drawings
Fig. 1 is a schematic diagram of the structure of the three-dimensional residual neural network.
Fig. 2 is a schematic diagram of the anchor mechanism employed in the present invention.
Fig. 3 shows the overall flow of the present invention from input to output.
Detailed Description
The invention is further described below with reference to the drawings and examples.
Related concept definition and symbol description
f_t: the video frame of the video at time t.
a_k: the size of the k-th anchor at a given temporal position.
L_cls: a multi-class softmax loss function, used to determine the class of activity segments in the workflow.
L_reg: an L1 smoothing loss function, used to optimize the relative offset between a candidate segment and the ground truth.
PLIST: a candidate list containing confidence scores.
RLIST: the return list obtained after screening by Soft-NMS.
ROI: a region of interest.
softmax: a multi-class classifier; the probability of each class is p_i = exp(z_i) / Σ_j exp(z_j), where z_i is the score of class i.
as shown in fig. 1-3, a workflow identification method based on time sequence behavior detection specifically comprises the following implementation steps:
step (1), avoiding redundancy generated during long-time operation by a video sparse sampling mode, wherein the specific sampling mode is as follows:
1-1. Decomposing the original video into a sequence of successive video frames { f 1 ,f 2 ,f 3 ,…,f t }。
And 1-2, taking continuous 4 frames as an interval, and randomly reading one frame in one interval at a time, so that the video frames are prevented from being acquired at the same position at each time while the time sequence redundancy is avoided.
And 1-3, taking the obtained continuous random frames as training samples to be input into a three-dimensional residual neutral network.
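As an illustrative, non-limiting sketch of the sampling in step (1), the following Python snippet groups decoded frames into 4-frame intervals and keeps one randomly chosen frame per interval; the use of OpenCV for decoding and the function name are assumptions, not part of the disclosed method.

```python
# Hypothetical sketch of the step (1) sparse sampling strategy.
import random
import cv2

def sparse_sample(video_path, segment_len=4):
    """Split the frame sequence into consecutive 4-frame intervals and
    keep one randomly chosen frame from each interval."""
    cap = cv2.VideoCapture(video_path)
    buffer, sampled = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        buffer.append(frame)
        if len(buffer) == segment_len:
            sampled.append(random.choice(buffer))  # random position inside the interval
            buffer = []
    cap.release()
    return sampled  # consecutive random frames fed to the 3D residual network
```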
Step (2): extract spatio-temporal features with a three-dimensional residual network. To increase speed while keeping the model small, the invention adopts a three-dimensional residual neural network (typically ResNet-18) to extract the spatio-temporal features of the input video frames; to ensure computational efficiency and end-to-end training, the time sequence candidate sub-network and the behavior classification sub-network share these spatio-temporal features (see fig. 1).
2-1. Compress the dimensions of the input video frames to 112 × 112 to maximize GPU utilization.
2-2. Avoid vanishing or exploding gradients through residual blocks, allowing the network depth to be increased.
2-3. Input consecutive RGB video frames of size 3 × 112 × 112 into the three-dimensional residual neural network; the network finally outputs spatio-temporal features of size 512 × L/8 × 7.
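For illustration only, the following sketch uses torchvision's r3d_18 as a stand-in for the three-dimensional residual neural network of step (2); stripping its pooling and classification head exposes a shared feature map with 512 channels and temporal length L/8, which is the role the spatio-temporal feature plays here. The choice of library, the 16-frame clip length, and the 7 × 7 spatial layout of the stand-in are assumptions.

```python
# Sketch: a 3D ResNet-18 backbone used as a shared spatio-temporal feature extractor.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

backbone = r3d_18(weights=None)
feature_extractor = nn.Sequential(            # everything up to (not including) avgpool/fc
    backbone.stem, backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4
)

clip = torch.randn(1, 3, 16, 112, 112)        # batch x RGB x L frames x 112 x 112
features = feature_extractor(clip)
print(features.shape)                         # torch.Size([1, 512, 2, 7, 7]): 512 channels, L/8 time steps
```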
Step (3), an anchor point mechanism is adopted to acquire anchor point fragments with different sizes (see figure 2).
3-1. For the spatio-temporal features, the time sequence candidate sub-network can quickly generate anchor segments of different sizes and judge the probability that the video within an anchor segment is a target or background; it is used to initially generate candidate list I. An anchor segment is expressed as:
anchor = {c_i, l_i}
where c_i denotes the center position of the anchor segment and l_i denotes its length in time.
3-2. Anchor segments are distributed over the spatio-temporal features of temporal length L/8, and k anchors are placed at each temporal position of the features, so that in the time sequence candidate sub-network each temporal position carries a sequence of k anchor segments of different lengths. That is, the increasing sequence of anchor segment lengths at a given temporal position is:
{a_1, a_2, a_3, …, a_k}
where a_k is the k-th anchor at that temporal position.
3-3. If f frames are read per second (fps = f), the temporal lengths covered by these anchor segments at a given position are:
{a_1·8/f, a_2·8/f, a_3·8/f, …, a_k·8/f}
From these anchor segments of different lengths, the boundary regression network can then determine the temporal position of an anchor segment.
Step (4): judge through a classification network whether the anchor segments contain actions, and determine their boundaries through a boundary regression network, thereby obtaining candidate list I. A series of operations are performed on the spatio-temporal features generated in step 2-3, and the generated candidate activity segments serve as the input of the next-stage behavior classification network.
4-1. Add a three-dimensional convolution kernel of size 3 × 3 × 3 to expand the spatio-temporal receptive field.
4-2. Add a three-dimensional max-pooling kernel of size 1 × H/16 × W/16 to generate a feature map that contains only temporal features.
4-3. After adding two 1 × 1 convolution kernels, the final feature map size is 512 × L/8 × 1.
4-4. Candidate list I (candidate list PLIST) is obtained through the boundary regression network and the behavior classification network. The specific loss function is as follows:
L = (1/N_cls)·Σ_i L_cls(p_i, p_i*) + λ·(1/N_reg)·Σ_i p_i*·L_reg(t_i, t_i*)
wherein p_i and p_i* denote the predicted and ground-truth classification of anchor segment i, and t_i and t_i* the predicted and ground-truth boundary offsets; N_cls is the classification normalization value, i.e., the batch size; N_reg is the regression normalization value, i.e., the number of anchor segments; i is the index of an anchor segment in the feature map; λ is a weight used to balance the two losses, and since the cls term and the reg term carry almost equal weight, λ is set to 1.
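The joint objective of step 4-4 can be sketched as follows, assuming a standard cross-entropy term for L_cls and a smooth-L1 term for L_reg as defined above; the variable names and the positive-anchor masking are illustrative assumptions.

```python
# Sketch: classification loss normalised by N_cls plus smooth-L1 regression loss
# normalised by N_reg, balanced by lambda = 1.
import torch
import torch.nn.functional as F

def proposal_loss(cls_logits, cls_labels, reg_pred, reg_target, pos_mask, lam=1.0):
    """cls_logits: (N, 2) action/background scores for N anchor segments,
    reg_pred/reg_target: (N, 2) centre/length offsets, pos_mask: (N,) bool."""
    n_cls = cls_logits.shape[0]                       # N_cls: batch of sampled anchor segments
    n_reg = max(int(pos_mask.sum()), 1)               # N_reg: anchor segments contributing to regression
    l_cls = F.cross_entropy(cls_logits, cls_labels, reduction="sum") / n_cls
    l_reg = F.smooth_l1_loss(reg_pred[pos_mask], reg_target[pos_mask],
                             reduction="sum") / n_reg
    return l_cls + lam * l_reg
```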
Step (5): remove the highly overlapping, low-confidence activity segments from candidate list I (candidate list PLIST) with the Soft-NMS method to obtain the final candidate list II (return list RLIST). The method lowers the confidence of overlapping segments linearly instead of removing them outright, which preserves precision while avoiding, as far as possible, the loss of segments with lower scores. The specific procedure is as follows:
5-1. Select the candidate activity segment M with the highest confidence from the candidate list PLIST, delete it from PLIST and put it into the return list RLIST.
5-2. For each candidate b_i remaining in the candidate list PLIST with confidence score s_i, if the overlap between b_i and M is greater than the threshold, reduce its confidence linearly, i.e.:
s_i ← s_i·(1 − iou(M, b_i))
where iou is the intersection over union, i.e., the ratio of the intersection of M and b_i to their union.
5-3. Repeat steps 5-1 and 5-2 until the candidate list PLIST is empty.
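A minimal sketch of the Soft-NMS screening of step (5) on one-dimensional temporal segments; segments are (start, end, score) tuples, and the overlap and score thresholds shown are illustrative assumptions.

```python
# Sketch: Soft-NMS with linear confidence decay for temporal segments.
def temporal_iou(a, b):
    """Intersection over union of two 1-D temporal segments (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(plist, iou_threshold=0.5, score_threshold=0.001):
    rlist = []                                           # return list RLIST
    plist = list(plist)                                  # candidate list PLIST
    while plist:
        m = max(plist, key=lambda seg: seg[2])           # highest-confidence candidate M
        plist.remove(m)
        rlist.append(m)
        rescored = []
        for start, end, score in plist:
            iou = temporal_iou((start, end), m)
            if iou > iou_threshold:
                score = score * (1.0 - iou)              # s_i <- s_i * (1 - iou(M, b_i))
            if score > score_threshold:
                rescored.append((start, end, score))
        plist = rescored
    return rlist
```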
Step (6): obtain a feature of fixed dimension through maximum pooling. ROI pooling is used to extract the fixed-dimension feature I from the spatio-temporal features, i.e., the input of size 512 × L/8 × 7 is max-pooled over a 1 × 4 grid to obtain the final unified dimension 512 × 1 × 4.
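One possible sketch of the ROI pooling of step (6), assuming the shared feature map carries a 7 × 7 spatial layout that is collapsed before adaptive temporal pooling down to the fixed 512 × 1 × 4 feature I; the exact pooling layout in the disclosed framework may differ.

```python
# Sketch: ROI pooling of a candidate segment to a fixed-size feature I.
import torch
import torch.nn.functional as F

def roi_temporal_pool(features, segment, out_len=4):
    """features: (512, T, 7, 7) spatio-temporal map with T = L/8;
    segment: (start, end) in feature-map units; returns a (512, 1, out_len) tensor."""
    start = int(segment[0])
    end = max(int(segment[1]), start + 1)
    roi = features[:, start:end]                            # (512, t, 7, 7) slice of the candidate
    roi = roi.amax(dim=(2, 3))                               # spatial max pooling -> (512, t)
    roi = F.adaptive_max_pool1d(roi.unsqueeze(0), out_len)   # temporal pooling -> (1, 512, out_len)
    return roi.squeeze(0).unsqueeze(1)                       # (512, 1, out_len) fixed feature I

feat = torch.randn(512, 16, 7, 7)                            # e.g. L = 128 frames -> T = 16
feature_I = roi_temporal_pool(feat, (3, 11))
print(feature_I.shape)                                       # torch.Size([512, 1, 4])
```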
Step (7): feed feature I with the fixed dimension into two fully connected branches simultaneously; two consecutive fully connected layers are followed by a softmax classifier that judges the activity category, and the other two consecutive fully connected layers are followed by a regression layer that refines the time period in which the candidate activity occurs. This is realized as follows:
7-1. Add two fully connected layers.
7-2. Add a boundary regression network for boundary correction, and perform behavior classification through a behavior classification network to obtain the target action category. The specific loss function is as follows:
L = (1/N_cls)·Σ_i L_cls(p_i, p_i*) + λ·(1/N_reg)·Σ_i p_i*·L_reg(t_i, t_i*)
wherein N_cls is the classification normalization value, i.e., the batch size; N_reg is the regression normalization value, i.e., the number of candidate segments; λ is the weight balancing the two losses, still set to 1.
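An illustrative sketch of the two fully connected branches of step (7): one branch ends in a softmax activity classifier, the other in a boundary regression layer. The hidden width and the number of activity classes are assumptions.

```python
# Sketch: parallel classification and boundary-regression heads on feature I.
import torch
import torch.nn as nn

class ActivityHeads(nn.Module):
    def __init__(self, in_dim=512 * 1 * 4, hidden=1024, num_classes=21):
        super().__init__()
        self.cls_head = nn.Sequential(                  # two FC layers + softmax classifier
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )
        self.reg_head = nn.Sequential(                  # two FC layers + boundary regression
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),                       # refined (centre, length) offsets
        )

    def forward(self, feature_i):
        x = feature_i.flatten(1)                        # (N, 512*1*4)
        cls_scores = torch.softmax(self.cls_head(x), dim=1)
        reg_offsets = self.reg_head(x)
        return cls_scores, reg_offsets

heads = ActivityHeads()
scores, offsets = heads(torch.randn(8, 512, 1, 4))
print(scores.shape, offsets.shape)                      # torch.Size([8, 21]) torch.Size([8, 2])
```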
Step (8): determine the relations among tasks according to the target action categories and the generated time slices, so that the tasks in the workflow are identified and the production efficiency of the plant's overall equipment can be analysed.

Claims (1)

1. The workflow identification method based on time sequence behavior detection is characterized by comprising the following steps:
the method comprises the steps of (1) processing a video to be processed by using a sparse sampling strategy, wherein consecutive frames of the video are divided into segments and random sampling is carried out within each segment so as to avoid video redundancy;
step (2), extracting features by using a three-dimensional residual network, reducing training time and reducing the size of the model;
step (3), obtaining candidate activity segments by using an anchor mechanism to form anchor segments;
step (4), judging whether the candidate anchor segments contain actions through a classification network, and determining the boundaries of the anchor segments through a boundary regression network, thereby obtaining candidate list I;
step (5), removing the highly overlapping, low-confidence activity segments from candidate list I by using a Soft-NMS method to obtain the final candidate list II;
step (6), through a maximum pooling method, changing candidate features of arbitrary length into feature I with the fixed dimension 512 × 1 × 4;
step (7), inputting feature I with the fixed dimension into two fully connected branches simultaneously, wherein two consecutive fully connected layers are connected to a softmax classifier for judging the activity category, and the other two consecutive fully connected layers are connected to a regression layer for refining the time period in which the candidate activity occurs;
step (8), modeling the workflow according to the obtained action categories and the generated activity segments, thereby further optimizing the whole production flow;
the relevant concept definitions and notation are as follows:
f_t: the video frame of the video at time t;
a_k: the size of the k-th anchor at a given temporal position;
L_cls: a multi-class softmax loss function, used to determine the class of activity segments in the workflow;
L_reg: an L1 smoothing loss function, used to optimize the relative offset between a candidate segment and the ground truth;
PLIST: a candidate list containing confidence scores;
RLIST: the return list obtained after screening by Soft-NMS;
ROI: a region of interest;
softmax: a multi-class classifier; the probability of each class is p_i = exp(z_i)/Σ_j exp(z_j), where z_i is the score of class i;
the specific sampling mode of the step (1) is as follows:
1-1. decomposing the original video into a sequence of consecutive video frames {f_1, f_2, f_3, …, f_t};
1-2. taking every 4 consecutive frames as an interval and randomly reading one frame from each interval, so that temporal redundancy is avoided while preventing the video frame from being taken at the same position every time;
1-3. feeding the obtained sequence of random frames as training samples into the three-dimensional residual neural network;
the step (2) extracts spatio-temporal features by using a three-dimensional residual network: the spatio-temporal features of the input video frames are extracted, and in order to ensure computational efficiency and end-to-end training, the time sequence candidate sub-network and the behavior classification sub-network share the spatio-temporal features; this is specifically realized as follows:
2-1. compressing the dimensions of the input video frames to 112 × 112 to maximize GPU utilization;
2-2. avoiding vanishing or exploding gradients through residual blocks, allowing the depth of the network to be increased;
2-3. inputting consecutive RGB video frames of size 3 × 112 × 112 into the three-dimensional residual neural network, which finally outputs spatio-temporal features of size 512 × L/8 × 7;
the step (3) is specifically realized as follows:
3-1. for the spatio-temporal features, the time sequence candidate sub-network can quickly generate anchor segments of different sizes and judge the probability that the video within an anchor segment is a target or background, which is used to initially generate candidate list I, wherein an anchor segment is expressed as:
anchor = {c_i, l_i}
wherein c_i denotes the center position of the anchor segment and l_i denotes its length in time;
3-2. anchor segments are distributed over the spatio-temporal features of temporal length L/8, and k anchors are set at each temporal position of the features, so that in the time sequence candidate sub-network each temporal position carries a sequence of k anchor segments of different lengths; that is, the increasing sequence of anchor segment lengths at a given temporal position is:
{a_1, a_2, a_3, …, a_k}
wherein a_k is the k-th anchor at that temporal position;
3-3. if f frames are read per second (fps = f), the temporal lengths covered by these anchor segments at a given position are:
{a_1·8/f, a_2·8/f, a_3·8/f, …, a_k·8/f}
from these anchor segments of different lengths, the boundary regression network can determine the temporal position of an anchor segment;
the step (4) judges through a classification network whether the anchor segments contain actions and determines their boundaries through a boundary regression network, thereby obtaining candidate list I; a series of operations are carried out on the spatio-temporal features generated in step 2-3, and the generated candidate activity segments serve as the input of the next-stage behavior classification network, which is specifically realized as follows:
4-1. adding a three-dimensional convolution kernel of size 3 × 3 × 3 to expand the spatio-temporal receptive field;
4-2. adding a three-dimensional max-pooling kernel of size 1 × H/16 × W/16 for generating a feature map comprising only temporal features;
4-3. adding two convolution kernels of size 1 × 1, after which the size of the finally obtained feature map is 512 × L/8 × 1;
4-4. obtaining candidate list I, namely the candidate list PLIST, through the boundary regression network and the behavior classification network, wherein the specific loss function is as follows:
L = (1/N_cls)·Σ_i L_cls(p_i, p_i*) + λ·(1/N_reg)·Σ_i p_i*·L_reg(t_i, t_i*)
wherein p_i and p_i* denote the predicted and ground-truth classification of anchor segment i, and t_i and t_i* the predicted and ground-truth boundary offsets; N_cls is the classification normalization value, i.e., the batch size; N_reg is the regression normalization value, i.e., the number of anchor segments; i is the index of an anchor segment in the feature map; λ is a weight used to balance the two losses, and since the cls term and the reg term carry almost equal weight, λ is set to 1;
the step (5) uses a Soft-NMS method to remove the highly overlapping, low-confidence activity segments from candidate list I to obtain the final candidate list II, namely the return list RLIST; the specific process is as follows:
5-1. selecting the candidate activity segment M with the highest confidence from the candidate list PLIST, deleting it from the candidate list PLIST and putting it into the return list RLIST;
5-2. for each candidate b_i in the candidate list PLIST with confidence score s_i, if the overlap between b_i and M is greater than the threshold, the confidence is reduced linearly, namely:
s_i ← s_i·(1 − iou(M, b_i))
wherein iou is the intersection over union, i.e., the ratio of the intersection of M and b_i to their union;
5-3. repeating steps 5-1 and 5-2 until the candidate list PLIST is empty;
the step (6) obtains a feature of fixed dimension through a maximum pooling method; ROI pooling is used to extract the fixed-dimension feature I from the spatio-temporal features, i.e., the input of size 512 × L/8 × 7 is max-pooled over a 1 × 4 grid to obtain the final unified dimension 512 × 1 × 4;
the step (7) inputs feature I with the fixed dimension into two fully connected branches simultaneously, wherein two consecutive fully connected layers are connected to a softmax classifier for judging the activity category, and the other two consecutive fully connected layers are connected to a regression layer for refining the time period in which the candidate activity occurs, which is specifically realized as follows:
7-1. adding two fully connected layers;
7-2. adding a boundary regression network for boundary correction, and performing behavior classification through a behavior classification network to obtain the target action category, wherein the specific loss function is as follows:
L = (1/N_cls)·Σ_i L_cls(p_i, p_i*) + λ·(1/N_reg)·Σ_i p_i*·L_reg(t_i, t_i*)
wherein N_cls is the classification normalization value, i.e., the batch size; N_reg is the regression normalization value, i.e., the number of candidate segments; λ is the weight balancing the two losses, still set to 1.
CN201911097168.8A 2019-11-11 2019-11-11 Workflow identification method based on time sequence behavior detection Active CN111104855B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911097168.8A CN111104855B (en) 2019-11-11 2019-11-11 Workflow identification method based on time sequence behavior detection

Publications (2)

Publication Number Publication Date
CN111104855A CN111104855A (en) 2020-05-05
CN111104855B true CN111104855B (en) 2023-09-12

Family

ID=70420741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911097168.8A Active CN111104855B (en) 2019-11-11 2019-11-11 Workflow identification method based on time sequence behavior detection

Country Status (1)

Country Link
CN (1) CN111104855B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860289B (en) * 2020-07-16 2024-04-02 北京思图场景数据科技服务有限公司 Time sequence action detection method and device and computer equipment
CN112149546A (en) * 2020-09-16 2020-12-29 珠海格力电器股份有限公司 Information processing method and device, electronic equipment and storage medium
CN113139530B (en) * 2021-06-21 2021-09-03 城云科技(中国)有限公司 Method and device for detecting sleep post behavior and electronic equipment thereof

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108573246A (en) * 2018-05-08 2018-09-25 北京工业大学 A kind of sequential action identification method based on deep learning
CN109409257A (en) * 2018-10-11 2019-03-01 北京大学深圳研究生院 A kind of video timing motion detection method based on Weakly supervised study
CN109784269A (en) * 2019-01-11 2019-05-21 中国石油大学(华东) One kind is based on the united human action detection of space-time and localization method
CN110188733A (en) * 2019-06-10 2019-08-30 电子科技大学 Timing behavioral value method and system based on the region 3D convolutional neural networks
CN110188654A (en) * 2019-05-27 2019-08-30 东南大学 A kind of video behavior recognition methods not cutting network based on movement

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4221011B2 (en) * 2006-05-31 2009-02-12 株式会社日立製作所 Work motion analysis method, work motion analysis apparatus, and work motion analysis program
US8081809B2 (en) * 2006-11-22 2011-12-20 General Electric Company Methods and systems for optimizing high resolution image reconstruction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant