CN111104855A - Workflow identification method based on time sequence behavior detection - Google Patents

Workflow identification method based on time sequence behavior detection

Info

Publication number
CN111104855A
CN111104855A
Authority
CN
China
Prior art keywords
segments
candidate
time
network
anchor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911097168.8A
Other languages
Chinese (zh)
Other versions
CN111104855B (en)
Inventor
胡海洋
王庆文
李忠金
陈洁
俞佳成
张力
余嘉伟
周美玲
陈振辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201911097168.8A
Publication of CN111104855A
Application granted
Publication of CN111104855B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Abstract

The invention discloses a workflow identification method based on time sequence behavior detection. The invention provides a sparse sampling method for time sequence video, which reduces useless data and accelerates the overall speed of the framework. Meanwhile, to improve both recognition speed and recognition precision, the invention uses a three-dimensional residual network to extract features, ensuring the speed and efficiency of spatio-temporal feature extraction. In the time sequence candidate subnet, to avoid missing candidate segments, the invention replaces NMS with Soft-NMS, thereby ensuring the recall rate of the detection results. Through these strategies, the framework provided by the invention is better suited to identifying workflows in complex factory production environments. The method solves the problem of temporally locating actions in video, effectively utilizes the large volume of intelligent surveillance video generated in a factory environment, detects the categories of activities in the video and the time segments in which they occur through a neural network, and models the workflow, thereby further optimizing the overall production process.

Description

Workflow identification method based on time sequence behavior detection
Technical Field
The invention belongs to the application of computer vision and deep learning to factory production operation behavior identification, and is used for identifying the category of a production operation and the time segment in which it occurs. At present, intelligent surveillance in industrial production generates tens of thousands of valuable video records every day. To make full use of this video data, a workflow identification method is urgently needed that can automatically extract features from large amounts of video data and identify both the category of a factory production operation and the time segment in which it occurs.
Background
With the development of information technology and manufacturing technology, smart manufacturing has become an important trend in industrial production. As a major technological direction of intelligent manufacturing, workflow identification is also undergoing rapid innovation. A workflow is generally viewed as a sequence of independent activities. Traditional workflow recognition mainly adopts process mining, i.e., extracting and analyzing business-execution content from the system logs generated by a business process information system, so as to adjust business processes or production decisions in time.
Thanks to the development of computer vision technology, current workflow identification mainly films the various production activities on a production line through cameras in the workshop, then processes and analyzes the videos to achieve rapid detection of industrial processes. A factory workshop exhibits obvious lighting changes and occlusion between moving objects, which, compared with common scenes, make the workflow-identification scene special, so traditional identification methods that rely on target object detection are difficult to apply. Moreover, because surveillance video is real-time video, workflow identification has real-time requirements on recognition speed.
Meanwhile, as the demand of factory production for workflow identification further increases, different tasks in a workflow often have different execution times, with no clear boundary between where one task ends and the next begins, and workflow identification based on behavior recognition cannot temporally locate the activities in a video. Thus, the present invention shifts the focus of workflow identification from behavior recognition to temporal behavior detection. Unlike workflow identification based on behavior recognition, the workflow identification method based on time sequence behavior detection also locates each activity on the timeline, i.e., its start time and end time. The key to this task lies in the following two points: 1. the temporal boundaries of behaviors: many methods adopt a framework that classifies candidate segments, and for these methods the key is higher-quality candidate segments, i.e., reducing the number of candidate segments while still guaranteeing correct recognition results; 2. the category of the behavior: the category information of the temporal segment must be obtained accurately.
However, production operation behavior identification has its own complexity and specificity. Deep learning approaches have enjoyed great success in image processing, and many classification architectures based on convolutional neural networks have been designed to handle workflow recognition in unprocessed long videos. The invention designs a workflow identification method based on time sequence behavior detection to detect the categories of actions in long, unprocessed factory videos and the time segments in which the actions occur.
Disclosure of Invention
The invention discloses a workflow identification method based on time sequence behavior detection. Compared with a general video scene, the workflow-identification scene has complexity and specificity due to frequent lighting changes in the manufacturing environment, serious occlusion between objects, various noise interferences, and the long continuous working time of workers. In a complex factory environment, workers may perform one production activity for a long period, producing a large number of useless video frames. Aiming at this phenomenon, the invention provides a sparse sampling method for time sequence video, which reduces useless data and accelerates the overall speed of the framework. Meanwhile, to improve both recognition speed and recognition precision, the invention uses a three-dimensional residual network to extract features, ensuring the speed and efficiency of spatio-temporal feature extraction. In the time sequence candidate subnet, to avoid missing candidate segments, the invention replaces NMS with Soft-NMS, thereby ensuring the recall rate of the detection results. Through these strategies, the framework provided by the invention is better suited to identifying workflows in complex factory production environments.
The method comprises the following specific steps:
Step (1): process the video to be processed using a sparse sampling strategy: divide consecutive frames of the video into intervals and sample randomly within each interval, thereby avoiding video redundancy.
Step (2): extract features using a three-dimensional residual network, mainly to reduce training time and the size of the model.
Step (3): acquire candidate activity segments using an anchor mechanism to form anchor segments.
Step (4): judge whether the candidate anchor segments contain actions through a classification network and determine the boundaries of the anchor segments through a boundary regression network, so as to obtain candidate list I.
Step (5): remove the activity segments with high overlap and low confidence from candidate list I using the Soft-NMS method, obtaining the final candidate list II.
Step (6): convert candidate features of arbitrary length into feature I with the fixed dimension 512 × 1 × 4 × 4 by max pooling.
Step (7): input the fixed-dimension feature I into two fully connected branches simultaneously: two consecutive fully connected layers are followed by a softmax classifier for judging the activity category, and the other two consecutive fully connected layers are followed by a regression layer for refining the time period in which the candidate activity occurs.
Step (8): model the workflow according to the obtained action categories and their activity segments, thereby further optimizing the overall production process.
The invention has the following beneficial effects:
The workflow identification method based on time sequence behavior detection provided by the invention mainly contains the following innovations: 1) a sparse sampling method is proposed to process the input video; 2) a three-dimensional residual neural network is used for feature extraction on the input video; 3) candidate segments with high overlap and low confidence are processed using the Soft-NMS method.
To avoid the redundant frames generated when a production activity is carried out for a long time, the input video is processed with the proposed sparse sampling method. Using a three-dimensional residual neural network reduces training time and shrinks the model. To avoid candidate segments with high overlap and low confidence, the invention uses the Soft-NMS method to improve the quality of the candidate segments.
The method solves the problem of time sequence positioning of actions in the video, effectively utilizes a large amount of intelligent monitoring videos generated in a factory environment, detects the types of activities in the videos and the time segments of the activities through the neural network, and models the workflow, thereby further optimizing the whole production flow.
Drawings
Fig. 1 is a schematic diagram of the construction of the three-dimensional residual neural network.
Fig. 2 is a schematic diagram of an anchor point mechanism employed in the present invention.
Fig. 3 is an overall flow from input to output of the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
Related concept definition and symbolic description
f_t: the video frame of the video at time t.
a_k: the size of the k-th anchor at a certain temporal position.
L_cls: the multi-class softmax loss function used to judge the category of an activity segment in the workflow.
L_reg: the L1 smoothing loss function used to optimize the relative offsets between candidate segments and the ground truth.
PLIST: the candidate list, containing confidence scores.
RLIST: the return list obtained after Soft-NMS screening.
ROI: region of interest.
softmax: multi-class classifier, the probability for each class is as follows:
P(y = j | x) = exp(x_j) / Σ_k exp(x_k)
As shown in FIGS. 1 to 3, the workflow identification method based on time sequence behavior detection specifically includes the following steps:
Step (1): redundancy generated during long periods of the same operation is avoided through sparse video sampling; the specific sampling method is as follows:
1-1. Decompose the original video into a sequence of consecutive video frames {f_1, f_2, f_3, …, f_t}.
1-2. Take 4 consecutive frames as an interval and randomly read one frame from each interval, thereby avoiding temporal redundancy while avoiding sampling the frame at the same position every time.
1-3. Input the obtained consecutive random frames, as training samples, into the three-dimensional residual neural network.
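By way of illustration, a minimal Python sketch of this sampling strategy follows; the 4-frame interval and the one-random-frame-per-interval rule come from steps 1-1 to 1-3, while the function name and clip handling are assumptions.

import random

def sparse_sample(frames, interval=4):
    """Pick one random frame from each window of `interval` consecutive frames."""
    sampled = []
    for start in range(0, len(frames) - interval + 1, interval):
        # A random offset inside each window removes temporal redundancy
        # while avoiding always sampling the same position (step 1-2).
        sampled.append(frames[start + random.randrange(interval)])
    return sampled

# Example: a 16-frame clip yields one training frame per 4-frame interval.
print(sparse_sample(list(range(16))))  # e.g. [1, 6, 8, 15]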
Step (2): extract spatio-temporal features using a three-dimensional residual network. To guarantee computational efficiency and end-to-end training, the time sequence candidate sub-network and the behavior classification sub-network share the spatio-temporal features (see FIG. 1).
2-1. Compress the dimensions of the input video frames to 112 × 112 to maximize GPU performance.
2-2. Use residual blocks to avoid vanishing or exploding gradients while increasing the depth of the network.
2-3. Input the consecutive RGB video frames, of size 3 × L × 112 × 112, into the three-dimensional residual neural network; the final network output is a spatio-temporal feature of size 512 × L/8 × 7 × 7.
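A minimal PyTorch-style sketch of such a three-dimensional residual backbone is given below; only the input size 3 × L × 112 × 112 and the output size 512 × L/8 × 7 × 7 come from the text, and the exact layer configuration is an assumption.

import torch
import torch.nn as nn

class ResBlock3D(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(out_ch)
        self.conv2 = nn.Conv3d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(out_ch)
        # 1x1x1 projection so the skip connection matches the main path.
        self.down = (
            nn.Sequential(nn.Conv3d(in_ch, out_ch, 1, stride=stride, bias=False),
                          nn.BatchNorm3d(out_ch))
            if stride != 1 or in_ch != out_ch else nn.Identity()
        )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.down(x))   # residual shortcut (step 2-2)

backbone = nn.Sequential(
    nn.Conv3d(3, 64, 3, stride=(1, 2, 2), padding=1, bias=False),
    nn.BatchNorm3d(64), nn.ReLU(inplace=True),
    ResBlock3D(64, 128, stride=2),
    ResBlock3D(128, 256, stride=2),
    ResBlock3D(256, 512, stride=2),
)
clip = torch.randn(1, 3, 16, 112, 112)   # one clip: 3 x L x 112 x 112, L = 16
print(backbone(clip).shape)              # torch.Size([1, 512, 2, 7, 7]) = 512 x L/8 x 7 x 7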
Step (3): acquire anchor segments of different sizes using the anchor mechanism (see FIG. 2).
3-1. Over the spatio-temporal features, the time sequence candidate sub-network quickly generates anchor segments of different sizes and judges the probability that the video within an anchor segment is target or background, which is used to preliminarily generate candidate list I. An anchor segment is expressed as:
anchor = {c_i, l_i}
where c_i represents the center position of the anchor segment and l_i represents the length of the anchor segment in the temporal sequence.
3-2. The anchor segments are distributed over the spatio-temporal features of temporal length L/8, and k anchors are set at each temporal position, so that in the time sequence candidate sub-network every temporal position of the spatio-temporal features carries a sequence of k anchor segments of different lengths. That is, the increasing-length sequence of anchor segments at a given temporal position is:
{a_1, a_2, a_3, …, a_k}
where a_k is the k-th anchor at the given temporal position.
3-3. If f frames are read per second (FPS = f), then the coverage lengths of these anchor segments at a temporal position are:
{a_1 * 8/f, a_2 * 8/f, a_3 * 8/f, …, a_k * 8/f}
For anchor segments of these different lengths, the temporal position of an anchor segment is then determined by the boundary regression network.
Step (4): judge whether the anchor segments contain actions through the classification network and determine the boundaries of the anchor segments through the boundary regression network, so as to obtain candidate list I. A series of operations is performed on the spatio-temporal features generated in step 2-3, and the generated candidate activity segments serve as the input of the behavior classification network of the next stage.
4-1. Add a three-dimensional convolution kernel of size 3 × 3 × 3 to expand the spatio-temporal receptive field.
4-2. Add a three-dimensional max pooling kernel of size 1 × H/16 × W/16 to generate a feature map containing only temporal features.
4-3. After two 1 × 1 × 1 convolution operations are added, the finally obtained feature size is 512 × L/8 × 1 × 1.
4-4. Obtain candidate list I (candidate list PLIST) through the boundary regression network and the behavior classification network. The specific loss function is as follows:
L({p_i}, {t_i}) = (1/N_cls) * Σ_i L_cls(p_i, p_i*) + λ * (1/N_reg) * Σ_i p_i* * L_reg(t_i, t_i*)
where N_cls is the classification normalization value, i.e., the batch size; N_reg is the regression normalization value, i.e., the number of anchor segments; i is the index of an anchor segment in the feature map; p_i is the predicted probability that anchor segment i contains an action and p_i* is its ground-truth label; t_i and t_i* are the predicted and ground-truth boundary offsets; λ is a weight that balances the two losses, and λ = 1 since the cls and reg terms are roughly equally weighted.
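Under these definitions, such a two-term loss can be sketched in PyTorch as below; the tensor shapes, the background label 0, and the gating of the regression term by positive anchors are assumptions consistent with the formula above.

import torch
import torch.nn.functional as F

def joint_loss(cls_logits, labels, reg_pred, reg_target, lam=1.0):
    # cls_logits: (N, num_classes); labels: (N,) long tensor, 0 = background.
    # reg_pred / reg_target: (N, 2) offsets of (center, length).
    n_cls = cls_logits.shape[0]                  # N_cls normalization
    n_reg = labels.numel()                       # N_reg: number of anchor segments
    l_cls = F.cross_entropy(cls_logits, labels, reduction="sum") / n_cls
    pos = labels > 0                             # p_i* gates the regression term
    l_reg = F.smooth_l1_loss(reg_pred[pos], reg_target[pos], reduction="sum") / n_reg
    return l_cls + lam * l_reg                   # lambda = 1 per the text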
Step (5): remove the activity segments with high overlap and low confidence from candidate list I (candidate list PLIST) using the Soft-NMS method, obtaining the final candidate list II (return list RLIST). The scores are decayed by a linear method instead of being reset directly, so that while guaranteeing precision the method avoids, as far as possible, missing segments whose scores are only slightly low. The specific procedure is as follows:
and 5-1, selecting a candidate active segment M with the highest confidence from the candidate list PLIST, deleting the candidate active segment M from the candidate list RLIST and placing the candidate active segment M into the return list RLIST.
5-2. For each candidate segment b_i in the candidate list PLIST, with confidence score s_i: if the computed overlap between b_i and M is greater than the threshold, its confidence is reduced in a linear manner, namely:
s_i ← s_i * (1 − IoU(M, b_i))
where IoU is the intersection-over-union, i.e., the ratio of the intersection and the union of M and b_i.
5-3. repeat steps 5-1 and 5-2 until the candidate list PLIST is empty.
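A self-contained Python sketch of this linear Soft-NMS follows; segments are modeled as (start, end, score) tuples, and the overlap threshold and minimum-score cutoff are assumptions.

def temporal_iou(a, b):
    """Intersection-over-union of two 1-D segments (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(plist, iou_thresh=0.5, min_score=0.001):
    rlist = []
    plist = list(plist)                          # working copy of PLIST
    while plist:                                 # 5-3: until PLIST is empty
        m = max(plist, key=lambda seg: seg[2])   # 5-1: highest-confidence segment M
        plist.remove(m)
        rlist.append(m)
        rescored = []
        for start, end, score in plist:          # 5-2: linear decay of overlaps
            iou = temporal_iou(m, (start, end))
            if iou > iou_thresh:
                score *= 1.0 - iou               # s_i <- s_i * (1 - IoU(M, b_i))
            if score > min_score:                # discard near-zero scores
                rescored.append((start, end, score))
        plist = rescored
    return rlist                                 # return list RLIST

print(soft_nms([(0, 10, 0.9), (1, 9, 0.8), (20, 30, 0.7)]))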
Step (6): obtain fixed-dimension features by max pooling; ROI pooling is used to extract the fixed-dimension feature I from the spatio-temporal features, i.e., the input of size 512 × L/8 × 7 × 7 is max-pooled over a 1 × 4 × 4 grid to obtain the final uniform size 512 × 1 × 4 × 4.
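In PyTorch this pooling can be sketched in one step as below; the candidate's temporal span and the value of L/8 are illustrative assumptions.

import torch
import torch.nn.functional as F

features = torch.randn(512, 24, 7, 7)             # 512 x L/8 x 7 x 7, with L/8 = 24
t0, t1 = 5, 17                                    # temporal span of one candidate
roi = features[:, t0:t1]                          # 512 x 12 x 7 x 7: arbitrary length
fixed = F.adaptive_max_pool3d(roi, (1, 4, 4))     # pooled to the fixed size
print(fixed.shape)                                # torch.Size([512, 1, 4, 4])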
Step (7): input the fixed-dimension feature I into two fully connected branches simultaneously, where two consecutive fully connected layers are followed by a softmax classifier for judging the activity category, and the other two consecutive fully connected layers are followed by a regression layer for refining the time period in which the candidate activity occurs. This is implemented as follows:
7-1. Add two fully connected layers.
7-2. Add a boundary regression network for boundary correction, and perform behavior classification through the behavior classification network to obtain the target action class. The specific loss function is as follows:
L({p_i}, {t_i}) = (1/N_cls) * Σ_i L_cls(p_i, p_i*) + λ * (1/N_reg) * Σ_i p_i* * L_reg(t_i, t_i*)
where N_cls is the classification normalization value, i.e., the batch size; N_reg is the regression normalization value, i.e., the number of candidate segments; λ is a weight that balances the two losses, still set to 1.
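A PyTorch sketch of this two-branch head is given below; the two parallel stacks of two fully connected layers, the softmax classifier, and the boundary regression layer follow steps 7-1 and 7-2, while the hidden width and the number of activity classes are assumptions.

import torch
import torch.nn as nn

class TwoBranchHead(nn.Module):
    def __init__(self, in_dim=512 * 1 * 4 * 4, hidden=4096, num_classes=21):
        super().__init__()
        # Two parallel stacks of two fully connected layers (step 7-1).
        self.cls_fc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())
        self.reg_fc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())
        self.cls = nn.Linear(hidden, num_classes)   # softmax -> activity category
        self.reg = nn.Linear(hidden, 2)             # refined (center, length) offsets

    def forward(self, x):
        x = torch.flatten(x, 1)                     # flatten the 512 x 1 x 4 x 4 feature
        probs = torch.softmax(self.cls(self.cls_fc(x)), dim=1)
        offsets = self.reg(self.reg_fc(x))
        return probs, offsets

head = TwoBranchHead()
probs, offsets = head(torch.randn(3, 512, 1, 4, 4))
print(probs.shape, offsets.shape)    # torch.Size([3, 21]) torch.Size([3, 2])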
Step (8): determine the relations among tasks according to the target action categories and the time segments in which they occur, thereby identifying the tasks in the workflow and analyzing the production efficiency of the plant's overall equipment.

Claims (8)

1. A workflow identification method based on time sequence behavior detection is characterized by comprising the following steps:
step (1): processing the video to be processed using a sparse sampling strategy, dividing consecutive frames of the video into intervals and sampling randomly within each interval, thereby avoiding video redundancy;
step (2): extracting features using a three-dimensional residual network, reducing training time and the size of the model;
step (3): acquiring candidate activity segments using an anchor mechanism to form anchor segments;
step (4): judging whether the anchor segments contain actions through a classification network and determining the boundaries of the anchor segments through a boundary regression network, so as to obtain candidate list I;
step (5): removing the activity segments with high overlap and low confidence from candidate list I using the Soft-NMS method, to obtain the final candidate list II;
step (6): converting candidate features of arbitrary length into feature I with the fixed dimension 512 × 1 × 4 × 4 by max pooling;
step (7): inputting the fixed-dimension feature I into two fully connected branches simultaneously, wherein two consecutive fully connected layers are followed by a softmax classifier for judging the activity category, and the other two consecutive fully connected layers are followed by a regression layer for refining the time period in which the candidate activity occurs;
step (8): modeling the workflow according to the obtained action categories and their activity segments, thereby further optimizing the overall production process.
2. The workflow identification method based on the time-series behavior detection according to claim 1, wherein:
the related concept definitions and notations are as follows:
f_t: the video frame of the video at time t;
a_k: the size of the k-th anchor at a certain temporal position;
L_cls: the multi-class softmax loss function used to judge the category of an activity segment in the workflow;
L_reg: the L1 smoothing loss function used to optimize the relative offsets between candidate segments and the ground truth;
PLIST: the candidate list, containing confidence scores;
RLIST: the return list obtained after Soft-NMS screening;
ROI: region of interest;
softmax: multi-class classifier, the probability for each class is as follows:
P(y = j | x) = exp(x_j) / Σ_k exp(x_k)
the specific sampling mode of the step (1) is as follows:
1-1. decomposing the original video into a sequence of consecutive video frames {f_1, f_2, f_3, …, f_t};
1-2, taking 4 continuous frames as an interval, and randomly reading one frame in one interval every time, thereby avoiding the redundancy in time sequence and avoiding the acquisition of video frames at the same position every time;
1-3. inputting the obtained consecutive random frames, serving as training samples, into the three-dimensional residual neural network.
3. The workflow identification method based on time series behavior detection as claimed in claim 2, wherein the step (2) uses a three-dimensional residual network to extract spatio-temporal features from the input video frames and, to ensure computational efficiency and end-to-end training, the spatio-temporal features are shared by the time sequence candidate sub-network and the behavior classification sub-network, implemented as follows:
2-1. compressing the dimension of the input video frame to 112 x 112 to maximize GPU performance;
2-2. avoiding vanishing or exploding gradients through residual blocks and increasing the depth of the network;
2-3. inputting the consecutive RGB video frames, of size 3 × L × 112 × 112, into the three-dimensional residual neural network, the final network output being a spatio-temporal feature of size 512 × L/8 × 7 × 7.
4. The workflow identification method based on the time series behavior detection as claimed in claim 3, wherein the step (3) is implemented as follows:
3-1. over the spatio-temporal features, the time sequence candidate sub-network quickly generates anchor segments of different sizes and judges the probability that the video within an anchor segment is target or background, which is used to preliminarily generate candidate list I, an anchor segment being expressed as:
anchor = {c_i, l_i}
where c_i represents the center position of the anchor segment and l_i represents the length of the anchor segment in the temporal sequence;
3-2. the anchor segments are distributed over the spatio-temporal features of temporal length L/8, and k anchors are set at each temporal position, so that in the time sequence candidate sub-network every temporal position of the spatio-temporal features carries a sequence of k anchor segments of different lengths; that is, the increasing-length sequence of anchor segments at a given temporal position is:
{a_1, a_2, a_3, …, a_k}
where a_k is the k-th anchor at the given temporal position;
3-3. if f frames are read per second (FPS = f), the coverage lengths of these anchor segments at a temporal position are:
{a_1 * 8/f, a_2 * 8/f, a_3 * 8/f, …, a_k * 8/f}
and for anchor segments of these different lengths, the temporal position of an anchor segment is then determined by the boundary regression network.
5. The workflow identification method based on temporal behavior detection as claimed in claim 4, wherein the step (4) judges whether the anchor segments contain actions through the classification network and determines the boundaries of the anchor segments through the boundary regression network, thereby obtaining candidate list I; a series of operations is performed on the spatio-temporal features generated in step 2-3, and the generated candidate activity segments serve as the input of the behavior classification network of the next stage, which is implemented as follows:
4-1. adding a three-dimensional convolution kernel of size 3 × 3 × 3 to expand the spatio-temporal receptive field;
4-2. adding a three-dimensional max pooling kernel of size 1 × H/16 × W/16 to generate a feature map containing only temporal features;
4-3. after two 1 × 1 × 1 convolution operations are added, the size of the finally obtained feature map being 512 × L/8 × 1 × 1;
4-4, obtaining a candidate list I (candidate list PLIST) through a boundary regression network and a behavior classification network, wherein the specific loss function is as follows:
L({p_i}, {t_i}) = (1/N_cls) * Σ_i L_cls(p_i, p_i*) + λ * (1/N_reg) * Σ_i p_i* * L_reg(t_i, t_i*)
where N_cls is the classification normalization value, i.e., the batch size; N_reg is the regression normalization value, i.e., the number of anchor segments; i is the index of an anchor segment in the feature map; p_i is the predicted probability that anchor segment i contains an action and p_i* is its ground-truth label; t_i and t_i* are the predicted and ground-truth boundary offsets; λ is a weight that balances the two losses, and λ = 1 since the cls and reg terms are roughly equally weighted.
6. The workflow identification method based on chronological behavior detection as claimed in claim 5, wherein the step (5) uses a Soft-NMS method to remove the active segments with high overlap and low confidence in the candidate list i (candidate list PLIST) to obtain the final candidate list ii (return list RLIST), and the specific procedures are as follows:
5-1, selecting a candidate active segment M with the maximum confidence coefficient from the candidate list PLIST, deleting the candidate active segment M from the candidate list PLIST and placing the candidate active segment M into a return list RLIST;
5-2. for each candidate segment b_i in the candidate list PLIST, with confidence score s_i: if the computed overlap between b_i and M is greater than the threshold, reducing its confidence linearly, namely:
s_i ← s_i * (1 − IoU(M, b_i))
wherein IoU is the intersection-over-union, i.e., the ratio of the intersection and the union of M and b_i;
5-3. repeat steps 5-1 and 5-2 until the candidate list PLIST is empty.
7. The workflow identification method based on time series behavior detection as claimed in claim 6, wherein the step (6) obtains fixed-dimension features by max pooling, using ROI pooling to extract the fixed-dimension feature I from the spatio-temporal features, i.e., the input of size 512 × L/8 × 7 × 7 is max-pooled over a 1 × 4 × 4 grid to obtain the final uniform size 512 × 1 × 4 × 4.
8. The workflow identification method based on time series behavior detection as claimed in claim 7, wherein the step (7) inputs the fixed-dimension feature I into two fully connected branches simultaneously, wherein two consecutive fully connected layers are followed by a softmax classifier for judging the activity category, and the other two consecutive fully connected layers are followed by a regression layer for refining the time period in which the candidate activity occurs, which is implemented as follows:
7-1, adding two full connection layers;
7-2, adding a boundary regression network for boundary correction, and performing behavior classification through a behavior classification network to obtain a target action class, wherein the specific loss function is as follows:
L({p_i}, {t_i}) = (1/N_cls) * Σ_i L_cls(p_i, p_i*) + λ * (1/N_reg) * Σ_i p_i* * L_reg(t_i, t_i*)
where N_cls is the classification normalization value, i.e., the batch size; N_reg is the regression normalization value, i.e., the number of candidate segments; λ is a weight that balances the two losses, still set to 1.
CN201911097168.8A 2019-11-11 2019-11-11 Workflow identification method based on time sequence behavior detection Active CN111104855B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911097168.8A CN111104855B (en) 2019-11-11 2019-11-11 Workflow identification method based on time sequence behavior detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911097168.8A CN111104855B (en) 2019-11-11 2019-11-11 Workflow identification method based on time sequence behavior detection

Publications (2)

Publication Number Publication Date
CN111104855A (en) 2020-05-05
CN111104855B CN111104855B (en) 2023-09-12

Family

ID=70420741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911097168.8A Active CN111104855B (en) 2019-11-11 2019-11-11 Workflow identification method based on time sequence behavior detection

Country Status (1)

Country Link
CN (1) CN111104855B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070282479A1 (en) * 2006-05-31 2007-12-06 Hitachi, Ltd. Work movement analysis method, work movement analysis apparatus, and work movement analysis program
US20080118021A1 (en) * 2006-11-22 2008-05-22 Sandeep Dutta Methods and systems for optimizing high resolution image reconstruction
CN108573246A (en) * 2018-05-08 2018-09-25 北京工业大学 A kind of sequential action identification method based on deep learning
CN109409257A (en) * 2018-10-11 2019-03-01 北京大学深圳研究生院 A kind of video timing motion detection method based on Weakly supervised study
CN109784269A (en) * 2019-01-11 2019-05-21 中国石油大学(华东) One kind is based on the united human action detection of space-time and localization method
CN110188654A (en) * 2019-05-27 2019-08-30 东南大学 A kind of video behavior recognition methods not cutting network based on movement
CN110188733A (en) * 2019-06-10 2019-08-30 电子科技大学 Timing behavioral value method and system based on the region 3D convolutional neural networks

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860289A (en) * 2020-07-16 2020-10-30 北京思图场景数据科技服务有限公司 Time sequence action detection method and device and computer equipment
CN111860289B (en) * 2020-07-16 2024-04-02 北京思图场景数据科技服务有限公司 Time sequence action detection method and device and computer equipment
CN112149546A (en) * 2020-09-16 2020-12-29 珠海格力电器股份有限公司 Information processing method and device, electronic equipment and storage medium
CN112149546B (en) * 2020-09-16 2024-05-03 珠海格力电器股份有限公司 Information processing method, device, electronic equipment and storage medium
CN113139530A (en) * 2021-06-21 2021-07-20 城云科技(中国)有限公司 Method and device for detecting sleep post behavior and electronic equipment thereof
CN113139530B (en) * 2021-06-21 2021-09-03 城云科技(中国)有限公司 Method and device for detecting sleep post behavior and electronic equipment thereof

Also Published As

Publication number Publication date
CN111104855B (en) 2023-09-12

Similar Documents

Publication Publication Date Title
CN110516536B (en) Weak supervision video behavior detection method based on time sequence class activation graph complementation
Zhao et al. Cloud shape classification system based on multi-channel cnn and improved fdm
CN108416266B (en) Method for rapidly identifying video behaviors by extracting moving object through optical flow
CN109272509B (en) Target detection method, device and equipment for continuous images and storage medium
CN110287777B (en) Golden monkey body segmentation algorithm in natural scene
CN110322445B (en) Semantic segmentation method based on maximum prediction and inter-label correlation loss function
CN111400536B (en) Low-cost tomato leaf disease identification method based on lightweight deep neural network
CN111860587B (en) Detection method for small targets of pictures
CN111797846B (en) Feedback type target detection method based on characteristic pyramid network
CN111738344A (en) Rapid target detection method based on multi-scale fusion
CN114220061B (en) Multi-target tracking method based on deep learning
CN110866938B (en) Full-automatic video moving object segmentation method
CN113763424B (en) Real-time intelligent target detection method and system based on embedded platform
CN110827312A (en) Learning method based on cooperative visual attention neural network
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN113052006A (en) Image target detection method and system based on convolutional neural network and readable storage medium
CN111104855A (en) Workflow identification method based on time sequence behavior detection
CN113362277A (en) Workpiece surface defect detection and segmentation method based on deep learning
CN115761568A (en) Kiwi detection method based on YOLOv7 network and Deepsort network
CN117011655A (en) Adaptive region selection feature fusion based method, target tracking method and system
CN116630850A (en) Twin target tracking method based on multi-attention task fusion and bounding box coding
CN110570450A (en) Target tracking method based on cascade context-aware framework
CN112528058B (en) Fine-grained image classification method based on image attribute active learning
CN114463269A (en) Chip defect detection method based on deep learning method
Yu et al. Precise segmentation of remote sensing cage images based on SegNet and voting mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant