CN107194559B - Workflow identification method based on three-dimensional convolutional neural network - Google Patents

Workflow identification method based on three-dimensional convolutional neural network Download PDF

Info

Publication number
CN107194559B
Authority
CN
China
Prior art keywords
frame
workflow
neural network
video
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710335309.XA
Other languages
Chinese (zh)
Other versions
CN107194559A (en)
Inventor
胡海洋
丁佳民
陈洁
胡华
程凯明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Taoyi Data Technology Co ltd
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201710335309.XA priority Critical patent/CN107194559B/en
Publication of CN107194559A publication Critical patent/CN107194559A/en
Application granted granted Critical
Publication of CN107194559B publication Critical patent/CN107194559B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0633Workflow analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/215Motion-based segmentation


Abstract

The invention discloses a workflow identification method based on a three-dimensional convolutional neural network. Existing approaches divide the different process tasks in advance and manually label the different action behaviors while analyzing the video, which does not meet the automation requirement of intelligent manufacturing. The invention first provides an inter-frame difference method with an adaptive threshold, mainly used to segment the regions of moving objects from a complex background, thereby reducing the time complexity of subsequent feature extraction and model training; second, the 3D convolutional neural network is improved so that it can fully adapt to a factory environment with multiple monitoring devices, and views from different angles are fused by weight in a view pooling layer; finally, a new action division method is provided that automatically divides continuous production actions in the video, thereby realizing an automated workflow identification process.

Description

Workflow identification method based on three-dimensional convolutional neural network
Technical Field
The invention belongs to the technical field of workflow identification and is used for quickly and accurately identifying and detecting production and manufacturing processes.
Background
Intelligent manufacturing is a further development of manufacturing automation: artificial intelligence techniques are widely applied to links of the industrial manufacturing process such as engineering design, process design, production scheduling and fault diagnosis, making the manufacturing process intelligent and greatly improving productivity. Workflow recognition has attracted attention from industry and the research community as an important technical direction of intelligent manufacturing. Cameras installed in a manufacturing workshop capture the whole process of production scheduling on a production line; the video is then processed so that the industrial production flow can be identified and monitored quickly and accurately, which plays an important role in protecting the personal safety of staff, reducing production overhead, ensuring product quality, and optimizing production scheduling and process specification.
However, workflow identification has its own complexities and particularities. First, a production workshop contains many machines, transport vehicles, auxiliary equipment and other objects that often occlude one another, and the similarity of different process operations together with frequent changes of light intensity in the workshop challenge video and image analysis. Furthermore, the dynamic nature of the production workflow makes the identification process complex and prone to bias: different tasks in a workflow tend to have different execution times, and there is no explicit boundary between the start and end of a task; these tasks may involve both human and machine actions, and actions irrelevant to the workflow must be distinguished from the actual production tasks. These aspects make conventional motion/pose recognition methods that rely on target detection and tracking difficult to adapt to complex factory manufacturing environments. In addition, although some researchers have studied workflow identification, how to automatically divide the production processes/actions of the image sequence in a video has not been clearly defined; most existing work divides the different process tasks in advance and manually labels the different action behaviors while analyzing the video, which obviously does not meet the automation requirement of intelligent manufacturing.
Disclosure of Invention
To address this state of research, the invention provides a workflow identification framework with stronger robustness. In this framework, an inter-frame difference method with an adaptive threshold is first provided, mainly used to segment the regions of moving objects from a complex background, thereby reducing the time complexity of subsequent feature extraction and model training; second, the 3D convolutional neural network is improved so that it can fully adapt to a factory environment with multiple monitoring devices, and views from different angles are fused by weight in a view pooling layer; finally, a new action division method is provided that automatically divides continuous production actions in the video, thereby realizing an automated workflow identification process.
The method comprises the following specific steps:
step (1), exporting a workflow video containing multiple visual angles from a data set, and acquiring the video resolution and the frame number of the workflow video at each visual angle;
step (2), initializing the inter-frame difference threshold of the workflow video of each visual angle; performing steps (3) to (11) on the workflow video of each visual angle respectively;
step (3), setting t to be 2;
step (4), reading three consecutive video frames t-1, t and t+1, and performing graying and median filtering on the three frames;
step (5), performing an inter-frame difference operation on the first pair of frames (t-1, t) and the second pair of frames (t, t+1) respectively to obtain two inter-frame difference images;
step (6), dynamically updating an interframe difference threshold according to the two interframe difference images obtained in the step (5); the method for dynamically updating the interframe difference threshold comprises the following steps:
6.1 setting l = 1; the inter-frame difference threshold of frame t is initialized as τ_1^t = (max{d_k} + min{d_k}) / 2, where d_k is the pixel value of the k-th pixel in the inter-frame difference image, max{d_k} is the maximum pixel value in the difference image, and min{d_k} is the minimum pixel value in the difference image;
6.2 letting τ_{l+1}^t = (1/2) · [ (1/N_1) · Σ_{d_k ≥ τ_l^t} d_k + (1/N_2) · Σ_{d_k < τ_l^t} d_k ], where N_1 and N_2 are the total numbers of pixels with d_k ≥ τ_l^t and with d_k < τ_l^t, respectively;
6.3 if τ_{l+1}^t = τ_l^t, assigning τ_{l+1}^t to τ_1^t as the final threshold of frame t; otherwise, letting l = l + 1 and repeating step 6.2;
step (7), performing binarization on the current frame according to the inter-frame difference threshold obtained in step (6): pixels larger than the threshold are set to 1, and pixels smaller than the threshold are set to 0;
step (8), performing an AND operation on the two inter-frame difference images to obtain a three-frame difference image, and obtaining the center coordinates of the interest points by a blob extraction method;
step (9), segmenting the extracted interest points from the original image of the current frame;
step (10), incrementing t by 1 and repeating steps (4) to (9) until t equals the index of the last frame of the workflow video minus 1, keeping the segmentation size of step (9) unchanged throughout; storing the interest-point images obtained in step (9) in each iteration, in order, as interest-point videos, and classifying the interest-point videos according to the classification rules of the data set;
step (11), randomly selecting 90% of the interest point videos obtained in the step (10) as a training set, and taking the rest as a test set;
step (12), constructing a multi-view three-dimensional convolution neural network, and initializing the number of training rounds to be 5000; the multi-view three-dimensional convolution neural network construction method comprises the following steps:
12.1 convolution and pooling operations are as follows:
initializing a four-dimensional convolution kernel of size 9 × 10 for the first convolution layer, with a sigmoid activation function; the first pooling layer has a window size of 2 and a stride of 2;
initializing a four-dimensional convolution kernel of size 9 × 7 × 30 for the second convolution layer, with a sigmoid activation function; the second pooling layer has a window size of 2 and a stride of 2;
initializing a four-dimensional convolution kernel of size 9 × 8 × 5 × 50 for the third convolution layer, with a sigmoid activation function; the third pooling layer has a window size of 2 and a stride of 2;
initializing a four-dimensional convolution kernel of size 4 × 3 × 150 for the fourth convolution layer, with a sigmoid activation function; the fourth pooling layer has a window size of 2 and a stride of 2;
12.2 initializing each feature map weight parameter α_{t1} in the weighted-average view pooling layer to a random value in [0,1], with Σ_{t1} α_{t1} = 1; the weighted-average view pooling operation in the weighted-average view pooling layer is
a = Σ_{t1} [ exp(α_{t1}) / Σ_{t1′} exp(α_{t1′}) ] · p_{t1},
where a is the weighted-average feature map after the weighted-average view pooling operation, t1 is the serial number of a pooled feature map after the convolution and pooling operations, α_{t1} is the weight of the pooled feature map with serial number t1, exp denotes the exponential function with base e, and p_{t1} is the pooled feature map with serial number t1;
12.3, respectively initializing a convolution kernel of 3000 × 1500 and 1500 × 750 for the first two fully-connected layers, and setting an activation function as Relu; inputting the weighted average characteristic graph after the weighted average view pooling operation into the front two fully-connected layers;
12.4 initialize a 750 × 14 convolution kernel for the last fully connected layer and set the Softmax classification function.
Step (13), randomly selecting 20 videos from a training set corresponding to the workflow videos of each visual angle, inputting the 20 videos into the multi-view three-dimensional convolution neural network in the step (12) for feature training, and outputting training errors;
step (14), randomly selecting 10 videos from a training set corresponding to the workflow videos of each visual angle, inputting the 10 videos into a multi-view three-dimensional convolutional neural network for verification, and obtaining the accuracy of classification and identification of the multi-view three-dimensional convolutional neural network;
step (15), repeating the steps (13) to (14), and subtracting 1 from the number of training rounds each time until the number of training rounds is 0 to obtain a trained multi-view three-dimensional convolution neural network;
step (16), testing the multi-view three-dimensional convolution neural network in the step (15) by using a test set corresponding to the workflow video of each visual angle;
step (17), acquiring the resolution and the frame number of the newly input workflow video, and initializing an interframe difference threshold; setting t to be 2;
step (18), extracting the center coordinates of the interest points in two adjacent frames according to steps (4) to (8), and calculating the distance between the two center coordinates; if the distance is greater than a set threshold T, marking the frame as a motion state S1; otherwise, marking it as a relatively static state S0;
step (19), incrementing t by 1 and repeating step (18) until t equals the index of the last frame of the newly input workflow video minus 1; counting the numbers of consecutive S0 and S1 states; when the number of consecutive S0 or S1 states is greater than or equal to N, segmenting the target interest points in the frames corresponding to those consecutive S0 or S1 states and storing them in a frame queue; otherwise, discarding the frames corresponding to those consecutive S0 or S1 states.
step (20), for each set of frames in the frame queue corresponding to consecutive S0 or S1 states, extracting consecutive key frames starting from the i-th frame, where i > 5, so that the number of key frames is the same as the number of frames of each classified video segment in the data set.
Step (21), inputting the videos formed by the key frames in the step (20) according to the sequence into the multi-view three-dimensional convolution neural network trained in the step (15) to classify and recognize the staff behaviors;
and (22) comparing the behavior type obtained in the step (21) with a predefined standard workflow.
The invention has the following beneficial effects:
the workflow identification method based on the three-dimensional convolution neural network mainly comprises the following functional modules: the device comprises a moving object segmentation module, a behavior identification module and an action division module.
The moving target segmentation module mainly segments target interest points from the image and video sequences. Because the target motion in a workflow video sequence is relatively large while the background is basically static, two adjacent frames can be subtracted to obtain an inter-frame difference image, and the moving target can then be segmented according to the relation between the pixel difference and the threshold. The adaptive three-frame difference method adopted here performs an AND operation on the inter-frame difference images obtained from the first pair and the second pair of the three video frames to obtain a three-frame difference image, and the threshold is adjusted automatically according to the preceding inter-frame difference image, which effectively suppresses the influence of noise;
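As an illustration only, this adaptive three-frame difference could be sketched in Python with OpenCV and NumPy as below; the function names, the 5-pixel median filter and the convergence tolerance eps are assumptions of the sketch, not values taken from the patent.

```python
import cv2
import numpy as np

def adaptive_threshold(diff, eps=0.5):
    """Iteratively refine the threshold of one inter-frame difference image
    (the scheme of steps 6.1-6.3: mean of the two class means until convergence)."""
    tau = (float(diff.max()) + float(diff.min())) / 2.0       # initial threshold
    while True:
        high, low = diff[diff >= tau], diff[diff < tau]
        new_tau = 0.5 * ((high.mean() if high.size else 0.0)
                         + (low.mean() if low.size else 0.0))
        if abs(new_tau - tau) < eps:                          # converged
            return new_tau
        tau = new_tau

def three_frame_difference(prev_bgr, cur_bgr, next_bgr):
    """Binary moving-region mask of the middle frame from three consecutive frames."""
    gray = [cv2.medianBlur(cv2.cvtColor(f, cv2.COLOR_BGR2GRAY), 5)
            for f in (prev_bgr, cur_bgr, next_bgr)]
    d1 = cv2.absdiff(gray[1], gray[0])                        # first pair (t-1, t)
    d2 = cv2.absdiff(gray[2], gray[1])                        # second pair (t, t+1)
    b1 = (d1 > adaptive_threshold(d1)).astype(np.uint8)
    b2 = (d2 > adaptive_threshold(d2)).astype(np.uint8)
    return cv2.bitwise_and(b1, b2)                            # AND of the two binary maps
```

The returned mask can then be passed to a blob-extraction routine such as cv2.connectedComponentsWithStats to obtain the interest-point centers used in the later steps.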
The behavior recognition module performs behavior recognition on the moving target using a 3D convolutional neural network with multi-view learning capability. To achieve multi-view fusion, a view-pooling layer is used to fuse the information of all views. The multi-view 3D-CNN contains several independent 3D-CNNs that extract features from the image sequences of different views; the feature descriptors extracted from different views are then fused in the view pooling layer, where view-related features are learned; finally, a fully connected neural network (FNN) with a softmax classifier performs the final identification;
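A minimal sketch of the weighted-average view pooling in plain NumPy, assuming (consistently with the formula of step 12.2) that the per-view pooled feature maps are combined with softmax-normalized learnable weights; the array shapes and names are illustrative.

```python
import numpy as np

def weighted_average_view_pooling(view_features, alpha):
    """Fuse per-view pooled feature maps into a single weighted-average map.

    view_features: array of shape (V, ...) holding one pooled feature map per view.
    alpha:         array of shape (V,) with one learnable weight per view.
    """
    w = np.exp(alpha) / np.exp(alpha).sum()                 # normalized view weights
    return np.tensordot(w, view_features, axes=([0], [0]))  # weighted sum over views

# usage: three views, each yielding a 50-channel 4x4x4 pooled volume (illustrative sizes)
views = np.random.rand(3, 50, 4, 4, 4)
fused = weighted_average_view_pooling(views, np.random.rand(3))   # shape (50, 4, 4, 4)
```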
The action partitioning module defines two states: a motion state and a relatively static state. The center coordinate of the interest point is taken for each frame; when the interest point moves, its center coordinate moves as well. The difference between the interest-point center coordinates of two adjacent frames can therefore represent the state of the current interest point, and dynamic/static partitioning is achieved in this way;
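A sketch of this state-based division, assuming the per-frame interest-point centers have already been extracted; the distance threshold T and the minimum run length N correspond to the parameters of steps (18) and (19), and the helper names are illustrative.

```python
import math

S0, S1 = "static", "moving"          # relatively static / motion states

def label_states(centers, T):
    """Label each pair of adjacent frames from the displacement of the interest-point center."""
    states = []
    for (x1, y1), (x2, y2) in zip(centers[:-1], centers[1:]):
        states.append(S1 if math.hypot(x2 - x1, y2 - y1) > T else S0)
    return states

def split_actions(states, N):
    """Keep only runs of identical states with length >= N; each kept run is one
    candidate action segment given as (start index, end index, state)."""
    segments, start = [], 0
    for i in range(1, len(states) + 1):
        if i == len(states) or states[i] != states[start]:
            if i - start >= N:
                segments.append((start, i - 1, states[start]))
            start = i
    return segments
```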
The workflow identification method provided by the invention effectively addresses two problems of workflow identification in complex environments: first, the mutual occlusion of machines, transport vehicles, auxiliary instruments and other objects in a production workshop, the similarity of different process operations, and the influence of frequent light-intensity changes in the workshop on workflow identification; second, how to automatically divide the production processes/actions of the image sequence in a video.
Drawings
FIG. 1 is a schematic diagram of a multi-view three-dimensional convolutional neural network construction;
fig. 2 is a schematic diagram of the division of the working flow.
Detailed Description
The invention is further illustrated by the following figures and examples.
First, concept definition and symbol description are performed:
τ_l^t: inter-frame difference threshold, where t denotes the current frame number and l ≥ 1 denotes the recursion order; d_k is the pixel value of the k-th pixel in the inter-frame difference image, max{d_k} is the maximum pixel value in the difference image, and min{d_k} is the minimum pixel value in the difference image; N_1 and N_2 denote the total numbers of pixels with d_k ≥ τ_l^t and with d_k < τ_l^t, respectively.
a: the weighted-average feature map after the weighted-average view pooling operation.
t1: serial number of a pooled feature map after the convolution and pooling operations.
α_{t1}: weight of the pooled feature map with serial number t1.
p_{t1}: pooled feature map with serial number t1.
Secondly, the workflow identification method based on the three-dimensional convolution neural network comprises the following implementation steps:
(1) Moving object segmentation: video monitoring equipment on a production line is usually mounted at a high position, so most of the area in the monitoring picture is factory background irrelevant to workflow identification; extracting feature vectors directly from the whole picture would greatly increase the difficulty of feature extraction and the computation time. Therefore, a three-frame difference method with an adaptive threshold is used to segment the moving object (interest point) parts of the video, reducing the workload of the later steps. Specifically, the method comprises the following steps:
(1.1) exporting the multi-view workflow video from the data set, and acquiring the video resolution and the frame number of the workflow video at each view;
(1.2) initializing the inter-frame difference threshold of the workflow video of each visual angle and setting t = 2; steps (1.3)-(1.9) are performed on the workflow video of each visual angle respectively;
(1.3) reading a video frame t and two adjacent frames t-1 and t +1 thereof, and carrying out graying and median filtering processing on the three video frames;
(1.4) performing interframe difference operation on the first two frames and the second two frames respectively to obtain two interframe difference images;
(1.5) dynamically updating the interframe difference threshold according to the two interframe difference images obtained in the step (1.4), wherein the updating method comprises the following steps:
(1.5.1) setting l = 1; the inter-frame difference threshold of frame t is initialized as τ_1^t = (max{d_k} + min{d_k}) / 2, where d_k is the pixel value of the k-th pixel in the inter-frame difference image, max{d_k} is the maximum pixel value in the difference image, and min{d_k} is the minimum pixel value in the difference image;
(1.5.2) letting τ_{l+1}^t = (1/2) · [ (1/N_1) · Σ_{d_k ≥ τ_l^t} d_k + (1/N_2) · Σ_{d_k < τ_l^t} d_k ], where N_1 and N_2 are the total numbers of pixels with d_k ≥ τ_l^t and with d_k < τ_l^t, respectively;
(1.5.3) if τ_{l+1}^t = τ_l^t, assigning τ_{l+1}^t to τ_1^t as the final threshold of frame t; otherwise, letting l = l + 1 and repeating step (1.5.2);
(1.6) carrying out binarization processing on the current frame (namely the middle frame) according to the inter-frame differential threshold obtained in the step (1.5), wherein pixel points larger than the inter-frame differential threshold are set as 1, and pixel points smaller than the inter-frame differential threshold are set as 0;
(1.7) performing an AND operation on the two inter-frame difference images to obtain a three-frame difference image, and obtaining the center coordinates of the interest points by a blob extraction method (see the sketch after step (1.9));
(1.8) segmenting the extracted interest points from the original image of the current frame;
(1.9) incrementing t by 1 and repeating steps (1.3)-(1.8) until t equals the index of the last frame of the workflow video minus 1, keeping the segmentation size of step (1.8) unchanged throughout; storing the interest-point images obtained in step (1.8) in each iteration, in order, as interest-point videos, and classifying the interest-point videos according to the classification rules of the data set;
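Steps (1.7)-(1.8) could be realized roughly as below; cv2.connectedComponentsWithStats is used here as one possible blob-extraction routine, and the fixed 64 × 64 crop size is an assumption of the sketch, not a value given in the patent.

```python
import cv2
import numpy as np

def crop_interest_point(frame, mask, crop=(64, 64)):
    """Locate the largest blob in the binary three-frame difference mask and cut a
    fixed-size patch around its centroid from the original frame."""
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask, connectivity=8)
    if n < 2:                                               # only background found
        return None
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])    # skip background label 0
    cx, cy = centroids[largest].astype(int)
    h, w = crop
    x0 = int(np.clip(cx - w // 2, 0, frame.shape[1] - w))
    y0 = int(np.clip(cy - h // 2, 0, frame.shape[0] - h))
    return frame[y0:y0 + h, x0:x0 + w]
```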
(2) Behavior identification based on a multi-view three-dimensional convolutional neural network: an inspection of current manufacturing production lines shows that the same working scene is often monitored synchronously and in real time by multiple cameras from different angles, so as to guarantee product quality and worker safety. Exploiting this characteristic, multi-view feature extraction and fusion effectively reduces the influence of the complex factory environment on behavior recognition and improves recognition accuracy. The specific steps are as follows:
(2.1) selecting 90% of the interest point videos obtained in the step (1) as a training set, and taking the rest as a test set;
(2.2) constructing a multi-view three-dimensional convolutional neural network (see FIG. 1) and initializing the number of training rounds to 5000; the network is constructed as follows (an illustrative code sketch follows after step (2.6)):
the operation processes of convolution and pooling are (2.2.1) - (2.2.4):
(2.2.1) initializing a four-dimensional convolution kernel of size 9 x 10 for the first convolution layer, the activation function being sigmoid, the first pooling layer window size being 2, and the step size being 2;
(2.2.2) initializing a four-dimensional convolution kernel of size 9 x 7 x 30 for the second convolutional layer, with an activation function of sigmoid, a second pooling layer window size of 2, and a step size of 2;
(2.2.3) initializing a four-dimensional convolution kernel of size 9 x 8 x 5 x 50 for the third convolutional layer, with an activation function of sigmoid, a third pooling layer window size of 2, and a step size of 2;
(2.2.4) initializing a four-dimensional convolution kernel with a size of 4 x 3 x 150 for the fourth convolution layer, with an activation function of sigmoid, a fourth pooling layer window size of 2, and a step size of 2;
(2.2.5) initializing each feature map weight parameter α_{t1} in the weighted-average view pooling layer to a random value in [0,1], with Σ_{t1} α_{t1} = 1; the weighted-average view pooling layer (WAVP) computes
a = Σ_{t1} [ exp(α_{t1}) / Σ_{t1′} exp(α_{t1′}) ] · p_{t1},
with the symbols defined in the concept definitions above;
(2.2.6) initializing a convolution kernel of 3000 × 1500 and 1500 × 750 for the first two fully connected layers respectively, and setting the activation function to Relu; inputting the weighted average characteristic graph after the weighted average view pooling operation into the front two fully-connected layers;
(2.2.7) initializing a 750 × 14 convolution kernel for the last fully connected layer and setting the Softmax classification function, where 14 is the number of action classes.
(2.3) randomly selecting 20 videos from the training set of the workflow videos of all the visual angles, inputting the 20 videos into the multi-view three-dimensional convolution neural network in the step (2.2) for feature training, and outputting training errors;
(2.4) randomly selecting 10 videos from the training set of the workflow videos of each visual angle, inputting the 10 videos into the multi-view three-dimensional convolutional neural network for verification, and obtaining the accuracy of classification and identification of the multi-view three-dimensional convolutional neural network;
(2.5) repeating the steps (2.3) - (2.4), and subtracting 1 from the number of training rounds each time until the number of training rounds is 0 to obtain a trained multi-view three-dimensional convolutional neural network;
(2.6) testing the multi-view three-dimensional convolutional neural network of step (2.5) using the test set corresponding to the workflow video of each visual angle;
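A possible PyTorch sketch of the multi-view network of FIG. 1 is given below. The filter counts (10, 30, 50, 150), the fully connected sizes (1500, 750, 14) and the sigmoid/ReLU/softmax placement follow (2.2.1)-(2.2.7); the exact 3D kernel shapes, which the text only partly specifies, and the flattening before the first fully connected layer are assumptions of the sketch, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, k):
    """One Conv3d + sigmoid + 2x2x2 max-pooling stage, as in (2.2.1)-(2.2.4)."""
    return nn.Sequential(
        nn.Conv3d(cin, cout, kernel_size=k, padding=tuple(s // 2 for s in k)),
        nn.Sigmoid(),
        nn.MaxPool3d(kernel_size=2, stride=2),
    )

class ViewBranch3DCNN(nn.Module):
    """Per-view 3D-CNN branch; kernel shapes are illustrative."""
    def __init__(self, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(in_channels, 10, (9, 7, 7)),
            conv_block(10, 30, (9, 7, 7)),
            conv_block(30, 50, (9, 8, 5)),
            conv_block(50, 150, (4, 3, 3)),
        )

    def forward(self, x):                  # x: (batch, channels, frames, H, W)
        return self.features(x).flatten(1)

class MultiView3DCNN(nn.Module):
    """One branch per view, weighted-average view pooling over branch outputs,
    then fully connected layers 1500 -> 750 -> 14 (softmax applied by the loss)."""
    def __init__(self, num_views=3, num_classes=14):
        super().__init__()
        self.branches = nn.ModuleList(ViewBranch3DCNN() for _ in range(num_views))
        self.alpha = nn.Parameter(torch.rand(num_views))     # learnable view weights
        self.classifier = nn.Sequential(
            nn.LazyLinear(1500), nn.ReLU(),                  # input size inferred at first call
            nn.Linear(1500, 750), nn.ReLU(),
            nn.Linear(750, num_classes),
        )

    def forward(self, views):              # views: list of per-view clips, one tensor each
        feats = torch.stack([b(v) for b, v in zip(self.branches, views)])
        w = torch.softmax(self.alpha, dim=0)                 # weighted-average view pooling
        fused = (w[:, None, None] * feats).sum(dim=0)
        return self.classifier(fused)
```

Training this network with a cross-entropy loss over the 14 action classes reproduces the softmax classification of (2.2.7); the 20-video training batches and 10-video validation batches per round correspond to steps (2.3)-(2.4).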
(3) State-based action division method: in a real environment, worker actions usually occur continuously; to recognize them, the actions must first be divided so that each action can be recognized separately. It is observed that a displacement occurs between the operations of taking a part, handling the part and placing the part, and between taking the welding tool and welding the part (see FIG. 2). Therefore, the actions can be divided according to the motion state of the worker. The specific steps are as follows:
(3.1) acquiring the resolution and the frame number of the newly input video, and initializing an interframe difference threshold; setting t to be 2;
(3.2) extracting the center coordinates of the interest points in two adjacent frames according to steps (1.3) to (1.7), and calculating the distance between the two center coordinates; if the distance is greater than a manually set threshold T, marking the frame as a motion state S1; otherwise, marking it as a relatively static state S0;
(3.3) incrementing t by 1 and repeating step (3.2) until t equals the index of the last frame of the newly input video minus 1; counting the numbers of consecutive S0 and S1 states; when the number of consecutive S0 or S1 states is greater than or equal to N, where N > 10, segmenting the target interest points in the corresponding frames by the method of (1.8) and storing them in a frame queue; otherwise, discarding the frames corresponding to those consecutive S0 or S1 states.
(3.4) for each set of frames in the frame queue corresponding to consecutive S0 or S1 states, extracting consecutive key frames starting from the i-th frame, where i > 5, so that the number of key frames is the same as the number of frames of each classified video segment in the data set.
(3.5) inputting the video formed by the key frames in the step (3.4) in sequence into the trained multi-view three-dimensional convolution neural network in the step (2.5) to classify and recognize the staff behaviors;
(3.6) comparing the behavior categories obtained in (3.5) with a predefined standard workflow.
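Step (3.6) is described only as a comparison with a predefined standard workflow; one simple way to perform it, shown here purely as an assumption rather than as the patent's method, is to align the recognized action sequence with the expected sequence position by position and report deviations.

```python
def check_against_standard(recognized, standard):
    """Compare a recognized action sequence with the predefined standard workflow and
    return a list of (position, expected, observed) deviations."""
    deviations = []
    for i, expected in enumerate(standard):
        observed = recognized[i] if i < len(recognized) else None
        if observed != expected:
            deviations.append((i, expected, observed))
    for i in range(len(standard), len(recognized)):          # extra, unexpected actions
        deviations.append((i, None, recognized[i]))
    return deviations

# usage with illustrative action labels
standard = ["take_part", "handle_part", "place_part", "take_welder", "weld_part"]
print(check_against_standard(["take_part", "handle_part", "weld_part"], standard))
```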

Claims (1)

1. A workflow identification method based on a three-dimensional convolutional neural network, characterized by comprising the following specific steps:
step (1), exporting a workflow video containing multiple visual angles from a data set, and acquiring the video resolution and the frame number of the workflow video at each visual angle;
step (2), initializing the inter-frame difference threshold of the workflow video of each visual angle; performing steps (3) to (11) on the workflow video of each visual angle respectively;
step (3), setting t to be 2;
step (4), reading three consecutive video frames t-1, t and t+1, and performing graying and median filtering on the three frames;
step (5), performing an inter-frame difference operation on the first pair of frames (t-1, t) and the second pair of frames (t, t+1) respectively to obtain two inter-frame difference images;
step (6), dynamically updating an interframe difference threshold according to the two interframe difference images obtained in the step (5); the method for dynamically updating the interframe difference threshold comprises the following steps:
6.1 setting l = 1; the inter-frame difference threshold of frame t is initialized as τ_1^t = (max{d_k} + min{d_k}) / 2, where d_k is the pixel value of the k-th pixel in the inter-frame difference image, max{d_k} is the maximum pixel value in the difference image, and min{d_k} is the minimum pixel value in the difference image;
6.2 letting τ_{l+1}^t = (1/2) · [ (1/N_1) · Σ_{d_k ≥ τ_l^t} d_k + (1/N_2) · Σ_{d_k < τ_l^t} d_k ], where N_1 and N_2 are the total numbers of pixels with d_k ≥ τ_l^t and with d_k < τ_l^t, respectively;
6.3 if τ_{l+1}^t = τ_l^t, assigning τ_{l+1}^t to τ_1^t as the final threshold of frame t; otherwise, letting l = l + 1 and repeating step 6.2;
step (7), performing binarization on the current frame according to the inter-frame difference threshold obtained in step (6): pixels larger than the threshold are set to 1, and pixels smaller than the threshold are set to 0;
step (8), performing an AND operation on the two inter-frame difference images to obtain a three-frame difference image, and obtaining the center coordinates of the interest points by a blob extraction method;
step (9), segmenting the extracted interest points from the original image of the current frame;
step (10), incrementing t by 1 and repeating steps (4) to (9) until t equals the index of the last frame of the workflow video minus 1, keeping the segmentation size of step (9) unchanged throughout; storing the interest-point images obtained in step (9) in each iteration, in order, as interest-point videos, and classifying the interest-point videos according to the classification rules of the data set;
step (11), randomly selecting 90% of the interest point videos obtained in the step (10) as a training set, and taking the rest as a test set;
step (12), constructing a multi-view three-dimensional convolution neural network, and initializing the number of training rounds to be 5000; the multi-view three-dimensional convolution neural network construction method comprises the following steps:
12.1 convolution and pooling operations are as follows:
initializing a four-dimensional convolution kernel of size 9 × 10 for the first convolution layer, with a sigmoid activation function; the first pooling layer has a window size of 2 and a stride of 2;
initializing a four-dimensional convolution kernel of size 9 × 7 × 30 for the second convolution layer, with a sigmoid activation function; the second pooling layer has a window size of 2 and a stride of 2;
initializing a four-dimensional convolution kernel of size 9 × 8 × 5 × 50 for the third convolution layer, with a sigmoid activation function; the third pooling layer has a window size of 2 and a stride of 2;
initializing a four-dimensional convolution kernel of size 4 × 3 × 150 for the fourth convolution layer, with a sigmoid activation function; the fourth pooling layer has a window size of 2 and a stride of 2;
12.2 initializing each feature map weight parameter α_{t1} in the weighted-average view pooling layer to a random value in [0,1], with Σ_{t1} α_{t1} = 1; the weighted-average view pooling operation in the weighted-average view pooling layer is
a = Σ_{t1} [ exp(α_{t1}) / Σ_{t1′} exp(α_{t1′}) ] · p_{t1},
where a is the weighted-average feature map after the weighted-average view pooling operation, t1 is the serial number of a pooled feature map after the convolution and pooling operations, α_{t1} is the weight of the pooled feature map with serial number t1, exp denotes the exponential function with base e, and p_{t1} is the pooled feature map with serial number t1;
12.3, respectively initializing a convolution kernel of 3000 × 1500 and 1500 × 750 for the first two fully-connected layers, and setting an activation function as Relu; inputting the weighted average characteristic graph after the weighted average view pooling operation into the front two fully-connected layers;
12.4, initializing a 750 × 14 convolution kernel for the last fully-connected layer and setting a Softmax classification function;
step (13), randomly selecting 20 videos from the training set corresponding to the workflow videos of each visual angle, inputting the videos into the multi-view three-dimensional convolution neural network in the step (12) for feature training, and outputting training errors;
step (14), randomly selecting 10 videos from a training set corresponding to the workflow videos of each visual angle, inputting the 10 videos into a multi-view three-dimensional convolutional neural network for verification, and obtaining the accuracy of classification and identification of the multi-view three-dimensional convolutional neural network;
step (15), repeating the steps (13) to (14), and subtracting 1 from the number of training rounds each time until the number of training rounds is 0 to obtain a trained multi-view three-dimensional convolution neural network;
step (16), testing the multi-view three-dimensional convolution neural network in the step (15) by using a test set corresponding to the workflow video of each visual angle;
step (17), acquiring the resolution and the frame number of the newly input workflow video, and initializing an interframe difference threshold; setting t to be 2;
step (18), extracting the center coordinates of the interest points in two adjacent frames according to steps (4) to (8), and calculating the distance between the two center coordinates; if the distance is greater than a set threshold T, marking the frame as a motion state S1; otherwise, marking it as a relatively static state S0;
step (19), incrementing t by 1 and repeating step (18) until t equals the index of the last frame of the newly input workflow video minus 1; counting the numbers of consecutive S0 and S1 states; when the number of consecutive S0 or S1 states is greater than or equal to N, where N > 10, segmenting the target interest points in the frames corresponding to those consecutive S0 or S1 states and storing them in a frame queue; otherwise, discarding the frames corresponding to those consecutive S0 or S1 states;
step (20), for each set of frames in the frame queue corresponding to consecutive S0 or S1 states, extracting consecutive key frames starting from the i-th frame, where i > 5, so that the number of key frames is the same as the number of frames of each classified video segment in the data set;
step (21), inputting the videos formed by the key frames in the step (20) according to the sequence into the multi-view three-dimensional convolution neural network trained in the step (15) to classify and recognize the staff behaviors;
and (22) comparing the behavior type obtained in the step (21) with a predefined standard workflow.
CN201710335309.XA 2017-05-12 2017-05-12 Workflow identification method based on three-dimensional convolutional neural network Active CN107194559B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710335309.XA CN107194559B (en) 2017-05-12 2017-05-12 Workflow identification method based on three-dimensional convolutional neural network


Publications (2)

Publication Number Publication Date
CN107194559A CN107194559A (en) 2017-09-22
CN107194559B true CN107194559B (en) 2020-06-05

Family

ID=59873285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710335309.XA Active CN107194559B (en) 2017-05-12 2017-05-12 Workflow identification method based on three-dimensional convolutional neural network

Country Status (1)

Country Link
CN (1) CN107194559B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10032136B1 (en) * 2012-07-30 2018-07-24 Verint Americas Inc. System and method of scheduling work within a workflow with defined process goals
CN107798297B (en) * 2017-09-28 2021-03-23 成都大熊智能科技有限责任公司 Method for automatically extracting stable frame based on inter-frame difference
CN107766292B (en) * 2017-10-30 2020-12-29 中国科学院计算技术研究所 Neural network processing method and processing system
CN108875931B (en) * 2017-12-06 2022-06-21 北京旷视科技有限公司 Neural network training and image processing method, device and system
CN108010538B (en) * 2017-12-22 2021-08-24 北京奇虎科技有限公司 Audio data processing method and device and computing equipment
CN108447048B (en) * 2018-02-23 2021-09-14 天津大学 Convolutional neural network image feature processing method based on attention layer
CN108235003B (en) * 2018-03-19 2020-03-06 天津大学 Three-dimensional video quality evaluation method based on 3D convolutional neural network
CN108681690B (en) * 2018-04-04 2021-09-03 浙江大学 Assembly line personnel standard operation detection system based on deep learning
CN109065165B (en) * 2018-07-25 2021-08-17 东北大学 Chronic obstructive pulmonary disease prediction method based on reconstructed airway tree image
CN109068174B (en) * 2018-09-12 2019-12-27 上海交通大学 Video frame rate up-conversion method and system based on cyclic convolution neural network
CN110969217B (en) * 2018-09-28 2023-11-17 杭州海康威视数字技术股份有限公司 Method and device for image processing based on convolutional neural network
CN109145874B (en) * 2018-09-28 2023-07-04 大连民族大学 Application of measuring difference between continuous frames of video and convolution characteristic diagram in obstacle detection of vision sensing part of autonomous automobile
CN109409294B (en) * 2018-10-29 2021-06-22 南京邮电大学 Object motion trajectory-based classification method and system for ball-stopping events
CN109635843B (en) * 2018-11-14 2021-06-18 浙江工业大学 Three-dimensional object model classification method based on multi-view images
CN109711454B (en) * 2018-12-21 2020-07-31 电子科技大学 Feature matching method based on convolutional neural network
CN110704653A (en) * 2019-09-09 2020-01-17 上海慧之建建设顾问有限公司 Method for searching component by graph in BIM (building information modeling) model and graph-text searching system
CN111160410B (en) * 2019-12-11 2023-08-08 北京京东乾石科技有限公司 Object detection method and device
CN111144262B (en) * 2019-12-20 2023-05-16 北京容联易通信息技术有限公司 Process anomaly detection method based on monitoring video
CN111310801B (en) * 2020-01-20 2024-02-02 桂林航天工业学院 Mixed dimension flow classification method and system based on convolutional neural network
CN112116195B (en) * 2020-07-21 2024-04-16 蓝卓数字科技有限公司 Railway beam production procedure identification method based on example segmentation
CN112016409A (en) * 2020-08-11 2020-12-01 艾普工华科技(武汉)有限公司 Deep learning-based process step specification visual identification determination method and system
CN114299128A (en) * 2021-12-30 2022-04-08 咪咕视讯科技有限公司 Multi-view positioning detection method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104217214A (en) * 2014-08-21 2014-12-17 广东顺德中山大学卡内基梅隆大学国际联合研究院 Configurable convolutional neural network based red green blue-distance (RGB-D) figure behavior identification method
WO2017031088A1 (en) * 2015-08-15 2017-02-23 Salesforce.Com, Inc Three-dimensional (3d) convolution with 3d batch normalization
CN106203283A (en) * 2016-06-30 2016-12-07 重庆理工大学 Based on Three dimensional convolution deep neural network and the action identification method of deep video
CN106407903A (en) * 2016-08-31 2017-02-15 四川瞳知科技有限公司 Multiple dimensioned convolution neural network-based real time human body abnormal behavior identification method

Also Published As

Publication number Publication date
CN107194559A (en) 2017-09-22

Similar Documents

Publication Publication Date Title
CN107194559B (en) Workflow identification method based on three-dimensional convolutional neural network
Santosh et al. Tracking multiple moving objects using gaussian mixture model
CN111126115B (en) Violent sorting behavior identification method and device
CN110298297A (en) Flame identification method and device
CN109460719A (en) A kind of electric operating safety recognizing method
CN110929593A (en) Real-time significance pedestrian detection method based on detail distinguishing and distinguishing
Chetverikov et al. Dynamic texture as foreground and background
CN110751097B (en) Semi-supervised three-dimensional point cloud gesture key point detection method
CN108734109B (en) Visual target tracking method and system for image sequence
CN113449606B (en) Target object identification method and device, computer equipment and storage medium
Zhao et al. Background subtraction based on deep pixel distribution learning
CN111886600A (en) Device and method for instance level segmentation of image
CN106023249A (en) Moving object detection method based on local binary similarity pattern
Gaba et al. Motion detection, tracking and classification for automated Video Surveillance
CN108345835B (en) Target identification method based on compound eye imitation perception
Abdullah et al. Objects detection and tracking using fast principle component purist and kalman filter.
Ali et al. Deep Learning Algorithms for Human Fighting Action Recognition.
KR101690050B1 (en) Intelligent video security system
Nosheen et al. Efficient Vehicle Detection and Tracking using Blob Detection and Kernelized Filter
CN113936034A (en) Apparent motion combined weak and small moving object detection method combined with interframe light stream
Arif et al. People counting in extremely dense crowd using blob size optimization
CN110111358B (en) Target tracking method based on multilayer time sequence filtering
CN112487926A (en) Scenic spot feeding behavior identification method based on space-time diagram convolutional network
CN114821441A (en) Deep learning-based airport scene moving target identification method combined with ADS-B information
Sawalakhe et al. Foreground background traffic scene modeling for object motion detection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220831

Address after: Room 405, 6-8 Jiaogong Road, Xihu District, Hangzhou City, Zhejiang Province, 310013

Patentee after: Hangzhou Taoyi Data Technology Co.,Ltd.

Address before: 310018 No. 2 street, Xiasha Higher Education Zone, Hangzhou, Zhejiang

Patentee before: HANGZHOU DIANZI University

TR01 Transfer of patent right