CN114973397B - Real-time process detection system, method and storage medium - Google Patents


Info

Publication number: CN114973397B
Authority: CN (China)
Prior art keywords: skeleton, module, depth, real, stream
Legal status: Active
Application number: CN202210279966.8A
Original language: Chinese (zh)
Other versions: CN114973397A
Inventors: 胡丹峰, 花蕾, 徐杰, 陆武民
Assignee (current and original): Suzhou University
Application filed by Suzhou University; priority to CN202210279966.8A
Publication of application CN114973397A; grant published as CN114973397B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning or modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06395 Quality analysis or management


Abstract

The invention discloses a real-time process detection system, a real-time process detection method and a storage medium. The invention uses a depth camera as the acquisition module to capture the input video stream; processes the video stream with a lightweight OpenPose network to obtain a 3D skeleton sequence; preprocesses the 3D skeleton sequence; and feeds the result into a DGNN network to complete the final process detection task. Actual tests show that the system can accurately judge whether operators follow the prescribed process, offers good real-time performance and high accuracy, and can effectively address process violations in the factory quality inspection stage.

Description

Real-time process detection system, method and storage medium
Technical Field
The invention belongs to the technical field of action recognition, and particularly relates to a real-time procedure detection system, a real-time procedure detection method and a storage medium.
Background
Human motion characteristics are an important carrier of information, are widely applied across many fields, and are a major research hotspot in computer vision. Common representations of human motion include images, optical flow, and human skeletons. Among them, the human skeleton describes human motion with a small data volume, high robustness, and strong feature expression ability, and is therefore widely favored.
At present, most mainstream action recognition methods are based on skeleton features and mainly use convolutional neural networks (CNNs), recurrent neural networks (RNNs), graph convolutional networks (GCNs), and the like.
For example, Aubry et al. and Wang et al. propose encoding rules that convert a 3D skeleton sequence into a 2D image and use the strong classification capability of CNNs to perform action recognition; however, such encodings ignore the spatial relations between the skeleton points of the human body and the temporal continuity of motion, so the final recognition performance is mediocre. (Sophie Aubry, Sohaib Laraba, Joëlle Tilmanne, Thierry Dutoit. Action recognition based on 2D skeletons extracted from RGB videos [J]. MATEC Web of Conferences, 2019, 277.; Pichao Wang, Wanqing Li, Chuankun Li, Yonghong Hou. Action recognition based on joint trajectory maps with convolutional neural networks [J]. Knowledge-Based Systems, 2018, 158.)
In addition, Yan et al. process 3D skeleton sequence data directly with a GCN for action recognition; compared with encoding approaches, this focuses more on expressing the relations between skeleton points in the spatial and temporal domains, and it achieves good recognition results on mainstream datasets. The advantages of GCNs in handling skeleton features have led many researchers to prefer them for action recognition. (Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In AAAI, 2018.; L. Shi, Y. Zhang, J. Cheng and H. Lu, "Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 12018-12027, doi: 10.1109/CVPR.2019.01230.; L. Shi, Y. Zhang, J. Cheng and H. Lu, "Skeleton-Based Action Recognition With Directed Graph Neural Networks," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7904-7913, doi: 10.1109/CVPR.2019.00810.)
However, action recognition research has leaned toward theory; practical applications are comparatively rare, with few reports or records in news, journals, patents, or papers.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a real-time process detection system that is guided by practical application and aims to solve the process violations present in the factory quality inspection stage.
In order to solve the technical problems and achieve the technical effects, the invention is realized by the following technical scheme:
the real-time process detection system mainly comprises a video acquisition module, a skeleton extraction module, a preprocessing module, a process segmentation module, an action recognition module and a display module;
The video acquisition module, the skeleton extraction module, the preprocessing module, the procedure segmentation module, the action recognition module and the display module are connected in sequence, wherein,
The video acquisition module is responsible for acquiring an input video Stream, namely accurately acquiring RGB-Stream and Depth-Stream, and inputting the RGB-Stream and the Depth-Stream into the skeleton extraction module;
The skeleton extraction module comprises a light openpose human body posture detection network and a depth alignment processing unit, wherein,
The lightweight openpose human body posture detection network adopts a bottom-up detection approach: it detects the joints of every person in the frame and then matches the detected joints to one another to obtain each individual human posture, thereby extracting a highly robust 2D skeleton sequence from an RGB-Stream with a complex background;
the Depth alignment processing unit is responsible for converting the RGB-Stream and Depth-Stream from their different coordinate systems into a common coordinate system, attaching depth information to the 2D image and producing an aligned depth image;
The preprocessing module is in charge of receiving the 2D skeleton sequence and the aligned depth image input by the skeleton extraction module at the same time, completing the conversion operation from the 2D skeleton sequence to the 3D skeleton sequence, and performing a series of preprocessing operations on the converted 3D skeleton sequence, so that a relatively stable and complete 3D skeleton sequence is obtained;
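The 2D-to-3D conversion in the preprocessing module can be sketched as a pinhole back-projection using the aligned depth image. This is a minimal illustration only: the function name and the intrinsic parameters fx, fy, cx, cy below are hypothetical placeholders, not values from the patent.

```python
import numpy as np

def backproject_to_3d(keypoints_2d, depth_map, fx, fy, cx, cy):
    """Lift 2D skeleton keypoints to 3D camera coordinates using an
    aligned depth map and a pinhole camera model (illustrative sketch)."""
    points_3d = []
    for (u, v) in keypoints_2d:
        z = depth_map[int(v), int(u)]   # depth (metres) at the keypoint pixel
        x = (u - cx) * z / fx           # back-project along X
        y = (v - cy) * z / fy           # back-project along Y
        points_3d.append((x, y, z))
    return np.array(points_3d)

# toy example: one keypoint at the principal point, 2 m from the camera
depth = np.full((480, 640), 2.0)
pts = backproject_to_3d([(320, 240)], depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

Applying this per frame to every joint of the 2D skeleton yields the 3D skeleton sequence that the later preprocessing steps operate on.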
The procedure segmentation module is responsible for segmenting time points from the starting action to the ending action of the procedure, and only inputting the 3D skeleton sequence in the time interval to the subsequent action recognition module, wherein the 3D skeleton sequence which does not belong to the time interval is not subjected to subsequent processing, so that the function of real-time segmentation of the 3D skeleton sequence is realized;
the action recognition module is composed of DGNN and is responsible for recognizing actions corresponding to the 3D skeleton sequences input by the procedure segmentation module;
The display module encapsulates the system as a whole and is mainly responsible for real-time video stream display, skeleton data display, display of the ongoing detection process, and recording and storage of detection results, for the convenience of users.
Furthermore, the video acquisition module is an Intel RealSense D455 depth camera. The camera uses structured-light technology, has a maximum depth range of 0.6-6 m, and a depth error below 2% within 4 m; its RGB-FOV is 90° x 65° and its Depth-FOV is 86° x 57°; the RGB sensor reaches 1280x800@30fps and the depth sensor reaches 1280x720@30fps, enabling accurate acquisition of the RGB-Stream and Depth-Stream.
Further, the network structure of the lightweight openpose human body posture detection network can be divided into three parts: the backbone, the INITIAL STAGE, and the REFINEMENT STAGE. The backbone adopts the MobileNetV1 structure and is responsible for extracting the feature map; the INITIAL STAGE and REFINEMENT STAGE have similar structures and are responsible for extracting the joint-point heat map S_t and the joint affinity vector L_t, where S_t encodes the joint-point information of the human body and L_t encodes the association information between joint points; finally, the joints belonging to the same person are connected by a greedy parsing algorithm to form a complete human skeleton.
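Before greedy parsing can match joints, candidate joint locations must be picked from each heat map S_t. A minimal sketch of that peak-picking step is below; the threshold value and 4-neighbourhood test are illustrative assumptions, not the patent's exact procedure.

```python
import numpy as np

def heatmap_peaks(heatmap, threshold=0.3):
    """Extract joint candidates from one joint heat map: keep pixels above
    threshold that are also local maxima in their 4-neighbourhood
    (simplified sketch of the step preceding greedy parsing)."""
    h, w = heatmap.shape
    peaks = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v = heatmap[y, x]
            if v > threshold and v >= heatmap[y - 1, x] and v >= heatmap[y + 1, x] \
                    and v >= heatmap[y, x - 1] and v >= heatmap[y, x + 1]:
                peaks.append((x, y, float(v)))
    return peaks

hm = np.zeros((28, 28))
hm[10, 12] = 0.9          # one synthetic joint response
peaks = heatmap_peaks(hm)  # one candidate at pixel (12, 10)
```

The greedy parsing algorithm then scores candidate joint pairs with the affinity vectors L_t and links the highest-scoring pairs into per-person skeletons.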
The invention provides the following two structural optimization points to solve the problem that the recognition accuracy of the lightweight openpose human body posture detection network is low:
(1) Replace the MobileNetV1 structure used by the backbone of the lightweight openpose human body posture detection network with VGG19, forming a VGG-backbone, to enhance feature expression capability and relieve the limitation on the learning capability of the subsequent network.
(2) Introduce dense connections across the REFINEMENT STAGE structures. The whole structure comprises 1 INITIAL STAGE and 5 REFINEMENT STAGEs; the input of each REFINEMENT STAGE is obtained by concatenating the outputs of all preceding stages (including the INITIAL STAGE) with the feature map extracted by the backbone, which improves feature reuse, strengthens feature propagation between layers, and alleviates the vanishing-gradient problem.
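The dense connectivity pattern of optimization point (2) can be sketched with plain arrays: each stage receives the backbone feature map concatenated channel-wise with all earlier stage outputs. The stage functions and tensor shapes below are toy stand-ins, not the real convolutional stages.

```python
import numpy as np

def run_stages(backbone_feat, stage_fns):
    """Dense connectivity across stages: each stage sees the backbone
    feature map concatenated (channel axis) with the outputs of ALL
    earlier stages, including the initial stage."""
    outputs = []
    for fn in stage_fns:
        x = np.concatenate([backbone_feat] + outputs, axis=0)  # channel-wise concat
        outputs.append(fn(x))
    return outputs[-1]

# toy stage: averages its input channels down to a fixed 2-channel map
toy_stage = lambda x: x.mean(axis=0, keepdims=True).repeat(2, axis=0)
feat = np.ones((8, 28, 28))             # stand-in backbone feature map
out = run_stages(feat, [toy_stage] * 6)  # 1 initial stage + 5 refinement stages
```

In the real network the concatenated tensor feeds a convolutional stage rather than a mean, but the input-wiring is the same: later stages see every earlier representation, which is what improves feature reuse and gradient flow.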
Furthermore, because the 3D skeleton sequence obtained from the lightweight openpose human body posture detection network and the depth camera is affected by occlusion, sensor noise, and other factors, skeleton points suffer from missing values, abnormal depth values, and instability. The preprocessing module of the invention therefore applies skeleton-point completion, joint depth correction, Kalman filtering, and normalization to the input 3D skeleton sequence, yielding a relatively stable and complete 3D skeleton sequence.
The skeleton-point completion method takes the current frame as the starting point, searches up to 30 preceding frames, and uses the coordinate of the missing skeleton point from the frame closest in time in which that point is present as its coordinate in the current frame. This method only handles skeleton points that are missing for part of the time; it cannot recover skeleton points that remain missing throughout. A depth-correction step is therefore added, which gives a reasonable prediction for persistently missing skeleton points based on prior knowledge of the human skeleton; since persistently missing points are mostly leg skeleton points, their coordinates are predicted from the bone-length proportions of the human body. Skeleton points that have undergone completion and depth correction still fluctuate considerably over time, so the invention introduces Kalman filtering to process the skeleton sequence in real time and improve its stability; finally, the data are normalized to obtain the final preprocessed skeleton sequence.
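The completion and smoothing steps above can be sketched as follows. This is a minimal illustration under stated assumptions: a missing joint is encoded as None, and the Kalman noise settings q and r are hypothetical, not values from the patent.

```python
import numpy as np

def fill_missing_joint(history, window=30):
    """Skeleton-point completion: from the current frame, look back up to
    `window` earlier frames and reuse the most recent valid coordinate of
    the missing joint (None marks a missing detection)."""
    for coord in reversed(history[-window:]):
        if coord is not None:
            return coord
    return None  # missing for the whole window -> fall back to depth correction

class Kalman1D:
    """Minimal constant-position Kalman filter for one joint coordinate,
    shown only to illustrate the smoothing step (q, r are assumptions)."""
    def __init__(self, q=1e-3, r=1e-2):
        self.x, self.p, self.q, self.r = 0.0, 1.0, q, r
    def update(self, z):
        self.p += self.q                # predict: grow uncertainty
        k = self.p / (self.p + self.r)  # Kalman gain
        self.x += k * (z - self.x)      # correct with measurement z
        self.p *= (1 - k)
        return self.x

frames = [(0.5, 0.4), None, None]       # joint lost in the last two frames
filled = fill_missing_joint(frames)     # reuses the last valid coordinate

kf = Kalman1D()
smoothed = [kf.update(z) for z in (1.0, 1.0, 1.0)]  # converges toward 1.0
```

A full implementation would run one such filter per joint coordinate (x, y, z) and tune q and r against the sensor's actual noise.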
Furthermore, to accurately record the timing of each process, its start and end points must be determined, for which the invention provides the following process segmentation scheme. A group of state vectors is defined, comprising a state vector describing how straight the current arm is and a state vector describing the current arm's direction. In actual detection, the operator picking up the tool and putting down the tool are defined as the start and end points of a process, respectively. When the tool is picked up, the operator's arm is approximately straight and points roughly forward; at that moment the straightness state vector lies within threshold t1 and the direction state vector lies within threshold t2 (similarly when the tool is picked up with both hands), giving an accurate description of the arm state. Experimental tests show that the two groups of state vectors defined above describe the pick-up and put-down states well, enabling accurate process segmentation.
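The state-vector test above can be sketched as follows. The exact vector definitions, the "forward" reference direction, and the thresholds t1 and t2 are illustrative assumptions; the patent only states that such vectors and thresholds exist.

```python
import numpy as np

def arm_state(shoulder, elbow, wrist):
    """Compute the two quantities used for process segmentation: how
    straight the arm is (cosine of the angle between upper arm and
    forearm) and the unit direction of the forearm."""
    upper = np.asarray(elbow) - np.asarray(shoulder)
    fore = np.asarray(wrist) - np.asarray(elbow)
    straightness = np.dot(upper, fore) / (np.linalg.norm(upper) * np.linalg.norm(fore))
    direction = fore / np.linalg.norm(fore)
    return straightness, direction

def is_taking_tool(straightness, direction, forward=(0.0, 0.0, 1.0), t1=0.9, t2=0.8):
    """Start-of-process test: arm nearly straight AND pointing roughly
    forward (both thresholds are hypothetical)."""
    return straightness > t1 and np.dot(direction, forward) > t2

# fully extended arm pointing straight ahead -> start of a process
s, d = arm_state((0, 0, 0), (0, 0, 0.3), (0, 0, 0.6))
taking = is_taking_tool(s, d)
```

The end point ("putting the tool down") would use the same state vectors with a second pair of thresholds, and the 3D skeleton frames between the two events form the clip passed to the action recognition module.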
The invention also discloses a real-time process detection method based on skeleton feature extraction, which comprises the following steps:
step 1, accurately acquiring RGB-Stream and Depth-Stream by a video acquisition module, and inputting the RGB-Stream and the Depth-Stream to a skeleton extraction module;
Step 2, the lightweight openpose human body posture detection network in the skeleton extraction module detects the joints of every person in the frame, then matches the detected joints to one another to obtain each individual human posture, and extracts the RGB-Stream with a complex background into a highly robust 2D skeleton sequence;
Step 3, the Depth alignment processing unit in the skeleton extraction module converts the RGB-Stream and Depth-Stream from their different coordinate systems into a common coordinate system and attaches depth information to the 2D image to obtain an aligned depth image;
step 4, the skeleton extraction module inputs the 2D skeleton sequence and the aligned depth image into a preprocessing module, the preprocessing module converts the 2D skeleton sequence into a 3D skeleton sequence, and a series of preprocessing operations are carried out on the converted 3D skeleton sequence to obtain a stable and complete 3D skeleton sequence;
Step 5, the preprocessing module inputs a stable and complete 3D skeleton sequence to the procedure segmentation module, the procedure segmentation module segments the time point from the starting action to the ending action of the procedure, only inputs the 3D skeleton sequence in the time interval to the action recognition module, and does not perform subsequent processing on the 3D skeleton sequence which does not belong to the time interval, so as to segment the 3D skeleton sequence in real time;
Step 6, the action recognition module recognizes actions corresponding to the 3D skeleton sequences input by the procedure segmentation module, and inputs the recognized action information to a display module;
And 7, displaying the video stream, the skeleton data and the on-line inspection procedure in real time by the display module, and recording and storing the detection result.
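Steps 1-7 can be wired together as a minimal pipeline loop. Every callable below is a hypothetical stand-in for the corresponding module (skeleton extraction, preprocessing, process segmentation, DGNN action recognition, display); only the control flow reflects the method.

```python
def detection_pipeline(frames, extract_skeleton, preprocess, segment, recognize, display):
    """Run the real-time detection loop: frames outside a process interval
    are dropped by the segmenter and never reach action recognition."""
    results = []
    for rgb, depth in frames:
        skel2d = extract_skeleton(rgb, depth)         # steps 2-3
        skel3d = preprocess(skel2d, depth)            # step 4
        clip = segment(skel3d)                        # step 5: None outside a process
        if clip is not None:
            results.append(display(recognize(clip)))  # steps 6-7
    return results

# toy run with stand-in modules: only the second frame lies inside a process
frames = [("rgb0", "d0"), ("rgb1", "d1")]
out = detection_pipeline(
    frames,
    extract_skeleton=lambda rgb, d: rgb,
    preprocess=lambda s, d: s,
    segment=lambda s: s if s == "rgb1" else None,
    recognize=lambda s: "process_action",
    display=lambda label: label,
)
```

The early drop in step 5 is what keeps the system real-time: DGNN inference only runs on skeleton clips that belong to a detected process interval.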
Further, in step 4, the preprocessing operation performed by the preprocessing module on the 3D bone sequence includes: complement skeletal points, joint depth correction, kalman filtering, and normalization.
Further, in step 5, the method for performing the process segmentation by the process segmentation module includes: defining a group of state vectors, wherein the state vectors comprise state vectors describing the stretching state of the current arm and state vectors describing the direction information of the current arm; respectively defining two states of taking and putting down the tool by an operator as a starting point and an ending point of a working procedure; when the tool is taken, the arm of the operator is approximately in a straight state and is biased to the right front of the operator, at the moment, the state vector describing the straight state of the current arm is in a threshold t1, and the state vector describing the direction information of the current arm is in a threshold t2, so that the accurate description of the state of the arm is realized, and the accurate process segmentation is realized.
The invention also discloses a computer storage medium, at least one executable instruction is stored in the computer storage medium, and the executable instruction enables a processor to execute the operation corresponding to the real-time procedure detection method based on skeleton feature extraction.
Compared with the prior art, the invention has the beneficial effects that:
The system of the invention uses a depth camera as the acquisition module to capture the input video stream; processes the video stream with a lightweight OpenPose network to obtain a 3D skeleton sequence; preprocesses the 3D skeleton sequence; and feeds the result into a DGNN network to complete the final process detection task. Actual tests show that the system can accurately judge whether operators follow the prescribed process, offers good real-time performance and high accuracy, and can effectively address process violations in the factory quality inspection stage.
The foregoing description is only an overview of the present invention, and is intended to provide a better understanding of the present invention, as it is embodied in the following description, with reference to the preferred embodiments of the present invention and its details set forth in the accompanying drawings. Specific embodiments of the present invention are given in detail by the following examples and the accompanying drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a schematic diagram of the overall structure of a real-time process detection system according to the present invention;
FIG. 2 is a network architecture diagram of the lightweight openpose human body posture detection network of the present invention;
FIG. 3 is a diagram comparing the network truncation results of GhostNet (left panel) and MobileNetV3 (right panel) in Experiment 1 of the present invention;
FIG. 4 is a diagram showing the network structure of ShuffleNetV2 in Experiment 1 of the present invention;
FIG. 5 is an internal structure diagram of the Block module used in REFINEMENT STAGE 1 of the lightweight openpose human body posture detection network of the present invention;
Fig. 6 is a schematic diagram of a structure of REFINEMENT STAGE in the lightweight openpose human body posture detection network of the present invention after dense connections are introduced;
FIG. 7 is a 3D bone sequence preprocessing flow chart of the preprocessing module of the present invention;
FIG. 8 is a graph showing the contrast of the Kalman filter for the right-hand joint point of the present invention;
FIG. 9 is a graph comparing bone data before and after pretreatment in accordance with the present invention;
FIG. 10 is a state vector definition diagram of a process segmentation module according to the present invention;
FIG. 11 is a schematic view of the operator's operational state when taking a tool (left: one hand; right: two hands);
FIG. 12 is a table of data set divisions during real-time process test training of the present invention;
FIG. 13 is a graph showing the test results of the system of the present invention.
Detailed Description
The application will be described in detail below with reference to the drawings in combination with embodiments. The description herein is intended to provide a further understanding of the application; together with the accompanying drawings, it illustrates and explains the application by way of example and not of limitation.
Referring to fig. 1, a real-time process detection system mainly comprises a video acquisition module 1, a skeleton extraction module 2, a preprocessing module 3, a process segmentation module 4, an action recognition module 5 and a display module 6.
The video acquisition module 1, the skeleton extraction module 2, the preprocessing module 3, the procedure segmentation module 4, the action recognition module 5 and the display module 6 are connected in sequence, wherein,
The video acquisition module 1 is responsible for acquiring the input video Stream, i.e., accurately acquiring the RGB-Stream and Depth-Stream, and inputting them into the skeleton extraction module 2. The video acquisition module 1 is an Intel RealSense D455 depth camera; the camera uses structured-light technology, has a maximum depth range of 0.6-6 m, and a depth error below 2% within 4 m; its RGB-FOV is 90° x 65° and its Depth-FOV is 86° x 57°; the RGB sensor reaches 1280x800@30fps and the depth sensor reaches 1280x720@30fps;
The skeleton extraction module 2 comprises a light openpose human body posture detection network and a depth alignment processing unit, wherein,
The lightweight openpose human body posture detection network adopts a bottom-up detection approach: it detects the joints of every person in the frame and then matches the detected joints to one another to obtain each individual human posture, thereby extracting a highly robust 2D skeleton sequence from an RGB-Stream with a complex background;
the Depth alignment processing unit is responsible for converting the RGB-Stream and Depth-Stream from their different coordinate systems into a common coordinate system, attaching depth information to the 2D image and producing an aligned depth image;
The preprocessing module 3 is responsible for simultaneously receiving the 2D skeleton sequence input by the skeleton extraction module 2 and the aligned depth image, completing the conversion operation from the 2D skeleton sequence to the 3D skeleton sequence, and performing a series of preprocessing operations on the converted 3D skeleton sequence so as to obtain a relatively stable and complete 3D skeleton sequence;
The process segmentation module 4 is responsible for segmenting the time point from the starting action to the ending action of the process, and only inputs the 3D skeleton sequence in the time interval to the subsequent action recognition module 5, and the 3D skeleton sequence which does not belong to the time interval is not subjected to subsequent processing, so that the function of segmenting the 3D skeleton sequence in real time is realized;
The action recognition module 5 is composed of DGNN and is responsible for recognizing actions corresponding to the 3D bone sequences input by the procedure segmentation module 4;
The display module 6 encapsulates the system as a whole and is mainly responsible for real-time video stream display, skeleton data display, display of the ongoing detection process, and recording and storage of detection results, for the convenience of users.
Several design details of the system of the present invention are further described below:
1. Design of skeleton extraction module
The skeleton extraction module 2 uses the openpose human body posture detection network, which offers high robustness and a computation time independent of the number of people, but it has an obvious drawback: parameter redundancy. To exploit the network's full performance and meet real-time processing requirements, a high-end GPU is needed, which raises the deployment cost of the system. With a low-end GPU, memory overhead can only be reduced, and processing efficiency improved, by tuning down the network parameters at the expense of network performance, which usually leads to poor recognition results. The Intel team later proposed LIGHTWEIGHT OPENPOSE (the lightweight OpenPose human body posture detection network) on the basis of OpenPose, heavily optimizing the network structure and the pose-processing part; this reduces the parameter count to 15% of the original network, but accuracy also drops considerably, by nearly 6 percentage points.
Referring to fig. 2, the network structure of the whole lightweight openpose human body posture detection network can be divided into three parts: the backbone, the INITIAL STAGE, and the REFINEMENT STAGE 1. The backbone adopts the MobileNetV1 structure and is responsible for extracting the feature map; the INITIAL STAGE and REFINEMENT STAGE have similar structures and are responsible for extracting the joint-point heat map S_t and the joint affinity vector L_t, where S_t encodes the joint-point information of the human body and L_t encodes the association information between joint points; finally, the joints belonging to the same person are connected by a greedy parsing algorithm to form a complete human skeleton.
In order to achieve a better balance between performance and operating efficiency and better fit the present application scenario, the invention proposes a new network optimization strategy for the low recognition accuracy of LIGHTWEIGHT OPENPOSE, containing the following two optimization points:
optimization 1: backbone structural optimization
In LIGHTWEIGHT OPENPOSE, the backbone uses the lighter MobileNetV1, which excels in computation cost and forward inference time; however, limited by its network depth, the features extracted by MobileNetV1 are weakly expressive, which constrains the learning capability of the subsequent network to some extent. The invention therefore used Experiment 1 to search for a network superior to MobileNetV1 to serve as the backbone of LIGHTWEIGHT OPENPOSE.
The networks tested in Experiment 1 mainly comprise lightweight networks such as GhostNet, MobileNetV3 and ShuffleNetV2, plus VGG19. In Experiment 1, VGG19, GhostNet, ShuffleNetV2 and MobileNetV3 were each suitably fine-tuned. The fine-tuning rules are: 1. the feature map output by the network's final convolutional layer must be 28 x 28 to match the input size of the subsequent network; 2. on the premise of satisfying rule 1, the network depth is increased as much as possible while keeping the network's inference time acceptable.
VGG19, as the original backbone structure of OpenPose, is truncated according to the original rule; the other, lightweight networks have an advantage in inference time, so the depth of their truncated networks can be maximized.
GhostNet and MobileNetV3 are similar in structure, so they are fine-tuned similarly: the stride of the convolutional layer whose output feature map is 28×28 in the network's final layer is set to 1, as shown by the two upper rectangular boxes in fig. 3, which enlarges the subsequent feature maps and increases the depth that can be retained; the network is then truncated just before the next downsampling, as shown by the two lower rectangular boxes in fig. 3.
For ShuffleNetV2 as the backbone part of Lightweight OpenPose, the downsampling operations of the Conv1 and MaxPool layers are removed and the network is truncated directly after Stage4, whose output feature map is then 28×28, as shown by the rectangular box in Fig. 4.
Then, a pooling layer and a fully connected layer are appended to each of the four truncated backbone structures, and training tests are run uniformly on the CIFAR-10 and ImageNet-10 datasets; the training results are shown in Tables 1 and 2. Here ImageNet-10 consists of 10 classes randomly selected from the ImageNet dataset, with 500 pictures per class, i.e. 5000 pictures in total; 4000 are used as the training set and 500 each as the validation set and test set.
Table 1 CIFAR-10 test results
Table 2 Imagenet-10 test results
Overall, the VGG backbone clearly outperforms each lightweight backbone, and its performance is relatively stable. Among the lightweight backbones, the ShuffleNetV2 backbone performs best, with Top-1 accuracy closest to the VGG backbone and a certain advantage in inference time, while the GhostNet and MobileNetV3 backbones perform slightly worse on both the small and the large dataset. This may be because the datasets used in the experiment are too small or the fine-tuning rules do not suit those two networks. The VGG backbone and ShuffleNetV2 backbone structures are therefore tentatively adopted, and the most suitable model is determined by subsequent experiments.
Optimization 2: refinishent structural optimization
In Lightweight OpenPose, to maximize the contribution of every unit of computation to network accuracy, only one refinement stage is used, and within that refinement stage the Block module shown in Fig. 5 is adopted: small convolution kernels further reduce the parameter count, and a residual structure is introduced to counter the gradient-vanishing problem.
While the configuration with only one refinement stage greatly reduces the parameter count, it also sacrifices considerable accuracy. To improve accuracy as far as possible, the invention adopts the OpenPose structure with five refinement stages, but this greatly increases network depth and aggravates the gradient-vanishing problem. Inspired by the DenseNet architecture, the invention therefore introduces dense connections into the refinement-stage structure; the specific wiring is shown in Fig. 6. The overall structure comprises one initial stage and five refinement stages, and the input of each refinement stage is formed by concatenating the outputs of all preceding stages (including the initial stage) with the backbone-extracted feature map. This improves feature reuse, strengthens feature propagation between layers, and effectively alleviates the gradient-vanishing problem.
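The dense wiring described above can be sketched as follows. This is a minimal NumPy illustration, not the patented network: the stage functions are random-projection stand-ins for the real convolutional stages, and the 57 output channels are only a plausible constant echoing OpenPose's 38 PAF + 19 heatmap channels.

```python
import numpy as np

def make_stage(out_ch: int):
    # Stand-in for an Initial/Refinement Stage: a 1x1-convolution-like random
    # projection from however many input channels to out_ch output channels.
    def stage(x):                                   # x: (C, H, W)
        w = np.random.randn(out_ch, x.shape[0])
        return np.tensordot(w, x, axes=([1], [0]))  # -> (out_ch, H, W)
    return stage

backbone_feat = np.random.randn(128, 28, 28)   # backbone-extracted feature map
stages = [make_stage(57) for _ in range(6)]    # 1 initial + 5 refinement stages

outputs = []
for stage in stages:
    # Dense connection: every stage sees the backbone feature map concatenated
    # with the outputs of ALL preceding stages along the channel axis.
    dense_in = np.concatenate([backbone_feat] + outputs, axis=0)
    outputs.append(stage(dense_in))

# The fifth refinement stage therefore receives 128 + 5*57 input channels.
print([o.shape for o in outputs])
```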
In Experiment 2, dense connections were introduced into the refinement stages on top of the VGG backbone and the ShuffleNetV2 backbone, respectively, followed by training tests on the COCO2017 dataset. Experiment 2 compares network performance before and after the refinement-stage improvement and determines the final backbone structure; its results are shown in Table 3.
Table 3 Comparison of results of Experiment 2

Experiment number  Network structure                          mAP
1                  VGG backbone + non-dense connection        44.1%
2                  VGG backbone + dense connection            45.5%
3                  ShuffleNetV2 backbone + dense connection   43.2%
Comparing groups 1 and 2 of Experiment 2 shows that introducing dense connections into the VGG backbone network improves mAP by 1.4%. Comparing groups 2 and 3 shows that the network performance of ShuffleNetV2 as the backbone structure is mediocre: even after dense connections are introduced, it remains inferior to the VGG backbone network without dense connections. Taken together, introducing dense connections alleviates the gradient-vanishing problem to a certain extent and further improves network performance, and the VGG backbone with dense connections is determined to be the optimal choice.
2. Preprocessing module design
The 3D skeleton sequence obtained through the Lightweight OpenPose human-pose detection network and the depth camera is affected by factors such as occlusion and sensor noise: skeleton points may be missing, have abnormal depth values, or be unstable, and therefore require further preprocessing. The preprocessing mainly comprises completing skeleton points, correcting joint depths, Kalman filtering, and normalization. Referring to Fig. 7, the preprocessing flow is as follows:
First, it is judged whether any skeleton points of the 3D skeleton sequence are missing. If no points are missing, it is judged directly whether the joint depths of the 3D skeleton sequence are normal; if points are missing, they are completed from the most recent history frames before the joint-depth check. If the joint depths are normal, Kalman filtering and normalization are applied; if a joint depth is abnormal, the depth is first corrected according to prior knowledge, and then Kalman filtering and normalization are applied. The final result is a relatively stable and complete 3D skeleton sequence.
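The decision flow above can be sketched as a single per-frame routine. This is a hedged sketch: the helper callables (`depth_ok`, `smooth`, `normalize`) and the clamp-based depth correction are illustrative assumptions; the actual repair rules are described in the next paragraph.

```python
import numpy as np

def preprocess_frame(skeleton, history, depth_ok, smooth, normalize):
    """One pass of the preprocessing flow for a single frame.

    skeleton : (J, 3) array; rows containing NaN mark missing joints
    history  : list of recent complete skeleton frames, newest last
    depth_ok : callable judging whether a joint depth value is plausible
    smooth   : callable applying real-time Kalman filtering
    normalize: callable applying the final normalization
    """
    skeleton = skeleton.copy()

    # 1. Complete missing skeleton points from the most recent history frame.
    missing = np.isnan(skeleton).any(axis=1)
    if missing.any() and history:
        skeleton[missing] = history[-1][missing]

    # 2. Correct abnormal joint depths using prior knowledge; here the depth
    #    is simply clamped to the camera's working range (0.6 m to 6 m).
    for j in range(skeleton.shape[0]):
        if not depth_ok(skeleton[j, 2]):
            skeleton[j, 2] = np.clip(skeleton[j, 2], 0.6, 6.0)

    # 3. Kalman filtering and normalization yield the final stable frame.
    return normalize(smooth(skeleton))
```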
For skeleton-point completion, up to 30 frames are searched backwards from the current frame, and the coordinates of the missing skeleton point in the frame closest in time that contains it are used as its coordinates in the current frame. This method only handles points that are missing for part of the time; it cannot resolve skeleton points that remain missing throughout. A depth-correction step is therefore added, which gives a reasonable prediction for persistently missing skeleton points based on prior knowledge of the human skeleton. Persistently missing points are mostly leg joints, so their coordinates are predicted from the bone-length proportions of the human body. Skeleton points that have undergone completion and depth correction still fluctuate noticeably in the time domain, so Kalman filtering is applied to the skeleton sequence in real time to improve the stability of the data, as shown in Fig. 8. Finally, the data are normalized to obtain the final preprocessed skeleton sequence; Fig. 9 compares the skeleton data before (left) and after (right) preprocessing.
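As a concrete illustration of two of these repair steps, the sketch below (a simplification, not the patented implementation) searches up to 30 past frames for the most recent valid coordinate of a missing joint, and applies a minimal one-dimensional constant-position Kalman filter of the kind commonly used to damp such fluctuations:

```python
import numpy as np

def fill_from_history(joint_id, frames, max_back=30):
    """Return the most recent non-missing coordinate of `joint_id`,
    scanning backwards over at most `max_back` past frames."""
    for frame in reversed(frames[-max_back:]):
        coord = frame[joint_id]
        if not np.isnan(coord).any():
            return coord
    return None  # missing over the whole window -> needs depth correction

class Kalman1D:
    """Constant-position Kalman filter for one coordinate channel."""
    def __init__(self, q=1e-3, r=1e-2):
        self.x, self.p = 0.0, 1.0   # state estimate and its variance
        self.q, self.r = q, r       # process and measurement noise

    def update(self, z):
        self.p += self.q                # predict: variance grows
        k = self.p / (self.p + self.r)  # Kalman gain
        self.x += k * (z - self.x)      # correct with measurement z
        self.p *= (1.0 - k)
        return self.x
```

Applied per joint and per coordinate axis, the filter trades a small lag for a markedly smoother trajectory.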
3. Process segmentation module design
To count the timing information of each procedure accurately, the process segmentation module must determine the start and end points of every procedure. The specific segmentation scheme is as follows:
A set of state vectors v_st is defined, where st1 describes the extension state of the current arm and st2 describes the direction information of the current arm; each vector is shown in Fig. 10:

v_st = [st1, st2]
In the actual detection process, the operator picking up and putting down the tool are defined as the start and end points of a procedure, respectively. As shown in Fig. 11, when the tool is picked up, the operator's arm is approximately straight and biased toward the operator's front right; at this moment st1 lies within threshold t1 and st2 lies within threshold t2 (the case of picking up the tool with both hands is similar), giving an accurate description of the arm state. Experimental tests show that the two state vectors defined above describe the tool pickup and put-down states well and enable accurate process segmentation.
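A hedged sketch of this segmentation rule follows. The joint layout, the elbow-angle definition of st1, the azimuth definition of st2, and the threshold values t1 and t2 are all illustrative assumptions, since the text only requires that both components fall within their thresholds:

```python
import numpy as np

def arm_state(shoulder, elbow, wrist):
    """Compute the state vector v_st = [st1, st2] for one arm.

    st1: extension state, here the elbow angle in degrees (180 = straight).
    st2: direction, here the azimuth of the shoulder->wrist vector in degrees.
    """
    u, v = shoulder - elbow, wrist - elbow
    cos_a = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    st1 = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    d = wrist - shoulder
    st2 = np.degrees(np.arctan2(d[0], d[2]))  # azimuth in the camera x-z plane
    return np.array([st1, st2])

def is_tool_pickup(v_st, t1=(160.0, 180.0), t2=(20.0, 70.0)):
    # Procedure start: arm nearly straight AND pointing to the front right.
    return t1[0] <= v_st[0] <= t1[1] and t2[0] <= v_st[1] <= t2[1]
```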
The advantages of the system of the invention are verified by comparative experiments as follows:
1. Human body posture estimation network performance comparison
To evaluate the actual performance of the improved network, detailed evaluation tests were run on the COCO2017 dataset and in the actual usage scenario for OpenPose, Lightweight OpenPose, and the network improved by the invention. The test environment is shown in Table 4 and the test results in Table 5.
Table 4 Test conditions

Related resource   Configuration
CPU                i5-10400
GPU                GTX970
Memory             16GB
Operating system   Ubuntu 20.04
Table 5 comparison of network performance
In terms of mAP, the improved network proposed by the invention is only 3.5% lower than OpenPose, compared with the 5.8% gap of Lightweight OpenPose, giving the proposed network a certain advantage. In terms of parameters, the adoption of the VGG backbone and additional refinement stages makes the parameter count somewhat larger than that of Lightweight OpenPose, but still far smaller than that of OpenPose. In terms of real-time performance, the network reaches a real-time detection frame rate of 26 fps, far higher than the 12 fps of OpenPose and only 4 fps lower than Lightweight OpenPose. From a practical standpoint, the network of the invention meets the requirement of real-time processing while offering higher accuracy.
In summary, the improvement proposed by the invention significantly raises network accuracy at only a slight cost in the running frame rate of Lightweight OpenPose, making the network better suited to the actual application scenario of this design; this demonstrates, to a certain extent, the effectiveness of the improved network proposed by the invention.
2. Real-time process detection results
Referring to Fig. 12, for the action-recognition part, 5000 video samples were produced for DGNN training, mainly covering five actions: filing, knocking, corner repair, through hole, and spot inspection. Each action has 1000 samples, of which 800 are used for the training set and the remaining 100 each for the validation set and the test set.
Finally, the system of the invention achieves the goal of detecting procedures in real time, with the frame rate stable at about 18 fps. In addition, the system provides functions such as working-time statistics, procedure execution counting, finished-workpiece counting, and data saving. The final effect of the system of the invention is shown in Fig. 13.
The invention also discloses a real-time process detection method based on skeleton feature extraction, which comprises the following steps:
Step 1: the video acquisition module 1 accurately acquires the RGB-Stream and Depth-Stream and inputs them into the skeleton extraction module 2;
Step 2: the Lightweight OpenPose human-pose detection network in the skeleton extraction module 2 detects the joint parts of all persons in the image, then matches all detected joints to obtain each individual human pose, extracting the RGB-Stream with a complex background into a highly robust 2D skeleton sequence;
Step 3: the Depth alignment processing unit in the skeleton extraction module 2 converts the RGB-Stream and Depth-Stream from their different coordinate systems into the same coordinate system, adding depth to the 2D image to obtain an aligned depth image;
Step 4: the skeleton extraction module 2 inputs the 2D skeleton sequence and the aligned depth image into the preprocessing module 3; the preprocessing module 3 converts the 2D skeleton sequence into a 3D skeleton sequence and performs a series of preprocessing operations on it to obtain a stable and complete 3D skeleton sequence;
Step 5: the preprocessing module 3 inputs the stable and complete 3D skeleton sequence into the process segmentation module 4, which segments out the time points from the start action to the end action of a procedure and inputs only the 3D skeleton sequence within that time interval into the action recognition module 5; sequences outside the interval receive no further processing, so the 3D skeleton sequence is segmented in real time;
Step 6: the action recognition module 5 recognizes the action corresponding to the 3D skeleton sequence input by the process segmentation module 4 and passes the recognized action information to the display module 6;
Step 7: the display module 6 displays the video stream, the skeleton data, and the procedure under detection in real time, and records and saves the detection results.
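Steps 1 to 7 can be summarized as one frame's pass through a module pipeline; in the sketch below, all module callables are hypothetical stand-ins for modules 2 to 6:

```python
def run_pipeline(rgb_stream, depth_stream, modules):
    """One frame through the detection pipeline of steps 1-7.

    `modules` is a dict of callables standing in for modules 2-6:
    extract_2d, align_depth, preprocess, segment, recognize, display.
    """
    # Steps 2-3: the skeleton extraction module produces a 2D skeleton
    # and an aligned depth image from the two streams.
    skeleton_2d = modules["extract_2d"](rgb_stream)
    depth_img = modules["align_depth"](rgb_stream, depth_stream)

    # Step 4: preprocessing lifts the 2D skeleton to a stable 3D sequence.
    skeleton_3d = modules["preprocess"](skeleton_2d, depth_img)

    # Step 5: only frames inside a procedure interval go on to recognition.
    if not modules["segment"](skeleton_3d):
        return None  # outside any procedure: skip steps 6-7 for this frame

    # Steps 6-7: recognize the action and hand it to the display module.
    action = modules["recognize"](skeleton_3d)
    modules["display"](action)
    return action
```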
Further, in step 4, the preprocessing operation performed by the preprocessing module 3 on the 3D bone sequence includes: complement skeletal points, joint depth correction, kalman filtering, and normalization.
Further, in step 5, the process segmentation module 4 performs process segmentation as follows: a set of state vectors is defined, comprising a state vector describing the extension state of the current arm and a state vector describing the direction information of the current arm; the operator picking up and putting down the tool are defined as the start and end points of a procedure, respectively. When the tool is picked up, the operator's arm is approximately straight and biased toward the operator's front right; at this moment the state vector describing the arm's extension state lies within threshold t1 and the state vector describing the arm's direction information lies within threshold t2, giving an accurate description of the arm state and enabling accurate process segmentation.
The invention also discloses a computer storage medium, at least one executable instruction is stored in the computer storage medium, and the executable instruction enables a processor to execute the operation corresponding to the real-time procedure detection method based on skeleton feature extraction.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A real-time process detection system, characterized by: the device consists of a video acquisition module (1), a skeleton extraction module (2), a preprocessing module (3), a procedure segmentation module (4), an action recognition module (5) and a display module (6); wherein,
The video acquisition module (1) is responsible for accurately acquiring RGB-Stream and Depth-Stream and inputting the RGB-Stream and the Depth-Stream into the skeleton extraction module (2);
the skeleton extraction module (2) comprises a light openpose human body posture detection network and a depth alignment processing unit, wherein,
The lightweight openpose human body posture detection network adopts a bottom-up detection thought, namely is responsible for detecting joint parts of all people in the graph, and then carries out matching operation on all detected joints, so as to obtain each independent human body posture, and realize that RGB-Stream with complex background is extracted into a 2D skeleton sequence with high robustness;
The Depth alignment processing unit is responsible for converting RGB-Stream and Depth-Stream under different coordinate systems into the same coordinate system, realizing additional Depth for the 2D image and obtaining an aligned Depth image;
The preprocessing module (3) is responsible for simultaneously receiving the 2D skeleton sequence input by the skeleton extraction module (2) and the aligned depth image, completing the conversion operation from the 2D skeleton sequence to the 3D skeleton sequence, and performing a series of preprocessing operations on the converted 3D skeleton sequence so as to obtain a stable and complete 3D skeleton sequence;
The procedure segmentation module (4) is responsible for segmenting time points from the starting action to the ending action of the procedure, and only inputs the 3D skeleton sequences in the time interval to the subsequent action recognition module (5), and the 3D skeleton sequences which do not belong to the time interval are not subjected to subsequent processing, so that the function of real-time segmentation of the 3D skeleton sequences is realized;
The action recognition module (5) is composed of DGNN and is responsible for recognizing actions corresponding to the 3D skeleton sequences input by the procedure segmentation module (4);
the display module (6) is used for realizing the integral encapsulation of the system, is responsible for real-time video stream display, skeleton data display, in-process detection display, and detection result recording and storage.
2. The real-time process detection system according to claim 1, wherein: the video acquisition module (1) is an Intel RealSense D455 depth camera; the depth camera adopts structured-light technology, with a maximum depth range of 0.6-6 m and a depth error below 2% within 4 m; the RGB-FOV of the depth camera is 90°×65° and the Depth-FOV is 86°×57°; the RGB sensor resolution and frame rate of the depth camera reach up to 1280x800@30fps, and the depth sensor resolution and frame rate up to 1280x720@30fps.
3. The real-time process detection system according to claim 1, wherein the backbone portion of the lightweight openpose human body posture detection network is structurally optimized with VGG19 to form a VGG backbone, so as to enhance feature expression capability and alleviate the problem of limited learning capability of the subsequent network.
4. The real-time process detection system according to claim 1, wherein: the refinement stage part of the lightweight openpose human body posture detection network is structurally optimized by introducing dense connections; the overall structure of the lightweight openpose human body posture detection network comprises one initial stage and five refinement stages, and the input of each refinement stage is obtained by concatenating the outputs of all preceding stages, including the initial stage, with the backbone-extracted feature map, so that feature reusability is improved, feature transfer between different layers is enhanced, and the gradient-vanishing problem is alleviated.
5. The real-time process detection system according to claim 1, wherein: the preprocessing module (3) performs preprocessing on the 3D bone sequence, wherein the preprocessing comprises the steps of completing bone points, correcting joint depth, kalman filtering and normalizing.
6. The real-time process detection system according to claim 1, wherein the process segmentation module (4) performs the process segmentation by: defining a group of state vectors, wherein the state vectors comprise state vectors describing the stretching state of the current arm and state vectors describing the direction information of the current arm; respectively defining two states of taking and putting down the tool by an operator as a starting point and an ending point of a working procedure; when the tool is taken, the arm of the operator is approximately in a straight state and is biased to the right front of the operator, at this time, the state vector describing the straight state of the current arm is in a threshold t1, and the state vector describing the direction information of the current arm is in a threshold t2, so that the accurate description of the state of the arm is realized, and the accurate process segmentation is realized.
7. A real-time process detection method based on skeleton feature extraction, characterized by comprising the following steps:
Step 1, accurately acquiring RGB-Stream and Depth-Stream by a video acquisition module (1), and inputting the RGB-Stream and the Depth-Stream into a skeleton extraction module (2);
Step 2, a lightweight openpose human body posture detection network in the skeleton extraction module (2) is responsible for detecting joint parts of all people in the drawing, then performing matching operation on all detected joints, further obtaining each independent human body posture, and extracting RGB-Stream with complex background into a 2D skeleton sequence with high robustness;
Step 3, a Depth alignment processing unit in the skeleton extraction module (2) converts RGB-Stream and Depth-Stream under different coordinate systems into the same coordinate system, and adds Depth to the 2D image to obtain an aligned Depth image;
Step 4, the skeleton extraction module (2) inputs the 2D skeleton sequence and the aligned depth image into the preprocessing module (3), the preprocessing module (3) converts the 2D skeleton sequence into a 3D skeleton sequence, and a series of preprocessing operations are carried out on the converted 3D skeleton sequence to obtain a stable and complete 3D skeleton sequence;
Step 5, the preprocessing module (3) inputs a stable and complete 3D skeleton sequence to the procedure segmentation module (4), the procedure segmentation module (4) segments the time point from the starting action to the ending action of the procedure, only inputs the 3D skeleton sequence in the time interval to the action recognition module (5), and does not perform subsequent processing on the 3D skeleton sequence which does not belong to the time interval, so as to segment the 3D skeleton sequence in real time;
Step 6, the action recognition module (5) recognizes actions corresponding to the 3D skeleton sequences input by the procedure segmentation module (4), and inputs the recognized action information to the display module (6);
and 7, displaying the video stream, the skeleton data and the on-line detection procedure in real time by the display module (6), and recording and storing the detection result.
8. The real-time process detection method based on skeleton feature extraction of claim 7, wherein in step 4, the preprocessing operation performed by the preprocessing module (3) on the 3D skeleton sequence includes: complement skeletal points, joint depth correction, kalman filtering, and normalization.
9. The method for detecting the real-time process based on the skeleton feature extraction of claim 7, wherein in the step 5, the process segmentation module (4) performs the process segmentation by: defining a group of state vectors, wherein the state vectors comprise state vectors describing the stretching state of the current arm and state vectors describing the direction information of the current arm; respectively defining two states of taking and putting down the tool by an operator as a starting point and an ending point of a working procedure; when the tool is taken, the arm of the operator is approximately in a straight state and is biased to the right front of the operator, at this time, the state vector describing the straight state of the current arm is in a threshold t1, and the state vector describing the direction information of the current arm is in a threshold t2, so that the accurate description of the state of the arm is realized, and the accurate process segmentation is realized.
10. A computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the method for real-time process detection based on skeleton feature extraction according to any one of claims 7 to 9.
CN202210279966.8A 2022-03-21 2022-03-21 Real-time process detection system, method and storage medium Active CN114973397B (en)


Publications (2)

Publication Number Publication Date
CN114973397A CN114973397A (en) 2022-08-30
CN114973397B true CN114973397B (en) 2024-07-05


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861624A (en) * 2021-01-05 2021-05-28 哈尔滨工业大学(威海) Human body posture detection method, system, storage medium, equipment and terminal
CN113743217A (en) * 2021-08-03 2021-12-03 西安理工大学 End-to-end pedestrian action identification method based on skeleton points

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110837778B (en) * 2019-10-12 2023-08-18 南京信息工程大学 Traffic police command gesture recognition method based on skeleton joint point sequence




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant