CN114648723A - Action normative detection method and device based on time consistency comparison learning - Google Patents
- Publication number: CN114648723A
- Application number: CN202210454687.0A
- Authority: CN (China)
- Prior art keywords: image, data, network, action, enhancement
- Legal status: Granted (the status listed is an assumption and is not a legal conclusion)
Classifications
- G—PHYSICS; G06—COMPUTING; G06N—Computing arrangements based on specific computational models; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks
- G06N3/045—Combinations of networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/088—Non-supervised learning, e.g. competitive learning
Abstract
The invention relates to the fields of intelligent video monitoring and deep learning, and in particular to an action normative detection method and device based on time consistency comparison learning. The method comprises the following steps: first, a data set is constructed from camera-collected videos, a first quantity of them labeled and a second quantity unlabeled, the first quantity being smaller than the second; second, the unlabeled videos are given strong and weak data enhancement, their features are extracted and input into a time consistency behavior alignment network, which outputs feature maps and the sets of start and end frames of similar actions between different samples; the sets are mapped to the corresponding sub-feature maps on the feature maps, same-class and different-class sub-feature-map samples are constructed, and the samples are sent to a contrastive learning network to extract spatio-temporal discriminative features; then, the first quantity of labeled videos is sent to the pre-trained network for transfer learning, and behavior categories are output; finally, behavior normativity is judged from inter-frame behavior category changes, and an early warning is issued if the behavior is judged non-standard.
Description
Technical Field
The invention relates to the field of intelligent video monitoring and deep learning, in particular to an action normative detection method and device based on time consistency comparison learning.
Background
Medical workers stand on the front line of epidemic response and protect the life and safety of the public. Protective equipment is their key protective barrier: it reduces the high infection rate caused by exposure, and donning and doffing protective clothing according to the standard procedure is an important infection-prevention measure, while medical workers who do not wear protective clothing according to the standard run a high risk of infection. Standardizing the donning and doffing process therefore effectively avoids the isolation of a whole team caused by the infection of an individual and reduces non-combat attrition.
Not only medical personnel must follow standard procedures: disinfection and personal protective equipment procedures must be followed in every field of operation with high infection risk. Existing enforcement of action-flow normativity relies mostly on personnel training and individual attention, which leaves a high infection risk, so an intelligent monitoring means that watches human actions in real time and judges whether the action flow meets the standard is urgently needed.
Existing behavior recognition methods fall into two categories, supervised and unsupervised. Supervised methods need a large number of labeled samples to reach a high recognition accuracy, and the labeling cost is very high; unsupervised methods require no sample labels, but their recognition rate is lower than that of supervised methods. Contrastive learning is an unsupervised approach: it uses a large amount of unlabeled data in a self-supervised pre-training mode to learn the prior distribution of the data, and then performs transfer learning on downstream tasks (image classification, target detection, and the like) to improve their performance.
Contrastive learning assumes that data derived from the same sample by data enhancement are of the same class, and that any data not from the same sample are of different classes; features are learned by using a contrastive loss function to maximize the similarity within classes and minimize the similarity between classes. Under the same amount of labeled data, contrastive learning plus transfer learning surpasses supervised methods on image classification and markedly reduces the labeling cost. When this assumption is applied to untrimmed video classification, however, a serious problem appears: when a video segment spans two or more actions, the algorithm can hardly learn discriminative features, which hurts the recognition accuracy of the downstream task. The algorithm therefore needs to be improved so that data containing similar actions but coming from different samples are all classified as the same class, while data with different actions are classified as different classes.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a low-cost, high-accuracy action normative detection method and device based on time consistency comparison learning. The specific technical scheme is as follows:
An action normative detection method based on time consistency comparison learning comprises the following steps:
step one, constructing a data set from videos collected by cameras, a first quantity of them labeled and a second quantity unlabeled, wherein the first quantity is smaller than the second quantity;
step two, performing strong enhancement and weak enhancement on the second quantity of unlabeled video data to obtain strong enhancement data and weak enhancement data respectively;
step three, inputting the strong and weak enhancement data into a self-encoding feature extraction network to extract the features of the strong and weak enhancement data;
step four, inputting the features of the strong and weak enhancement data into a time consistency behavior alignment network to obtain strong and weak enhanced image features, and searching for and aligning the nearest-neighbor frames of similar actions between the image feature sequence pairs formed from the strong and weak enhanced image features, to obtain the sets of similar-action start frames and end frames between the image feature sequence pairs;
step five, inputting the strong and weak enhanced image features and the sets of similar-action start frames and end frames between the image feature sequence pairs into a spatio-temporal discriminant feature extraction network, and completing, in combination with a contrastive learning network, the self-supervised pre-training on the second quantity of unlabeled video data;
step six, retaining the self-encoding feature extraction network parameters obtained by self-supervised pre-training, appending a classification network after the self-encoding feature extraction network, then using the first quantity of labeled video data to complete the transfer learning of the network, and finally judging the action normativity in the video images from the change of behavior categories between video frames.
Further, the second step is specifically as follows:
let the untrimmed video data be $X = \{x_1, x_2, \dots, x_T\}$, where $x_i$ is the $i$-th video frame and $T$ is the total number of video frames. An image sequence $S = \{s_1, s_2, \dots, s_{T'}\}$ is obtained by sampling from the video $X$, where $s_i \in \mathbb{R}^{H \times W \times 3}$ is an RGB image of size $H \times W$, $T' = \lfloor T/f \rfloor$, and $f$ is the sampling frequency. After data enhancement, the strongly enhanced image sequence is $S^s = \{s^s_1, s^s_2, \dots, s^s_{T'}\}$, i.e. the strong enhancement data, and the weakly enhanced image sequence is $S^w = \{s^w_1, s^w_2, \dots, s^w_{T'}\}$, i.e. the weak enhancement data;
the weak data enhancement mode is color enhancement combined with scale transformation, and the strong data enhancement mode is video segment permutation combined with scale transformation; the video segment permutation is: dividing the video into at most $M$ segments and randomly shuffling the divided segments.
Further, the third step is specifically:
using 3D ResNet50 as the self-encoder, i.e. the self-encoding feature extraction network, $S^s$ and $S^w$ are mapped to high-dimensional features by the 3D ResNet50 self-encoder: $F = \phi(S; \theta) \in \mathbb{R}^{T' \times D}$, where $S$ denotes $S^s$ or $S^w$, $\phi(\cdot\,; \theta)$ is the 3D ResNet50 function with parameters $\theta$, and $D$ is the dimension of the output feature vectors; that is, after passing through the self-encoder, the features $F^s$ of the strong enhancement data and the features $F^w$ of the weak enhancement data are obtained respectively.
Further, the fourth step includes the following substeps:
step 4.1, the features $F^s$ of the strong enhancement data and the features $F^w$ of the weak enhancement data are passed through a spatio-temporal global average pooling layer, a fully connected layer and a convolution layer to output the image feature sequences $V^s = \{v^s_1, v^s_2, \dots, v^s_{T'}\}$ and $V^w = \{v^w_1, v^w_2, \dots, v^w_{T'}\}$, where $v^s_i$ is the strong-enhanced image feature of the $i$-th frame and $v^w_i$ is the weak-enhanced image feature of the $i$-th frame;
step 4.2, for the image feature sequence pair $(V^s, V^w)$ output in step 4.1, search for the nearest-neighbor frames of similar actions between the image feature sequences: first compute the soft nearest neighbor $\tilde v^w$ of the $i$-th frame strong-enhanced image feature $v^s_i$ in $V^w$; having obtained $\tilde v^w$, compute in turn its nearest-neighbor frame $v^s_k$ in $V^s$; if $i = k$, the alignment of the similar actions between the image feature sequence pair succeeds. To compute the loss function, the $i$-th frame of $V^s$ is labeled 1 and the other frames 0, and the predicted value is
$$\hat y_k = \frac{\exp(\mathrm{sim}(\tilde v^w, v^s_k))}{\sum_j \exp(\mathrm{sim}(\tilde v^w, v^s_j))}, \qquad \tilde v^w = \sum_j \alpha_j\, v^w_j, \qquad \alpha_j = \frac{\exp(\mathrm{sim}(v^s_i, v^w_j))}{\sum_k \exp(\mathrm{sim}(v^s_i, v^w_k))},$$
where $v^s_k$ denotes the strong-enhanced image feature of the $k$-th frame of $V^s$; the loss between the predicted value and the real label is computed with the cross-entropy loss function:
$$\mathcal{L}_{align} = -\sum_{k=1}^{T'} y_k \log \hat y_k,$$
where $v^w_j$ and $v^w_k$ respectively denote the weak-enhanced image features of the $j$-th and $k$-th frames of $V^w$, $\mathrm{sim}(\cdot, \cdot)$ denotes a similarity measure between its two arguments, $y_k$ denotes the real label, and $\hat y_k$ denotes the predicted value;
step 4.3, record the feature positions with $i = k$ in the image feature sequence pairs of step 4.2: for the input image feature sequence pairs, recording the $i = k$ positions forms the similar-action start frame set $P = \{(p^s_n, p^w_n)\}_{n=1}^{N}$ and the action end frame set $Q = \{(q^s_n, q^w_n)\}_{n=1}^{N}$, where $N$ is the number of successfully aligned image feature sequence pairs, $p^s_n$ and $p^w_n$ respectively denote the alignment start positions of the two image feature sequences, and $q^s_n$ and $q^w_n$ respectively denote their alignment end positions.
Further, the step five includes the following substeps:
step 5.1, the image feature sequences $V^s$ and $V^w$ output in step 4.1 are passed through several multi-layer perceptron layers to output the perception feature sequences $Z^s = \{z^s_1, z^s_2, \dots, z^s_{T'}\}$ and $Z^w = \{z^w_1, z^w_2, \dots, z^w_{T'}\}$, where $z^s_i$ is the perception feature map of the $i$-th frame strong-enhanced image and $z^w_i$ is the perception feature map of the $i$-th frame weak-enhanced image;
step 5.2, according to the similar-action start frame and end frame sets $P$ and $Q$ output in step 4.3, unify the lengths of the similar-action sequences by taking the minimum sequence length $\rho = \min_n (q^s_n - p^s_n)$, and sample the perception feature sequences $Z^s$ and $Z^w$ output in step 5.1 between the start and end positions at this length to obtain the sub-sequence feature map pairs, which are marked as same-class positive samples $(q, k_+)$; the remaining unaligned image sequences are different-class negative samples $k_-$, wherein the number of positive samples $k_+$ is $N$ and the number of negative samples $k_-$ is $2\rho B - N$, with $B$ denoting the number of input videos. A contrastive loss function is defined using the cosine similarity:
$$\mathcal{L}_{con} = -\log \frac{\exp(\mathrm{sim}(q, k_+)/\tau)}{\sum_{k} \exp(\mathrm{sim}(q, k)/\tau)},$$
where $q$ denotes the segments of $Z^s$ and $Z^w$ whose similarity is to be computed, $\tau$ is the temperature hyper-parameter, and $\mathrm{sim}(q, k)$ denotes the cosine similarity between $q$ and $k$, with $k$ ranging over $k_+$ and $k_-$.
Further, the sixth step is specifically:
retain the self-encoder feature extraction network and lock the pre-trained parameters of each layer of the network; append a classification network, consisting of a fully connected layer and a softmax layer, after the self-encoding feature extraction network to output the behavior categories and their confidences; then use the first quantity of labeled video data to complete the transfer learning of the network, back-propagating through the network with a cross-entropy loss function and continuously updating the network parameters by a batch gradient descent method for iterative training; finally, input the test set data and output the behavior category and confidence of the current frame image to judge the operation normativity.
An action normative detection device based on time consistency comparison learning comprises one or more processors configured to implement the above action normative detection method based on time consistency comparison learning.
A computer-readable storage medium, on which a program is stored, which, when executed by a processor, implements the method for detecting normative behavior based on temporal consistency comparison learning.
The invention has the advantages that:
1. The invention realizes a low-cost, high-performance, intelligent action normative detection method.
2. Aiming at the problem that the existing same-class/different-class sample division rule of contrastive learning yields low behavior recognition accuracy when applied to untrimmed-video behavior classification, a time consistency contrastive learning network is proposed: similar-action feature maps from different samples are aligned and classified as the same class, all other non-similar action feature maps are classified as different classes, and the network loss function is improved, which effectively raises the behavior recognition accuracy.
3. The invention can effectively recognize various actions and judge whether the action flow is standard; it reaches an accuracy of 95.16% on a collected data set of disinfection personnel doffing personal protective equipment, effectively reduces the cost of manual supervision, prevents the infection risk caused by non-standard action flows, is suitable for detection environments of standard behaviors, and has wide application value.
Drawings
FIG. 1 is a schematic flow chart of an action normative detection method based on time consistency comparison learning according to the present invention;
FIG. 2 is a network architecture diagram of an action normative detection method based on time consistency comparison learning according to an embodiment of the present invention;
FIG. 3 is an illustration of an example of behavior labeling of the personal protective equipment removal operation flow of the disinfection personnel in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of the time-consistent behavior alignment network of the present invention;
FIG. 5 is a schematic flow chart of the network training phase of the present invention;
FIG. 6 is a diagram of the multi-class behavior recognition confusion matrix effect of the present invention;
fig. 7 is a schematic structural diagram of an action normative detection apparatus based on time consistency comparison learning according to the present invention.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.
As shown in fig. 1 and fig. 2, in the action normative detection method based on time consistency comparison learning of the present invention, first, a data set is constructed from camera-collected videos, a first quantity of them labeled and a second quantity unlabeled; second, the unlabeled videos are given strong and weak data enhancement, their features are extracted and input into the time consistency behavior alignment network, which outputs feature maps and the sets of start and end frames of similar actions between different samples; the sets are mapped to the corresponding sub-feature maps on the feature maps, same-class and different-class sub-feature-map samples are constructed, and the samples are sent to the contrastive learning network to extract spatio-temporal discriminative features; then the first quantity of labeled videos is sent to the pre-trained network for transfer learning and behavior categories are output; finally, behavior normativity is judged from the inter-frame behavior category changes, and an early warning is issued if the behavior is judged non-standard. The method specifically comprises the following steps:
Step one, constructing a data set from videos collected by cameras, a first quantity of them labeled and a second quantity unlabeled, wherein the first quantity is smaller than the second quantity.
The second quantity of unlabeled data is used for contrastive learning, and the first quantity of labeled data is used for transfer learning. A key-action category label is marked at each action start frame of a video, and a behavior-stage label is marked between two key-action start frames. The invention trains the feature extraction network in a self-supervised manner, so no manpower needs to be spent labeling the large training data set, i.e. the second quantity; however, to verify the effectiveness of the algorithm and to perform transfer learning on a small number of labeled samples, a small training set and a test set, i.e. the first quantity, are labeled. As shown in fig. 3, taking the doffing operation flow of the personal protective equipment of disinfection personnel as an example, 6 key behaviors at behavior start frames in the video are labeled, and a behavior stage lies between every two key behaviors, giving 8 behavior stages. In this embodiment, 1240 training clips and 480 validation clips were collected; only 300 clips of the training set are labeled, while all videos of the validation set are labeled.
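As an illustration of this labeling scheme, the following is a minimal sketch of how one labeled clip could be represented; the class and field names are assumptions for illustration, not structures defined by the patent.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class LabeledVideo:
    """Annotation for one labeled clip: key actions marked at their start
    frames, with a behavior stage spanned between consecutive key actions."""
    path: str
    key_actions: List[Tuple[int, int]] = field(default_factory=list)  # (start_frame, action_id)

    def stages(self) -> List[Tuple[int, int, int]]:
        # A behavior stage runs from one key-action start frame to the next;
        # here the stage inherits the label of the key action that opens it.
        out = []
        for (f0, a0), (f1, _a1) in zip(self.key_actions, self.key_actions[1:]):
            out.append((f0, f1, a0))  # (stage_start, stage_end, stage_label)
        return out
```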
And step two, performing strong enhancement and weak enhancement on the second quantity of video data which are not marked to obtain strong enhancement data and weak enhancement data respectively.
Set a strong data enhancement mode and a weak data enhancement mode: the weak data enhancement mode is color enhancement combined with scale transformation, and the strong data enhancement mode is video segment permutation combined with scale transformation; the video segment permutation is: dividing the video into at most $M$ segments and randomly shuffling the divided segments.
Let the untrimmed video data be $X = \{x_1, x_2, \dots, x_T\}$, where $x_i$ is the $i$-th video frame and $T$ is the total number of video frames. The input of the invention is an image sequence $S = \{s_1, s_2, \dots, s_{T'}\}$ sampled from the video $X$, where $s_i \in \mathbb{R}^{H \times W \times 3}$ is an RGB image of size $H \times W$, $T' = \lfloor T/f \rfloor$, and $f$ is the sampling frequency. After data enhancement, the strongly enhanced image sequence is $S^s = \{s^s_1, s^s_2, \dots, s^s_{T'}\}$, i.e. the strong enhancement data, and the weakly enhanced image sequence is $S^w = \{s^w_1, s^w_2, \dots, s^w_{T'}\}$, i.e. the weak enhancement data.
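As a concrete illustration, the sketch below implements the two enhancement modes on a clip tensor; the segment count `max_segments` (the $M$ above), the brightness-jitter range and the 224×224 target size are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn.functional as F

def weak_augment(clip: torch.Tensor) -> torch.Tensor:
    """Weak enhancement: color enhancement plus scale transformation.
    clip: (T, C, H, W) float tensor with values in [0, 1]."""
    brightness = 1.0 + (torch.rand(1).item() - 0.5) * 0.4        # random color/brightness jitter
    clip = (clip * brightness).clamp(0.0, 1.0)
    return F.interpolate(clip, size=(224, 224), mode="bilinear", align_corners=False)

def strong_augment(clip: torch.Tensor, max_segments: int = 4) -> torch.Tensor:
    """Strong enhancement: split the clip into at most M segments, randomly
    shuffle them, then apply the same scale transformation."""
    num_segments = torch.randint(2, max_segments + 1, (1,)).item()
    segments = list(torch.chunk(clip, num_segments, dim=0))      # split along time
    order = torch.randperm(len(segments)).tolist()
    clip = torch.cat([segments[i] for i in order], dim=0)        # shuffled temporal order
    return F.interpolate(clip, size=(224, 224), mode="bilinear", align_corners=False)
```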
And step three, inputting the strong and weak enhancement data into the self-encoding feature extraction network to extract the features of the strong and weak enhancement data.
The strong enhancement data $S^s$ and the weak enhancement data $S^w$ are respectively input into the self-encoding feature extraction network to extract the spatio-temporal features of the image sequences. Specifically, the invention uses 3D ResNet50 as the self-encoder, i.e. the self-encoding feature extraction network: $S^s$ and $S^w$ are mapped to high-dimensional features by the 3D ResNet50 self-encoder, $F = \phi(S; \theta) \in \mathbb{R}^{T' \times D}$, where $S$ denotes $S^s$ or $S^w$, $\phi(\cdot\,; \theta)$ is the 3D ResNet50 function with parameters $\theta$, and $D = 1028$ is the dimension of the output feature vectors; that is, after passing through the self-encoder, the features $F^s$ of the strong enhancement data and the features $F^w$ of the weak enhancement data are obtained.
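The patent does not reproduce the backbone itself; the sketch below uses torchvision's `r3d_18` video backbone as a stand-in for the 3D ResNet50 encoder $\phi$ (an assumption), with the head replaced to emit a $D$-dimensional feature. For the per-frame sequences of step 4.1 the temporal dimension would be retained before pooling; this sketch collapses it for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18  # stand-in for the 3D ResNet50 encoder

class SelfEncoder(nn.Module):
    """Self-encoding feature extraction network phi: clip -> D-dim feature."""
    def __init__(self, feature_dim: int = 1028):
        super().__init__()
        backbone = r3d_18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, feature_dim)
        self.backbone = backbone

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, C, T, H, W) -> features: (B, D)
        return self.backbone(clips)
```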
Step four: inputting the features of the strong and weak enhancement data into the time consistency behavior alignment network to obtain strong and weak enhanced image features, and searching for and aligning the nearest-neighbor frames of similar actions between the image feature sequence pairs formed from the strong and weak enhanced image features, to obtain the sets of similar-action start frames and end frames between the image feature sequence pairs.
The features $F^s$ of the strong enhancement data and the features $F^w$ of the weak enhancement data are input into the time consistency behavior action alignment network, and the self-supervised alignment loss function of the feature image sequence pairs is computed to obtain the sets of similar-action start frames and end frames between the feature image sequence pairs; these are then sent to the spatio-temporal discriminant feature learning network of the next step, where contrastive learning is used to further extract behavior-discriminative features.
Specifically, the step four is realized by the following substeps:
Step 4.1, construct the time consistency behavior alignment network head: the features $F^s$ of the strong enhancement data and the features $F^w$ of the weak enhancement data are passed through a spatio-temporal global average pooling layer, a fully connected layer and a convolution layer to output the image feature sequences $V^s = \{v^s_1, v^s_2, \dots, v^s_{T'}\}$ and $V^w = \{v^w_1, v^w_2, \dots, v^w_{T'}\}$, where $v^s_i$ is the strong-enhanced image feature of the $i$-th frame and $v^w_i$ is the weak-enhanced image feature of the $i$-th frame;
Step 4.2, for the image feature sequence pair $(V^s, V^w)$ output in step 4.1, the process of searching for the nearest-neighbor frames of similar actions between the image feature sequences is shown in fig. 4: first compute the soft nearest neighbor $\tilde v^w$ of the $i$-th frame strong-enhanced image feature $v^s_i$ in $V^w$; having obtained $\tilde v^w$, compute in turn its nearest-neighbor frame $v^s_k$ in $V^s$; if $i = k$, the alignment of the similar actions between the image feature sequence pair succeeds. To compute the loss function, the $i$-th frame of $V^s$ is labeled 1 and the other frames 0, and the predicted value is
$$\hat y_k = \frac{\exp(\mathrm{sim}(\tilde v^w, v^s_k))}{\sum_j \exp(\mathrm{sim}(\tilde v^w, v^s_j))}, \qquad \tilde v^w = \sum_j \alpha_j\, v^w_j, \qquad \alpha_j = \frac{\exp(\mathrm{sim}(v^s_i, v^w_j))}{\sum_k \exp(\mathrm{sim}(v^s_i, v^w_k))},$$
where $v^s_k$ denotes the strong-enhanced image feature of the $k$-th frame of $V^s$; the loss between the predicted value and the real label is computed with the cross-entropy loss function:
$$\mathcal{L}_{align} = -\sum_{k=1}^{T'} y_k \log \hat y_k,$$
where $v^w_j$ and $v^w_k$ respectively denote the weak-enhanced image features of the $j$-th and $k$-th frames of $V^w$, $\mathrm{sim}(\cdot, \cdot)$ denotes a similarity measure between its two arguments, $y_k$ denotes the real label, and $\hat y_k$ denotes the predicted value.
Step 4.3, record the feature positions with $i = k$ in the image feature sequence pairs of step 4.2: for the input image feature sequence pairs, recording the $i = k$ positions forms the similar-action start frame set $P = \{(p^s_n, p^w_n)\}_{n=1}^{N}$ and the action end frame set $Q = \{(q^s_n, q^w_n)\}_{n=1}^{N}$, where $N$ is the number of successfully aligned image feature sequence pairs, $p^s_n$ and $p^w_n$ respectively denote the alignment start positions of the two image feature sequences, and $q^s_n$ and $q^w_n$ respectively denote their alignment end positions.
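This cycle-back formulation is in the spirit of temporal cycle-consistency learning. Below is a minimal PyTorch sketch for one sequence pair, with a dot product standing in for $\mathrm{sim}(\cdot,\cdot)$ (an assumption) and a hard argmax used only to test whether the cycle closes ($i = k$).

```python
import torch
import torch.nn.functional as F

def cycle_alignment_loss(v_s: torch.Tensor, v_w: torch.Tensor):
    """v_s, v_w: (T, D) strong / weak image feature sequences.
    Returns the alignment loss and the frame indices where the cycle closes."""
    sim_sw = v_s @ v_w.t()                       # (T, T): sim(v_s_i, v_w_j)
    alpha = F.softmax(sim_sw, dim=1)             # soft nearest-neighbour weights alpha_j
    v_tilde = alpha @ v_w                        # (T, D): soft NN of each v_s_i in V^w
    logits = v_tilde @ v_s.t()                   # cycle back: compare with V^s
    labels = torch.arange(v_s.size(0), device=v_s.device)  # real label: i maps back to i
    loss = F.cross_entropy(logits, labels)       # L_align
    aligned = (logits.argmax(dim=1) == labels)   # frames with i == k
    return loss, aligned.nonzero(as_tuple=True)[0]
```

Runs of consecutive aligned indices then give the start and end frame sets $P$ and $Q$ of step 4.3.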
And fifthly, inputting the strong and weak enhanced image features and the image feature sequence pairs into the spatio-temporal discriminant feature extraction network, and completing the self-supervised pre-training on the second quantity of unlabeled video data in combination with the contrastive learning network.
The strong and weak enhanced image features output in step 4.1 are sent into the spatio-temporal discriminant feature extraction network, and same-class and different-class sub-feature-map samples are constructed according to the sets of similar-action start frames and end frames between the image feature sequence pairs output in step 4.3: the sets are mapped to the sub-feature maps corresponding to the strong and weak enhanced image features, the sub-feature-map pairs inside the sets are of the same class, and the other image sequence feature map pairs outside the sets are classified as different classes. The contrastive learning network then computes a contrastive loss function that maximizes the similarity within classes and minimizes the similarity between classes, effectively extracting action-discriminative features and completing the self-supervised pre-training on the second quantity of unlabeled video data.
Specifically, the step five is realized by the following substeps:
Step 5.1, construct the spatio-temporal discriminant feature extraction network head: the image feature sequences $V^s$ and $V^w$ output in step 4.1 are passed through several multi-layer perceptron layers to output the perception feature sequences $Z^s = \{z^s_1, z^s_2, \dots, z^s_{T'}\}$ and $Z^w = \{z^w_1, z^w_2, \dots, z^w_{T'}\}$, where $z^s_i$ is the perception feature map of the $i$-th frame strong-enhanced image and $z^w_i$ is the perception feature map of the $i$-th frame weak-enhanced image;
Step 5.2, set the number of videos input each time as $B$, so that $2B$ image sequences are obtained after the strong and weak data enhancement processing. According to the similar-action start frame and end frame sets $P$ and $Q$ output in step 4.3, unify the lengths of the similar-action sequences by taking the minimum sequence length $\rho = \min_n (q^s_n - p^s_n)$, and sample the perception feature sequences $Z^s$ and $Z^w$ output in step 5.1 between the start and end positions at this length to obtain the sub-sequence feature map pairs, which are marked as same-class positive samples $(q, k_+)$; the remaining unaligned image sequences are different-class negative samples $k_-$, wherein the number of positive samples $k_+$ is $N$ and the number of negative samples $k_-$ is $2\rho B - N$. A contrastive loss function is defined using the cosine similarity:
$$\mathcal{L}_{con} = -\log \frac{\exp(\mathrm{sim}(q, k_+)/\tau)}{\sum_{k} \exp(\mathrm{sim}(q, k)/\tau)},$$
where $q$ denotes the segments of $Z^s$ and $Z^w$ whose similarity is to be computed, $\tau$ is the temperature hyper-parameter, and $\mathrm{sim}(q, k)$ denotes the cosine similarity between $q$ and $k$, with $k$ ranging over $k_+$ and $k_-$.
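This is the standard InfoNCE form over cosine similarities. A minimal sketch follows, assuming the aligned sub-sequence feature pairs have already been pooled to single vectors; the temperature value 0.07 is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q: torch.Tensor, k_pos: torch.Tensor, k_neg: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """q, k_pos: (N, D) aligned positive pairs; k_neg: (2*rho*B - N, D) negatives."""
    q = F.normalize(q, dim=1)          # unit vectors, so dot product = cosine similarity
    k_pos = F.normalize(k_pos, dim=1)
    k_neg = F.normalize(k_neg, dim=1)
    pos = (q * k_pos).sum(dim=1, keepdim=True) / temperature   # (N, 1): sim(q, k+)
    neg = q @ k_neg.t() / temperature                          # (N, M): sim(q, k-)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at index 0
    return F.cross_entropy(logits, labels)
```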
In summary, as shown in fig. 5, to reduce the sample labeling cost, self-supervised pre-training is first performed on the unlabeled data set, and transfer learning is then performed on the small labeled sample data set, i.e. the first quantity. In the pre-training stage, the network is back-propagated from the two loss functions, and the network parameters are continuously updated by the batch gradient descent method until the change of the total loss value falls below a set threshold, at which point training ends. The total loss function is computed as:
$$\mathcal{L}_{total} = \mathcal{L}_{align} + \mathcal{L}_{con}.$$
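The training loop of fig. 5 can be pictured as follows. This is schematic glue only: it reuses `cycle_alignment_loss` and `info_nce_loss` from the sketches above, assumes `align_head` emits per-frame feature sequences, and `strong_augment`, `weak_augment` and `build_contrastive_pairs` are hypothetical helpers; the optimizer settings and threshold are assumptions.

```python
import torch

def pretrain(encoder, align_head, proj_head, loader, epochs: int = 100, eps: float = 1e-4):
    """Self-supervised pre-training on the unlabeled (second-quantity) data."""
    params = (list(encoder.parameters()) + list(align_head.parameters())
              + list(proj_head.parameters()))
    optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9)  # batch gradient descent
    prev = float("inf")
    for _ in range(epochs):
        epoch_loss = 0.0
        for clips in loader:                                    # unlabeled video clips
            v_s = align_head(encoder(strong_augment(clips)))    # (T, D) strong branch
            v_w = align_head(encoder(weak_augment(clips)))      # (T, D) weak branch
            loss_align, aligned = cycle_alignment_loss(v_s, v_w)
            q, k_pos, k_neg = build_contrastive_pairs(proj_head(v_s), proj_head(v_w), aligned)
            loss = loss_align + info_nce_loss(q, k_pos, k_neg)  # total loss L_align + L_con
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if abs(prev - epoch_loss) < eps:                        # loss change below threshold
            break
        prev = epoch_loss
```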
And step six, retaining the self-encoding feature extraction network parameters obtained by self-supervised pre-training, appending a classification network after the self-encoding feature extraction network, then using the first quantity of labeled video data to complete the transfer learning of the network, and finally judging the action normativity in the video images from the change of behavior categories between video frames.
Retain the self-encoder feature extraction network and lock the pre-trained parameters of each layer of the network; append a classification network, consisting of a fully connected layer and a softmax layer, after the self-encoding feature extraction network to output the behavior categories and their confidences; fine-tune the network on the training set of 300 labeled video clips, back-propagating through the network with a cross-entropy loss function and continuously updating the network parameters by the batch gradient descent method, and stop training after 100000 iterations.
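A sketch of this transfer-learning stage follows: the pre-trained encoder is frozen and only the fully connected + softmax head is trained; the class count of 8 is taken from the 8 behavior stages of the embodiment, and the wiring is an assumption.

```python
import torch
import torch.nn as nn

class BehaviorClassifier(nn.Module):
    """Frozen pre-trained encoder plus a fully connected + softmax head."""
    def __init__(self, encoder: nn.Module, feature_dim: int = 1028, num_classes: int = 8):
        super().__init__()
        for p in encoder.parameters():
            p.requires_grad = False              # lock pre-trained parameters of each layer
        self.encoder = encoder
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.encoder(clips)
        return self.fc(feats)                    # logits; train with nn.CrossEntropyLoss

    @torch.no_grad()
    def predict(self, clips: torch.Tensor):
        probs = torch.softmax(self.forward(clips), dim=1)
        conf, cls = probs.max(dim=1)             # behavior category and its confidence
        return cls, conf
```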
In the testing stage, an input video first accumulates 16 frames; thereafter, each newest frame updates the 16-frame image queue. After data enhancement, the queue passes through the self-encoder feature extraction network and the classification network head, which outputs the behavior category and confidence of the current frame image, from which the action normativity is judged.
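A sketch of the sliding 16-frame queue used at test time, assuming `classifier` is the fine-tuned model above and preprocessing has already been applied to each frame:

```python
from collections import deque
import torch

def stream_inference(frames, classifier, window: int = 16):
    """Yield (category, confidence) per frame once 16 frames have accumulated."""
    queue = deque(maxlen=window)                 # newest frame replaces the oldest
    for frame in frames:                         # frame: (C, H, W) tensor
        queue.append(frame)
        if len(queue) < window:
            continue                             # still accumulating the first 16 frames
        clip = torch.stack(list(queue), dim=1).unsqueeze(0)  # (1, C, T, H, W)
        cls, conf = classifier.predict(clip)
        yield cls.item(), conf.item()
```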
In the data labeling stage, behavior category labels are labeled in the order of the behavior actions. Taking the doffing operation of the personal protective equipment of disinfection personnel as an example, the label order corresponding to the behavior action flow shown in fig. 3 is: (0) remove gloves -> (1) disinfect hands -> (2) remove goggles -> (1) disinfect hands -> (4) remove protective clothing -> (5) remove outer shoe covers -> (6) discard protective clothing -> (0) remove gloves -> (1) disinfect hands. The inter-frame behavior label values are recorded, and if a change of the output action labels does not conform to the action flow specification, early warning information is issued.
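A sketch of this normativity check: per-frame labels are collapsed into a transition sequence and compared against the standard flow. The transition table simply encodes the label order listed above; the function name and return convention are assumptions.

```python
# Standard doffing flow from the embodiment, as an ordered label sequence.
STANDARD_FLOW = [0, 1, 2, 1, 4, 5, 6, 0, 1]

def check_flow(frame_labels):
    """Compare the sequence of inter-frame label changes with the standard flow;
    return the frame index of the first violation, or None if compliant."""
    step, prev = 0, None
    for idx, label in enumerate(frame_labels):
        if label == prev:
            continue                     # no inter-frame behavior category change
        prev = label
        if step < len(STANDARD_FLOW) and label == STANDARD_FLOW[step]:
            step += 1                    # action matches the expected stage
        else:
            return idx                   # early warning: non-standard action flow
    return None
```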
On the data set of disinfection personnel doffing personal protective equipment, with the 300 labeled video clips as the training set, the SlowFast algorithm reaches a recognition accuracy of 85.36% on the test set. To verify the effect of the time consistency behavior alignment network head, a network without this head was trained and tested, reaching a recognition accuracy of 90.15%. Fig. 6 shows the multi-class behavior recognition confusion matrix of the time consistency contrastive learning network; the average accuracy of multi-class behavior recognition in the figure is 95.16%. Under the same labeling cost, the accuracy is thus significantly improved.
Corresponding to the embodiment of the action normative detection method based on time consistency comparison learning, the invention also provides an embodiment of an action normative detection device based on time consistency comparison learning.
Referring to fig. 7, the action normative detection device based on time consistency comparison learning according to the embodiment of the present invention includes one or more processors configured to implement the action normative detection method based on time consistency comparison learning of the foregoing embodiment.
The embodiment of the action normative detection device based on time consistency comparison learning can be applied to any equipment with data processing capability, such as a computer. The device embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the device in the logical sense is formed by the processor of the equipment reading the corresponding computer program instructions from non-volatile memory into memory for execution. In terms of hardware, fig. 7 shows a hardware structure diagram of the equipment with data processing capability on which the device is located; besides the processor, memory, network interface and non-volatile memory shown in fig. 7, the equipment may also include other hardware according to its actual function, which is not described again here.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
An embodiment of the present invention further provides a computer-readable storage medium, on which a program is stored, where the program, when executed by a processor, implements the method for detecting the action normativity based on the time consistency comparison learning in the foregoing embodiments.
The computer-readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any equipment with data processing capability described in any of the foregoing embodiments. It may also be an external storage device of that equipment, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a Flash Card provided on the equipment. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the equipment. The computer-readable storage medium is used to store the computer program and the other programs and data required by the equipment, and may also be used to temporarily store data that has been output or is to be output.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Although the foregoing has described the practice of the present invention in detail, it will be apparent to those skilled in the art that modifications may be made to the practice of the invention as described in the foregoing examples, or that certain features may be substituted in the practice of the invention. All changes, equivalents and the like which come within the spirit and principles of the invention are desired to be protected.
Claims (8)
1. An action normative detection method based on time consistency comparison learning, characterized by comprising the following steps:
step one, constructing a data set from videos collected by cameras, a first quantity of them labeled and a second quantity unlabeled, wherein the first quantity is smaller than the second quantity;
step two, performing strong enhancement processing and weak enhancement processing on the second quantity of unlabeled video data to obtain strong enhancement data and weak enhancement data respectively;
step three, inputting the strong and weak enhancement data into a self-encoding feature extraction network to extract the features of the strong and weak enhancement data;
step four, inputting the features of the strong and weak enhancement data into a time consistency behavior action alignment network to obtain strong and weak enhanced image features, and searching for and aligning the nearest-neighbor frames of similar actions between the image feature sequence pairs formed from the strong and weak enhanced image features, to obtain the sets of similar-action start frames and end frames between the image feature sequence pairs;
step five, inputting the strong and weak enhanced image features and the sets of similar-action start frames and end frames between the image feature sequence pairs into a spatio-temporal discriminant feature extraction network, and completing, in combination with a contrastive learning network, the self-supervised pre-training on the second quantity of unlabeled video data;
step six, retaining the self-encoding feature extraction network parameters obtained by self-supervised pre-training, appending a classification network after the self-encoding feature extraction network, then using the first quantity of labeled video data to complete the transfer learning of the network, and finally judging the action normativity in the video images from the change of behavior categories between video frames.
2. The method for detecting the action normativity based on the time consistency comparison learning as claimed in claim 1, wherein the second step is specifically as follows:
let the untrimmed video data be $X = \{x_1, x_2, \dots, x_T\}$, where $x_i$ is the $i$-th video frame and $T$ is the total number of video frames. An image sequence $S = \{s_1, s_2, \dots, s_{T'}\}$ is obtained by sampling from the video $X$, where $s_i \in \mathbb{R}^{H \times W \times 3}$ is an RGB image of size $H \times W$, $T' = \lfloor T/f \rfloor$, and $f$ is the sampling frequency. After data enhancement, the strongly enhanced image sequence is $S^s = \{s^s_1, s^s_2, \dots, s^s_{T'}\}$, i.e. the strong enhancement data, and the weakly enhanced image sequence is $S^w = \{s^w_1, s^w_2, \dots, s^w_{T'}\}$, i.e. the weak enhancement data;
the weak data enhancement mode is color enhancement combined with scale transformation, and the strong data enhancement mode is video segment permutation combined with scale transformation; the video segment permutation is: dividing the video into at most $M$ segments and randomly shuffling the divided segments.
3. The method for detecting the action normativity based on the time consistency comparison learning as claimed in claim 2, wherein the third step is specifically as follows:
using 3D ResNet50 as the self-encoder, i.e. the self-encoding feature extraction network, $S^s$ and $S^w$ are mapped to high-dimensional features by the 3D ResNet50 self-encoder: $F = \phi(S; \theta) \in \mathbb{R}^{T' \times D}$, where $S$ denotes $S^s$ or $S^w$, $\phi(\cdot\,; \theta)$ is the 3D ResNet50 function with parameters $\theta$, and $D$ is the dimension of the output feature vectors; that is, after passing through the self-encoder, the features $F^s$ of the strong enhancement data and the features $F^w$ of the weak enhancement data are obtained.
4. The method for detecting the normative behavior based on the time consistency comparison learning as claimed in claim 3, wherein the fourth step comprises the following substeps:
step 4.1, the features $F^s$ of the strong enhancement data and the features $F^w$ of the weak enhancement data are passed through a spatio-temporal global average pooling layer, a fully connected layer and a convolution layer to output the image feature sequences $V^s = \{v^s_1, v^s_2, \dots, v^s_{T'}\}$ and $V^w = \{v^w_1, v^w_2, \dots, v^w_{T'}\}$, where $v^s_i$ is the strong-enhanced image feature of the $i$-th frame and $v^w_i$ is the weak-enhanced image feature of the $i$-th frame;
step 4.2, for the image feature sequence pair $(V^s, V^w)$ output in step 4.1, search for the nearest-neighbor frames of similar actions between the image feature sequences: first compute the soft nearest neighbor $\tilde v^w$ of the $i$-th frame strong-enhanced image feature $v^s_i$ in $V^w$; having obtained $\tilde v^w$, compute in turn its nearest-neighbor frame $v^s_k$ in $V^s$; if $i = k$, the alignment of the similar actions between the image feature sequence pair succeeds. To compute the loss function, the $i$-th frame of $V^s$ is labeled 1 and the other frames 0, and the predicted value is
$$\hat y_k = \frac{\exp(\mathrm{sim}(\tilde v^w, v^s_k))}{\sum_j \exp(\mathrm{sim}(\tilde v^w, v^s_j))}, \qquad \tilde v^w = \sum_j \alpha_j\, v^w_j, \qquad \alpha_j = \frac{\exp(\mathrm{sim}(v^s_i, v^w_j))}{\sum_k \exp(\mathrm{sim}(v^s_i, v^w_k))},$$
where $v^s_k$ denotes the strong-enhanced image feature of the $k$-th frame of $V^s$; the loss between the predicted value and the real label is computed with the cross-entropy loss function:
$$\mathcal{L}_{align} = -\sum_{k=1}^{T'} y_k \log \hat y_k,$$
where $v^w_j$ and $v^w_k$ respectively denote the weak-enhanced image features of the $j$-th and $k$-th frames of $V^w$, $\mathrm{sim}(\cdot, \cdot)$ denotes a similarity measure between its two arguments, $y_k$ denotes the real label, and $\hat y_k$ denotes the predicted value;
step 4.3, record the feature positions with $i = k$ in the image feature sequence pairs of step 4.2: for the input image feature sequence pairs, recording the $i = k$ positions forms the similar-action start frame set $P = \{(p^s_n, p^w_n)\}_{n=1}^{N}$ and the action end frame set $Q = \{(q^s_n, q^w_n)\}_{n=1}^{N}$, where $N$ is the number of successfully aligned image feature sequence pairs, $p^s_n$ and $p^w_n$ respectively denote the alignment start positions of the two image feature sequences, and $q^s_n$ and $q^w_n$ respectively denote their alignment end positions.
5. The method for detecting the normative behavior based on the time consistency comparison learning as claimed in claim 4, wherein the step five comprises the following substeps:
step 5.1, the image feature sequences $V^s$ and $V^w$ output in step 4.1 are passed through several multi-layer perceptron layers to output the perception feature sequences $Z^s = \{z^s_1, z^s_2, \dots, z^s_{T'}\}$ and $Z^w = \{z^w_1, z^w_2, \dots, z^w_{T'}\}$, where $z^s_i$ is the perception feature map of the $i$-th frame strong-enhanced image and $z^w_i$ is the perception feature map of the $i$-th frame weak-enhanced image;
step 5.2, according to the similar-action start frame and end frame sets $P$ and $Q$ output in step 4.3, unify the lengths of the similar-action sequences by taking the minimum sequence length $\rho = \min_n (q^s_n - p^s_n)$, and sample the perception feature sequences $Z^s$ and $Z^w$ output in step 5.1 between the start and end positions at this length to obtain the sub-sequence feature map pairs, which are marked as same-class positive samples $(q, k_+)$; the remaining unaligned image sequences are different-class negative samples $k_-$, wherein the number of positive samples $k_+$ is $N$ and the number of negative samples $k_-$ is $2\rho B - N$, with $B$ denoting the number of input videos. A contrastive loss function is defined using the cosine similarity:
$$\mathcal{L}_{con} = -\log \frac{\exp(\mathrm{sim}(q, k_+)/\tau)}{\sum_{k} \exp(\mathrm{sim}(q, k)/\tau)},$$
where $q$ denotes the segments of $Z^s$ and $Z^w$ whose similarity is to be computed, $\tau$ is the temperature hyper-parameter, and $\mathrm{sim}(q, k)$ denotes the cosine similarity between $q$ and $k$, with $k$ ranging over $k_+$ and $k_-$.
6. The method for detecting the action normativity based on the time consistency comparison learning as claimed in claim 5, wherein the sixth step is specifically:
retain the self-encoder feature extraction network and lock the pre-trained parameters of each layer of the network; append a classification network, consisting of a fully connected layer and a softmax layer, after the self-encoding feature extraction network to output the behavior categories and their confidences; then use the first quantity of labeled video data to complete the transfer learning of the network, back-propagating through the network with a cross-entropy loss function and continuously updating the network parameters by a batch gradient descent method for iterative training; finally, input the test set data and output the behavior category and confidence of the current frame image to judge the operation normativity.
7. An action normative detection device based on time consistency comparison learning, which is characterized by comprising one or more processors and is used for realizing the action normative detection method based on time consistency comparison learning of any one of claims 1 to 6.
8. A computer-readable storage medium, characterized in that a program is stored thereon, which when executed by a processor, implements the action normative detection method based on time-consistency comparison learning according to any one of claims 1 to 6.
Priority and Related Applications
- Application CN202210454687.0A, filed 2022-04-28 (priority date 2022-04-28), by which this publication claims priority.
- Published as CN114648723A on 2022-06-21; granted as CN114648723B on 2024-08-02 (status: Active).
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant