CN114648723A - Action normative detection method and device based on time consistency comparison learning - Google Patents
- Publication number: CN114648723A
- Application number: CN202210454687.0A
- Authority: CN (China)
- Prior art keywords: image, data, network, action, enhancement
- Legal status: Granted (the status listed is an assumption and is not a legal conclusion)
Classifications
- G—PHYSICS; G06—COMPUTING; G06N—Computing arrangements based on specific computational models; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks
- G06N3/045—Combinations of networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/088—Non-supervised learning, e.g. competitive learning
Abstract
The invention relates to the fields of intelligent video monitoring and deep learning, and in particular to an action normative detection method and device based on time consistency comparison learning. The method comprises the following steps: first, a data set is constructed from camera-collected videos, a first quantity of them labeled and a second quantity unlabeled, the first quantity being smaller than the second; second, the unlabeled videos are given strong and weak data enhancement, their features are extracted and input into a time consistency behavior alignment network, which outputs feature maps and the sets of start and end frames of similar actions between different samples; the sets are mapped to the corresponding sub-feature maps on the feature maps, same-class and different-class sub-feature-map samples are constructed, and the samples are sent to a contrastive learning network to extract spatio-temporal discriminative features; then, the first quantity of labeled videos is sent to the pre-trained network for transfer learning, and behavior categories are output; finally, behavior normativity is judged from inter-frame behavior category changes, and an early warning is issued if the behavior is judged non-standard.
Description
Technical Field
The invention relates to the field of intelligent video monitoring and deep learning, in particular to an action normative detection method and device based on time consistency comparison learning.
Background
Medical workers stand on the front line of epidemic response and protect the life and safety of the public. Protective equipment is their key protective barrier: it reduces the high infection rate caused by exposure, and donning and doffing protective clothing according to the standard procedure is an important infection-prevention measure, while medical workers who do not wear protective clothing according to the standard run a high risk of infection. Standardizing the donning and doffing process therefore effectively avoids the isolation of a whole team caused by the infection of an individual and reduces non-combat attrition.
Not only medical personnel must follow standard procedures: disinfection and personal protective equipment procedures must be followed in every field of operation with high infection risk. Existing enforcement of action-flow normativity relies mostly on personnel training and individual attention, which leaves a high infection risk, so an intelligent monitoring means that watches human actions in real time and judges whether the action flow meets the standard is urgently needed.
Existing behavior recognition methods fall into two categories, supervised and unsupervised. Supervised methods need a large number of labeled samples to reach a high recognition accuracy, and the labeling cost is very high; unsupervised methods require no sample labels, but their recognition rate is lower than that of supervised methods. Contrastive learning is an unsupervised approach: it uses a large amount of unlabeled data in a self-supervised pre-training mode to learn the prior distribution of the data, and then performs transfer learning on downstream tasks (image classification, target detection, and the like) to improve their performance.
Contrastive learning assumes that data derived from the same sample by data enhancement are of the same class, and that any data not from the same sample are of different classes; features are learned by using a contrastive loss function to maximize the similarity within classes and minimize the similarity between classes. Under the same amount of labeled data, contrastive learning plus transfer learning surpasses supervised methods on image classification and markedly reduces the labeling cost. When this assumption is applied to untrimmed video classification, however, a serious problem appears: when a video segment spans two or more actions, the algorithm can hardly learn discriminative features, which hurts the recognition accuracy of the downstream task. The algorithm therefore needs to be improved so that data containing similar actions but coming from different samples are all classified as the same class, while data with different actions are classified as different classes.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a low-cost, high-accuracy action normative detection method and device based on time consistency comparison learning. The specific technical scheme is as follows:
An action normative detection method based on time consistency comparison learning comprises the following steps:
step one, constructing a data set from videos collected by cameras, a first quantity of them labeled and a second quantity unlabeled, wherein the first quantity is smaller than the second quantity;
step two, performing strong enhancement and weak enhancement on the second quantity of unlabeled video data to obtain strong enhancement data and weak enhancement data respectively;
step three, inputting the strong and weak enhancement data into a self-encoding feature extraction network to extract the features of the strong and weak enhancement data;
step four, inputting the features of the strong and weak enhancement data into a time consistency behavior alignment network to obtain strong and weak enhanced image features, and searching for and aligning the nearest-neighbor frames of similar actions between the image feature sequence pairs formed from the strong and weak enhanced image features, to obtain the sets of similar-action start frames and end frames between the image feature sequence pairs;
step five, inputting the strong and weak enhanced image features and the sets of similar-action start frames and end frames between the image feature sequence pairs into a spatio-temporal discriminant feature extraction network, and completing, in combination with a contrastive learning network, the self-supervised pre-training on the second quantity of unlabeled video data;
step six, retaining the self-encoding feature extraction network parameters obtained by self-supervised pre-training, appending a classification network after the self-encoding feature extraction network, then using the first quantity of labeled video data to complete the transfer learning of the network, and finally judging the action normativity in the video images from the change of behavior categories between video frames.
Further, the second step is specifically as follows:
let the untrimmed video data be $X = \{x_1, x_2, \dots, x_T\}$, where $x_i$ is the $i$-th video frame and $T$ is the total number of video frames. An image sequence $S = \{s_1, s_2, \dots, s_{T'}\}$ is obtained by sampling from the video $X$, where $s_i \in \mathbb{R}^{H \times W \times 3}$ is an RGB image of size $H \times W$, $T' = \lfloor T/f \rfloor$, and $f$ is the sampling frequency. After data enhancement, the strongly enhanced image sequence is $S^s = \{s^s_1, s^s_2, \dots, s^s_{T'}\}$, i.e. the strong enhancement data, and the weakly enhanced image sequence is $S^w = \{s^w_1, s^w_2, \dots, s^w_{T'}\}$, i.e. the weak enhancement data;
the weak data enhancement mode is color enhancement combined with scale transformation, and the strong data enhancement mode is video segment permutation combined with scale transformation; the video segment permutation is: dividing the video into at most $M$ segments and randomly shuffling the divided segments.
Further, the third step is specifically:
using 3D ResNet50 as the self-encoder, i.e. the self-encoding feature extraction network, $S^s$ and $S^w$ are mapped to high-dimensional features by the 3D ResNet50 self-encoder: $F = \phi(S; \theta) \in \mathbb{R}^{T' \times D}$, where $S$ denotes $S^s$ or $S^w$, $\phi(\cdot\,; \theta)$ is the 3D ResNet50 function with parameters $\theta$, and $D$ is the dimension of the output feature vectors; that is, after passing through the self-encoder, the features $F^s$ of the strong enhancement data and the features $F^w$ of the weak enhancement data are obtained respectively.
Further, the fourth step includes the following substeps:
step 4.1, the features $F^s$ of the strong enhancement data and the features $F^w$ of the weak enhancement data are passed through a spatio-temporal global average pooling layer, a fully connected layer and a convolution layer to output the image feature sequences $V^s = \{v^s_1, v^s_2, \dots, v^s_{T'}\}$ and $V^w = \{v^w_1, v^w_2, \dots, v^w_{T'}\}$, where $v^s_i$ is the strong-enhanced image feature of the $i$-th frame and $v^w_i$ is the weak-enhanced image feature of the $i$-th frame;
step 4.2, for the image feature sequence pair $(V^s, V^w)$ output in step 4.1, search for the nearest-neighbor frames of similar actions between the image feature sequences: first compute the soft nearest neighbor $\tilde v^w$ of the $i$-th frame strong-enhanced image feature $v^s_i$ in $V^w$; having obtained $\tilde v^w$, compute in turn its nearest-neighbor frame $v^s_k$ in $V^s$; if $i = k$, the alignment of the similar actions between the image feature sequence pair succeeds. To compute the loss function, the $i$-th frame of $V^s$ is labeled 1 and the other frames 0, and the predicted value is
$$\hat y_k = \frac{\exp(\mathrm{sim}(\tilde v^w, v^s_k))}{\sum_j \exp(\mathrm{sim}(\tilde v^w, v^s_j))}, \qquad \tilde v^w = \sum_j \alpha_j\, v^w_j, \qquad \alpha_j = \frac{\exp(\mathrm{sim}(v^s_i, v^w_j))}{\sum_k \exp(\mathrm{sim}(v^s_i, v^w_k))},$$
where $v^s_k$ denotes the strong-enhanced image feature of the $k$-th frame of $V^s$; the loss between the predicted value and the real label is computed with the cross-entropy loss function:
$$\mathcal{L}_{align} = -\sum_{k=1}^{T'} y_k \log \hat y_k,$$
where $v^w_j$ and $v^w_k$ respectively denote the weak-enhanced image features of the $j$-th and $k$-th frames of $V^w$, $\mathrm{sim}(\cdot, \cdot)$ denotes a similarity measure between its two arguments, $y_k$ denotes the real label, and $\hat y_k$ denotes the predicted value;
step 4.3, record the feature positions with $i = k$ in the image feature sequence pairs of step 4.2: for the input image feature sequence pairs, recording the $i = k$ positions forms the similar-action start frame set $P = \{(p^s_n, p^w_n)\}_{n=1}^{N}$ and the action end frame set $Q = \{(q^s_n, q^w_n)\}_{n=1}^{N}$, where $N$ is the number of successfully aligned image feature sequence pairs, $p^s_n$ and $p^w_n$ respectively denote the alignment start positions of the two image feature sequences, and $q^s_n$ and $q^w_n$ respectively denote their alignment end positions.
Further, the step five includes the following substeps:
step 5.1, the image feature sequences $V^s$ and $V^w$ output in step 4.1 are passed through several multi-layer perceptron layers to output the perception feature sequences $Z^s = \{z^s_1, z^s_2, \dots, z^s_{T'}\}$ and $Z^w = \{z^w_1, z^w_2, \dots, z^w_{T'}\}$, where $z^s_i$ is the perception feature map of the $i$-th frame strong-enhanced image and $z^w_i$ is the perception feature map of the $i$-th frame weak-enhanced image;
step 5.2, according to the similar-action start frame and end frame sets $P$ and $Q$ output in step 4.3, unify the lengths of the similar-action sequences by taking the minimum sequence length $\rho = \min_n (q^s_n - p^s_n)$, and sample the perception feature sequences $Z^s$ and $Z^w$ output in step 5.1 between the start and end positions at this length to obtain the sub-sequence feature map pairs, which are marked as same-class positive samples $(q, k_+)$; the remaining unaligned image sequences are different-class negative samples $k_-$, wherein the number of positive samples $k_+$ is $N$ and the number of negative samples $k_-$ is $2\rho B - N$, with $B$ denoting the number of input videos. A contrastive loss function is defined using the cosine similarity:
$$\mathcal{L}_{con} = -\log \frac{\exp(\mathrm{sim}(q, k_+)/\tau)}{\sum_{k} \exp(\mathrm{sim}(q, k)/\tau)},$$
where $q$ denotes the segments of $Z^s$ and $Z^w$ whose similarity is to be computed, $\tau$ is the temperature hyper-parameter, and $\mathrm{sim}(q, k)$ denotes the cosine similarity between $q$ and $k$, with $k$ ranging over $k_+$ and $k_-$.
Further, the sixth step is specifically:
retain the self-encoder feature extraction network and lock the pre-trained parameters of each layer of the network; append a classification network, consisting of a fully connected layer and a softmax layer, after the self-encoding feature extraction network to output the behavior categories and their confidences; then use the first quantity of labeled video data to complete the transfer learning of the network, back-propagating through the network with a cross-entropy loss function and continuously updating the network parameters by a batch gradient descent method for iterative training; finally, input the test set data and output the behavior category and confidence of the current frame image to judge the operation normativity.
An action normative detection device based on time consistency comparison learning comprises one or more processors configured to implement the above action normative detection method based on time consistency comparison learning.
A computer-readable storage medium, on which a program is stored, which, when executed by a processor, implements the method for detecting normative behavior based on temporal consistency comparison learning.
The invention has the advantages that:
1. The invention realizes a low-cost, high-performance, intelligent action normative detection method.
2. Aiming at the problem that the existing same-class/different-class sample division rule of contrastive learning yields low behavior recognition accuracy when applied to untrimmed-video behavior classification, a time consistency contrastive learning network is proposed: similar-action feature maps from different samples are aligned and classified as the same class, all other non-similar action feature maps are classified as different classes, and the network loss function is improved, which effectively raises the behavior recognition accuracy.
3. The invention can effectively recognize various actions and judge whether the action flow is standard; it reaches an accuracy of 95.16% on a collected data set of disinfection personnel doffing personal protective equipment, effectively reduces the cost of manual supervision, prevents the infection risk caused by non-standard action flows, is suitable for detection environments of standard behaviors, and has wide application value.
Drawings
FIG. 1 is a schematic flow chart of an action normative detection method based on time consistency comparison learning according to the present invention;
FIG. 2 is a network architecture diagram of an action normative detection method based on time consistency comparison learning according to an embodiment of the present invention;
FIG. 3 is an illustration of an example of behavior labeling of the personal protective equipment removal operation flow of the disinfection personnel in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of the time-consistent behavior alignment network of the present invention;
FIG. 5 is a schematic flow chart of the network training phase of the present invention;
FIG. 6 is a diagram of the multi-class behavior recognition confusion matrix effect of the present invention;
fig. 7 is a schematic structural diagram of an action normative detection apparatus based on time consistency comparison learning according to the present invention.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.
As shown in fig. 1 and fig. 2, in the action normative detection method based on time consistency comparison learning of the present invention, first, a data set is constructed from camera-collected videos, a first quantity of them labeled and a second quantity unlabeled; second, the unlabeled videos are given strong and weak data enhancement, their features are extracted and input into the time consistency behavior alignment network, which outputs feature maps and the sets of start and end frames of similar actions between different samples; the sets are mapped to the corresponding sub-feature maps on the feature maps, same-class and different-class sub-feature-map samples are constructed, and the samples are sent to the contrastive learning network to extract spatio-temporal discriminative features; then the first quantity of labeled videos is sent to the pre-trained network for transfer learning and behavior categories are output; finally, behavior normativity is judged from the inter-frame behavior category changes, and an early warning is issued if the behavior is judged non-standard. The method specifically comprises the following steps:
Step one, constructing a data set from videos collected by cameras, a first quantity of them labeled and a second quantity unlabeled, wherein the first quantity is smaller than the second quantity.
The second quantity of unlabeled data is used for contrastive learning, and the first quantity of labeled data is used for transfer learning. A key-action category label is marked at each action start frame of a video, and a behavior-stage label is marked between two key-action start frames. The invention trains the feature extraction network in a self-supervised manner, so no manpower needs to be spent labeling the large training data set, i.e. the second quantity; however, to verify the effectiveness of the algorithm and to perform transfer learning on a small number of labeled samples, a small training set and a test set, i.e. the first quantity, are labeled. As shown in fig. 3, taking the doffing operation flow of the personal protective equipment of disinfection personnel as an example, 6 key behaviors at behavior start frames in the video are labeled, and a behavior stage lies between every two key behaviors, giving 8 behavior stages. In this embodiment, 1240 training clips and 480 validation clips were collected; only 300 clips of the training set are labeled, while all videos of the validation set are labeled.
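As an illustration of this labeling scheme, the following is a minimal sketch of how one labeled clip could be represented; the class and field names are assumptions for illustration, not structures defined by the patent.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class LabeledVideo:
    """Annotation for one labeled clip: key actions marked at their start
    frames, with a behavior stage spanned between consecutive key actions."""
    path: str
    key_actions: List[Tuple[int, int]] = field(default_factory=list)  # (start_frame, action_id)

    def stages(self) -> List[Tuple[int, int, int]]:
        # A behavior stage runs from one key-action start frame to the next;
        # here the stage inherits the label of the key action that opens it.
        out = []
        for (f0, a0), (f1, _a1) in zip(self.key_actions, self.key_actions[1:]):
            out.append((f0, f1, a0))  # (stage_start, stage_end, stage_label)
        return out
```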
And step two, performing strong enhancement and weak enhancement on the second quantity of video data which are not marked to obtain strong enhancement data and weak enhancement data respectively.
Set a strong data enhancement mode and a weak data enhancement mode: the weak data enhancement mode is color enhancement combined with scale transformation, and the strong data enhancement mode is video segment permutation combined with scale transformation; the video segment permutation is: dividing the video into at most $M$ segments and randomly shuffling the divided segments.
Let the untrimmed video data be $X = \{x_1, x_2, \dots, x_T\}$, where $x_i$ is the $i$-th video frame and $T$ is the total number of video frames. The input of the invention is an image sequence $S = \{s_1, s_2, \dots, s_{T'}\}$ sampled from the video $X$, where $s_i \in \mathbb{R}^{H \times W \times 3}$ is an RGB image of size $H \times W$, $T' = \lfloor T/f \rfloor$, and $f$ is the sampling frequency. After data enhancement, the strongly enhanced image sequence is $S^s = \{s^s_1, s^s_2, \dots, s^s_{T'}\}$, i.e. the strong enhancement data, and the weakly enhanced image sequence is $S^w = \{s^w_1, s^w_2, \dots, s^w_{T'}\}$, i.e. the weak enhancement data.
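As a concrete illustration, the sketch below implements the two enhancement modes on a clip tensor; the segment count `max_segments` (the $M$ above), the brightness-jitter range and the 224×224 target size are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn.functional as F

def weak_augment(clip: torch.Tensor) -> torch.Tensor:
    """Weak enhancement: color enhancement plus scale transformation.
    clip: (T, C, H, W) float tensor with values in [0, 1]."""
    brightness = 1.0 + (torch.rand(1).item() - 0.5) * 0.4        # random color/brightness jitter
    clip = (clip * brightness).clamp(0.0, 1.0)
    return F.interpolate(clip, size=(224, 224), mode="bilinear", align_corners=False)

def strong_augment(clip: torch.Tensor, max_segments: int = 4) -> torch.Tensor:
    """Strong enhancement: split the clip into at most M segments, randomly
    shuffle them, then apply the same scale transformation."""
    num_segments = torch.randint(2, max_segments + 1, (1,)).item()
    segments = list(torch.chunk(clip, num_segments, dim=0))      # split along time
    order = torch.randperm(len(segments)).tolist()
    clip = torch.cat([segments[i] for i in order], dim=0)        # shuffled temporal order
    return F.interpolate(clip, size=(224, 224), mode="bilinear", align_corners=False)
```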
And step three, inputting the strong and weak enhancement data into the self-encoding feature extraction network to extract the features of the strong and weak enhancement data.
The strong enhancement data $S^s$ and the weak enhancement data $S^w$ are respectively input into the self-encoding feature extraction network to extract the spatio-temporal features of the image sequences. Specifically, the invention uses 3D ResNet50 as the self-encoder, i.e. the self-encoding feature extraction network: $S^s$ and $S^w$ are mapped to high-dimensional features by the 3D ResNet50 self-encoder, $F = \phi(S; \theta) \in \mathbb{R}^{T' \times D}$, where $S$ denotes $S^s$ or $S^w$, $\phi(\cdot\,; \theta)$ is the 3D ResNet50 function with parameters $\theta$, and $D = 1028$ is the dimension of the output feature vectors; that is, after passing through the self-encoder, the features $F^s$ of the strong enhancement data and the features $F^w$ of the weak enhancement data are obtained.
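The patent does not reproduce the backbone itself; the sketch below uses torchvision's `r3d_18` video backbone as a stand-in for the 3D ResNet50 encoder $\phi$ (an assumption), with the head replaced to emit a $D$-dimensional feature. For the per-frame sequences of step 4.1 the temporal dimension would be retained before pooling; this sketch collapses it for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18  # stand-in for the 3D ResNet50 encoder

class SelfEncoder(nn.Module):
    """Self-encoding feature extraction network phi: clip -> D-dim feature."""
    def __init__(self, feature_dim: int = 1028):
        super().__init__()
        backbone = r3d_18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, feature_dim)
        self.backbone = backbone

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, C, T, H, W) -> features: (B, D)
        return self.backbone(clips)
```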
Step four: inputting the features of the strong and weak enhancement data into the time consistency behavior alignment network to obtain strong and weak enhanced image features, and searching for and aligning the nearest-neighbor frames of similar actions between the image feature sequence pairs formed from the strong and weak enhanced image features, to obtain the sets of similar-action start frames and end frames between the image feature sequence pairs.
The features $F^s$ of the strong enhancement data and the features $F^w$ of the weak enhancement data are input into the time consistency behavior action alignment network, and the self-supervised alignment loss function of the feature image sequence pairs is computed to obtain the sets of similar-action start frames and end frames between the feature image sequence pairs; these are then sent to the spatio-temporal discriminant feature learning network of the next step, where contrastive learning is used to further extract behavior-discriminative features.
Specifically, the step four is realized by the following substeps:
Step 4.1, construct the time consistency behavior alignment network head: the features $F^s$ of the strong enhancement data and the features $F^w$ of the weak enhancement data are passed through a spatio-temporal global average pooling layer, a fully connected layer and a convolution layer to output the image feature sequences $V^s = \{v^s_1, v^s_2, \dots, v^s_{T'}\}$ and $V^w = \{v^w_1, v^w_2, \dots, v^w_{T'}\}$, where $v^s_i$ is the strong-enhanced image feature of the $i$-th frame and $v^w_i$ is the weak-enhanced image feature of the $i$-th frame;
Step 4.2, for the image feature sequence pair $(V^s, V^w)$ output in step 4.1, the process of searching for the nearest-neighbor frames of similar actions between the image feature sequences is shown in fig. 4: first compute the soft nearest neighbor $\tilde v^w$ of the $i$-th frame strong-enhanced image feature $v^s_i$ in $V^w$; having obtained $\tilde v^w$, compute in turn its nearest-neighbor frame $v^s_k$ in $V^s$; if $i = k$, the alignment of the similar actions between the image feature sequence pair succeeds. To compute the loss function, the $i$-th frame of $V^s$ is labeled 1 and the other frames 0, and the predicted value is
$$\hat y_k = \frac{\exp(\mathrm{sim}(\tilde v^w, v^s_k))}{\sum_j \exp(\mathrm{sim}(\tilde v^w, v^s_j))}, \qquad \tilde v^w = \sum_j \alpha_j\, v^w_j, \qquad \alpha_j = \frac{\exp(\mathrm{sim}(v^s_i, v^w_j))}{\sum_k \exp(\mathrm{sim}(v^s_i, v^w_k))},$$
where $v^s_k$ denotes the strong-enhanced image feature of the $k$-th frame of $V^s$; the loss between the predicted value and the real label is computed with the cross-entropy loss function:
$$\mathcal{L}_{align} = -\sum_{k=1}^{T'} y_k \log \hat y_k,$$
where $v^w_j$ and $v^w_k$ respectively denote the weak-enhanced image features of the $j$-th and $k$-th frames of $V^w$, $\mathrm{sim}(\cdot, \cdot)$ denotes a similarity measure between its two arguments, $y_k$ denotes the real label, and $\hat y_k$ denotes the predicted value.
Step 4.3, record the feature positions with $i = k$ in the image feature sequence pairs of step 4.2: for the input image feature sequence pairs, recording the $i = k$ positions forms the similar-action start frame set $P = \{(p^s_n, p^w_n)\}_{n=1}^{N}$ and the action end frame set $Q = \{(q^s_n, q^w_n)\}_{n=1}^{N}$, where $N$ is the number of successfully aligned image feature sequence pairs, $p^s_n$ and $p^w_n$ respectively denote the alignment start positions of the two image feature sequences, and $q^s_n$ and $q^w_n$ respectively denote their alignment end positions.
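This cycle-back formulation is in the spirit of temporal cycle-consistency learning. Below is a minimal PyTorch sketch for one sequence pair, with a dot product standing in for $\mathrm{sim}(\cdot,\cdot)$ (an assumption) and a hard argmax used only to test whether the cycle closes ($i = k$).

```python
import torch
import torch.nn.functional as F

def cycle_alignment_loss(v_s: torch.Tensor, v_w: torch.Tensor):
    """v_s, v_w: (T, D) strong / weak image feature sequences.
    Returns the alignment loss and the frame indices where the cycle closes."""
    sim_sw = v_s @ v_w.t()                       # (T, T): sim(v_s_i, v_w_j)
    alpha = F.softmax(sim_sw, dim=1)             # soft nearest-neighbour weights alpha_j
    v_tilde = alpha @ v_w                        # (T, D): soft NN of each v_s_i in V^w
    logits = v_tilde @ v_s.t()                   # cycle back: compare with V^s
    labels = torch.arange(v_s.size(0), device=v_s.device)  # real label: i maps back to i
    loss = F.cross_entropy(logits, labels)       # L_align
    aligned = (logits.argmax(dim=1) == labels)   # frames with i == k
    return loss, aligned.nonzero(as_tuple=True)[0]
```

Runs of consecutive aligned indices then give the start and end frame sets $P$ and $Q$ of step 4.3.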
And fifthly, inputting the strong and weak enhanced image features and the image feature sequence pairs into the spatio-temporal discriminant feature extraction network, and completing the self-supervised pre-training on the second quantity of unlabeled video data in combination with the contrastive learning network.
The strong and weak enhanced image features output in step 4.1 are sent into the spatio-temporal discriminant feature extraction network, and same-class and different-class sub-feature-map samples are constructed according to the sets of similar-action start frames and end frames between the image feature sequence pairs output in step 4.3: the sets are mapped to the sub-feature maps corresponding to the strong and weak enhanced image features, the sub-feature-map pairs inside the sets are of the same class, and the other image sequence feature map pairs outside the sets are classified as different classes. The contrastive learning network then computes a contrastive loss function that maximizes the similarity within classes and minimizes the similarity between classes, effectively extracting action-discriminative features and completing the self-supervised pre-training on the second quantity of unlabeled video data.
Specifically, the step five is realized by the following substeps:
Step 5.1, construct the spatio-temporal discriminant feature extraction network head: the image feature sequences $V^s$ and $V^w$ output in step 4.1 are passed through several multi-layer perceptron layers to output the perception feature sequences $Z^s = \{z^s_1, z^s_2, \dots, z^s_{T'}\}$ and $Z^w = \{z^w_1, z^w_2, \dots, z^w_{T'}\}$, where $z^s_i$ is the perception feature map of the $i$-th frame strong-enhanced image and $z^w_i$ is the perception feature map of the $i$-th frame weak-enhanced image;
Step 5.2, set the number of videos input each time as $B$, so that $2B$ image sequences are obtained after the strong and weak data enhancement processing. According to the similar-action start frame and end frame sets $P$ and $Q$ output in step 4.3, unify the lengths of the similar-action sequences by taking the minimum sequence length $\rho = \min_n (q^s_n - p^s_n)$, and sample the perception feature sequences $Z^s$ and $Z^w$ output in step 5.1 between the start and end positions at this length to obtain the sub-sequence feature map pairs, which are marked as same-class positive samples $(q, k_+)$; the remaining unaligned image sequences are different-class negative samples $k_-$, wherein the number of positive samples $k_+$ is $N$ and the number of negative samples $k_-$ is $2\rho B - N$. A contrastive loss function is defined using the cosine similarity:
$$\mathcal{L}_{con} = -\log \frac{\exp(\mathrm{sim}(q, k_+)/\tau)}{\sum_{k} \exp(\mathrm{sim}(q, k)/\tau)},$$
where $q$ denotes the segments of $Z^s$ and $Z^w$ whose similarity is to be computed, $\tau$ is the temperature hyper-parameter, and $\mathrm{sim}(q, k)$ denotes the cosine similarity between $q$ and $k$, with $k$ ranging over $k_+$ and $k_-$.
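This is the standard InfoNCE form over cosine similarities. A minimal sketch follows, assuming the aligned sub-sequence feature pairs have already been pooled to single vectors; the temperature value 0.07 is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q: torch.Tensor, k_pos: torch.Tensor, k_neg: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """q, k_pos: (N, D) aligned positive pairs; k_neg: (2*rho*B - N, D) negatives."""
    q = F.normalize(q, dim=1)          # unit vectors, so dot product = cosine similarity
    k_pos = F.normalize(k_pos, dim=1)
    k_neg = F.normalize(k_neg, dim=1)
    pos = (q * k_pos).sum(dim=1, keepdim=True) / temperature   # (N, 1): sim(q, k+)
    neg = q @ k_neg.t() / temperature                          # (N, M): sim(q, k-)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at index 0
    return F.cross_entropy(logits, labels)
```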
In summary, as shown in fig. 5, to reduce the sample labeling cost, self-supervised pre-training is first performed on the unlabeled data set, and transfer learning is then performed on the small labeled sample data set, i.e. the first quantity. In the pre-training stage, the network is back-propagated from the two loss functions, and the network parameters are continuously updated by the batch gradient descent method until the change of the total loss value falls below a set threshold, at which point training ends. The total loss function is computed as:
$$\mathcal{L}_{total} = \mathcal{L}_{align} + \mathcal{L}_{con}.$$
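The training loop of fig. 5 can be pictured as follows. This is schematic glue only: it reuses `cycle_alignment_loss` and `info_nce_loss` from the sketches above, assumes `align_head` emits per-frame feature sequences, and `strong_augment`, `weak_augment` and `build_contrastive_pairs` are hypothetical helpers; the optimizer settings and threshold are assumptions.

```python
import torch

def pretrain(encoder, align_head, proj_head, loader, epochs: int = 100, eps: float = 1e-4):
    """Self-supervised pre-training on the unlabeled (second-quantity) data."""
    params = (list(encoder.parameters()) + list(align_head.parameters())
              + list(proj_head.parameters()))
    optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9)  # batch gradient descent
    prev = float("inf")
    for _ in range(epochs):
        epoch_loss = 0.0
        for clips in loader:                                    # unlabeled video clips
            v_s = align_head(encoder(strong_augment(clips)))    # (T, D) strong branch
            v_w = align_head(encoder(weak_augment(clips)))      # (T, D) weak branch
            loss_align, aligned = cycle_alignment_loss(v_s, v_w)
            q, k_pos, k_neg = build_contrastive_pairs(proj_head(v_s), proj_head(v_w), aligned)
            loss = loss_align + info_nce_loss(q, k_pos, k_neg)  # total loss L_align + L_con
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if abs(prev - epoch_loss) < eps:                        # loss change below threshold
            break
        prev = epoch_loss
```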
And step six, retaining the self-encoding feature extraction network parameters obtained by self-supervised pre-training, appending a classification network after the self-encoding feature extraction network, then using the first quantity of labeled video data to complete the transfer learning of the network, and finally judging the action normativity in the video images from the change of behavior categories between video frames.
Retain the self-encoder feature extraction network and lock the pre-trained parameters of each layer of the network; append a classification network, consisting of a fully connected layer and a softmax layer, after the self-encoding feature extraction network to output the behavior categories and their confidences; fine-tune the network on the training set of 300 labeled video clips, back-propagating through the network with a cross-entropy loss function and continuously updating the network parameters by the batch gradient descent method, and stop training after 100000 iterations.
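A sketch of this transfer-learning stage follows: the pre-trained encoder is frozen and only the fully connected + softmax head is trained; the class count of 8 is taken from the 8 behavior stages of the embodiment, and the wiring is an assumption.

```python
import torch
import torch.nn as nn

class BehaviorClassifier(nn.Module):
    """Frozen pre-trained encoder plus a fully connected + softmax head."""
    def __init__(self, encoder: nn.Module, feature_dim: int = 1028, num_classes: int = 8):
        super().__init__()
        for p in encoder.parameters():
            p.requires_grad = False              # lock pre-trained parameters of each layer
        self.encoder = encoder
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.encoder(clips)
        return self.fc(feats)                    # logits; train with nn.CrossEntropyLoss

    @torch.no_grad()
    def predict(self, clips: torch.Tensor):
        probs = torch.softmax(self.forward(clips), dim=1)
        conf, cls = probs.max(dim=1)             # behavior category and its confidence
        return cls, conf
```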
In the testing stage, an input video first accumulates 16 frames; thereafter, each newest frame updates the 16-frame image queue. After data enhancement, the queue passes through the self-encoder feature extraction network and the classification network head, which outputs the behavior category and confidence of the current frame image, from which the action normativity is judged.
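A sketch of the sliding 16-frame queue used at test time, assuming `classifier` is the fine-tuned model above and preprocessing has already been applied to each frame:

```python
from collections import deque
import torch

def stream_inference(frames, classifier, window: int = 16):
    """Yield (category, confidence) per frame once 16 frames have accumulated."""
    queue = deque(maxlen=window)                 # newest frame replaces the oldest
    for frame in frames:                         # frame: (C, H, W) tensor
        queue.append(frame)
        if len(queue) < window:
            continue                             # still accumulating the first 16 frames
        clip = torch.stack(list(queue), dim=1).unsqueeze(0)  # (1, C, T, H, W)
        cls, conf = classifier.predict(clip)
        yield cls.item(), conf.item()
```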
In the data labeling stage, behavior category labels are labeled in the order of the behavior actions. Taking the doffing operation of the personal protective equipment of disinfection personnel as an example, the label order corresponding to the behavior action flow shown in fig. 3 is: (0) remove gloves -> (1) disinfect hands -> (2) remove goggles -> (1) disinfect hands -> (4) remove protective clothing -> (5) remove outer shoe covers -> (6) discard protective clothing -> (0) remove gloves -> (1) disinfect hands. The inter-frame behavior label values are recorded, and if a change of the output action labels does not conform to the action flow specification, early warning information is issued.
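A sketch of this normativity check: per-frame labels are collapsed into a transition sequence and compared against the standard flow. The transition table simply encodes the label order listed above; the function name and return convention are assumptions.

```python
# Standard doffing flow from the embodiment, as an ordered label sequence.
STANDARD_FLOW = [0, 1, 2, 1, 4, 5, 6, 0, 1]

def check_flow(frame_labels):
    """Compare the sequence of inter-frame label changes with the standard flow;
    return the frame index of the first violation, or None if compliant."""
    step, prev = 0, None
    for idx, label in enumerate(frame_labels):
        if label == prev:
            continue                     # no inter-frame behavior category change
        prev = label
        if step < len(STANDARD_FLOW) and label == STANDARD_FLOW[step]:
            step += 1                    # action matches the expected stage
        else:
            return idx                   # early warning: non-standard action flow
    return None
```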
On the data set of disinfection personnel doffing personal protective equipment, with the 300 labeled video clips as the training set, the SlowFast algorithm reaches a recognition accuracy of 85.36% on the test set. To verify the effect of the time consistency behavior alignment network head, a network without this head was trained and tested, reaching a recognition accuracy of 90.15%. Fig. 6 shows the multi-class behavior recognition confusion matrix of the time consistency contrastive learning network; the average accuracy of multi-class behavior recognition in the figure is 95.16%. Under the same labeling cost, the accuracy is thus significantly improved.
Corresponding to the embodiment of the action normative detection method based on time consistency comparison learning, the invention also provides an embodiment of an action normative detection device based on time consistency comparison learning.
Referring to fig. 7, the action normative detection device based on time consistency comparison learning according to the embodiment of the present invention includes one or more processors configured to implement the action normative detection method based on time consistency comparison learning of the foregoing embodiment.
The embodiment of the action normative detection device based on time consistency comparison learning can be applied to any equipment with data processing capability, such as a computer. The device embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the device in the logical sense is formed by the processor of the equipment reading the corresponding computer program instructions from non-volatile memory into memory for execution. In terms of hardware, fig. 7 shows a hardware structure diagram of the equipment with data processing capability on which the device is located; besides the processor, memory, network interface and non-volatile memory shown in fig. 7, the equipment may also include other hardware according to its actual function, which is not described again here.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
An embodiment of the present invention further provides a computer-readable storage medium, on which a program is stored, where the program, when executed by a processor, implements the method for detecting the action normativity based on the time consistency comparison learning in the foregoing embodiments.
The computer-readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any equipment with data processing capability described in any of the foregoing embodiments. It may also be an external storage device of that equipment, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a Flash Card provided on the equipment. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the equipment. The computer-readable storage medium is used to store the computer program and the other programs and data required by the equipment, and may also be used to temporarily store data that has been output or is to be output.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Although the foregoing has described the practice of the present invention in detail, it will be apparent to those skilled in the art that modifications may be made to the practice of the invention as described in the foregoing examples, or that certain features may be substituted in the practice of the invention. All changes, equivalents and the like which come within the spirit and principles of the invention are desired to be protected.
Claims (8)
1. An action normative detection method based on time consistency comparison learning, characterized by comprising the following steps:
step one, constructing a data set from videos collected by cameras, a first quantity of them labeled and a second quantity unlabeled, wherein the first quantity is smaller than the second quantity;
step two, performing strong enhancement processing and weak enhancement processing on the second quantity of unlabeled video data to obtain strong enhancement data and weak enhancement data respectively;
step three, inputting the strong and weak enhancement data into a self-encoding feature extraction network to extract the features of the strong and weak enhancement data;
step four, inputting the features of the strong and weak enhancement data into a time consistency behavior action alignment network to obtain strong and weak enhanced image features, and searching for and aligning the nearest-neighbor frames of similar actions between the image feature sequence pairs formed from the strong and weak enhanced image features, to obtain the sets of similar-action start frames and end frames between the image feature sequence pairs;
step five, inputting the strong and weak enhanced image features and the sets of similar-action start frames and end frames between the image feature sequence pairs into a spatio-temporal discriminant feature extraction network, and completing, in combination with a contrastive learning network, the self-supervised pre-training on the second quantity of unlabeled video data;
step six, retaining the self-encoding feature extraction network parameters obtained by self-supervised pre-training, appending a classification network after the self-encoding feature extraction network, then using the first quantity of labeled video data to complete the transfer learning of the network, and finally judging the action normativity in the video images from the change of behavior categories between video frames.
2. The method for detecting the action normativity based on the time consistency comparison learning as claimed in claim 1, wherein the second step is specifically as follows:
let the untrimmed video data be $X = \{x_1, x_2, \dots, x_T\}$, where $x_i$ is the $i$-th video frame and $T$ is the total number of video frames. An image sequence $S = \{s_1, s_2, \dots, s_{T'}\}$ is obtained by sampling from the video $X$, where $s_i \in \mathbb{R}^{H \times W \times 3}$ is an RGB image of size $H \times W$, $T' = \lfloor T/f \rfloor$, and $f$ is the sampling frequency. After data enhancement, the strongly enhanced image sequence is $S^s = \{s^s_1, s^s_2, \dots, s^s_{T'}\}$, i.e. the strong enhancement data, and the weakly enhanced image sequence is $S^w = \{s^w_1, s^w_2, \dots, s^w_{T'}\}$, i.e. the weak enhancement data;
the weak data enhancement mode is color enhancement combined with scale transformation, and the strong data enhancement mode is video segment permutation combined with scale transformation; the video segment permutation is: dividing the video into at most $M$ segments and randomly shuffling the divided segments.
3. The method for detecting the action normativity based on the time consistency comparison learning as claimed in claim 2, wherein the third step is specifically as follows:
using 3D ResNet50 as the self-encoder, i.e. the self-encoding feature extraction network, $S^s$ and $S^w$ are mapped to high-dimensional features by the 3D ResNet50 self-encoder: $F = \phi(S; \theta) \in \mathbb{R}^{T' \times D}$, where $S$ denotes $S^s$ or $S^w$, $\phi(\cdot\,; \theta)$ is the 3D ResNet50 function with parameters $\theta$, and $D$ is the dimension of the output feature vectors; that is, after passing through the self-encoder, the features $F^s$ of the strong enhancement data and the features $F^w$ of the weak enhancement data are obtained.
4. The method for detecting the normative behavior based on the time consistency comparison learning as claimed in claim 3, wherein the fourth step comprises the following substeps:
step 4.1, the features $F^s$ of the strong enhancement data and the features $F^w$ of the weak enhancement data are passed through a spatio-temporal global average pooling layer, a fully connected layer and a convolution layer to output the image feature sequences $V^s = \{v^s_1, v^s_2, \dots, v^s_{T'}\}$ and $V^w = \{v^w_1, v^w_2, \dots, v^w_{T'}\}$, where $v^s_i$ is the strong-enhanced image feature of the $i$-th frame and $v^w_i$ is the weak-enhanced image feature of the $i$-th frame;
step 4.2, for the image feature sequence pair $(V^s, V^w)$ output in step 4.1, search for the nearest-neighbor frames of similar actions between the image feature sequences: first compute the soft nearest neighbor $\tilde v^w$ of the $i$-th frame strong-enhanced image feature $v^s_i$ in $V^w$; having obtained $\tilde v^w$, compute in turn its nearest-neighbor frame $v^s_k$ in $V^s$; if $i = k$, the alignment of the similar actions between the image feature sequence pair succeeds. To compute the loss function, the $i$-th frame of $V^s$ is labeled 1 and the other frames 0, and the predicted value is
$$\hat y_k = \frac{\exp(\mathrm{sim}(\tilde v^w, v^s_k))}{\sum_j \exp(\mathrm{sim}(\tilde v^w, v^s_j))}, \qquad \tilde v^w = \sum_j \alpha_j\, v^w_j, \qquad \alpha_j = \frac{\exp(\mathrm{sim}(v^s_i, v^w_j))}{\sum_k \exp(\mathrm{sim}(v^s_i, v^w_k))},$$
where $v^s_k$ denotes the strong-enhanced image feature of the $k$-th frame of $V^s$; the loss between the predicted value and the real label is computed with the cross-entropy loss function:
$$\mathcal{L}_{align} = -\sum_{k=1}^{T'} y_k \log \hat y_k,$$
where $v^w_j$ and $v^w_k$ respectively denote the weak-enhanced image features of the $j$-th and $k$-th frames of $V^w$, $\mathrm{sim}(\cdot, \cdot)$ denotes a similarity measure between its two arguments, $y_k$ denotes the real label, and $\hat y_k$ denotes the predicted value;
step 4.3, record the feature positions with $i = k$ in the image feature sequence pairs of step 4.2: for the input image feature sequence pairs, recording the $i = k$ positions forms the similar-action start frame set $P = \{(p^s_n, p^w_n)\}_{n=1}^{N}$ and the action end frame set $Q = \{(q^s_n, q^w_n)\}_{n=1}^{N}$, where $N$ is the number of successfully aligned image feature sequence pairs, $p^s_n$ and $p^w_n$ respectively denote the alignment start positions of the two image feature sequences, and $q^s_n$ and $q^w_n$ respectively denote their alignment end positions.
5. The method for detecting the normative behavior based on the time consistency comparison learning as claimed in claim 4, wherein the step five comprises the following substeps:
step 5.1, the image feature sequences $V^s$ and $V^w$ output in step 4.1 are passed through several multi-layer perceptron layers to output the perception feature sequences $Z^s = \{z^s_1, z^s_2, \dots, z^s_{T'}\}$ and $Z^w = \{z^w_1, z^w_2, \dots, z^w_{T'}\}$, where $z^s_i$ is the perception feature map of the $i$-th frame strong-enhanced image and $z^w_i$ is the perception feature map of the $i$-th frame weak-enhanced image;
step 5.2, according to the similar-action start frame and end frame sets $P$ and $Q$ output in step 4.3, unify the lengths of the similar-action sequences by taking the minimum sequence length $\rho = \min_n (q^s_n - p^s_n)$, and sample the perception feature sequences $Z^s$ and $Z^w$ output in step 5.1 between the start and end positions at this length to obtain the sub-sequence feature map pairs, which are marked as same-class positive samples $(q, k_+)$; the remaining unaligned image sequences are different-class negative samples $k_-$, wherein the number of positive samples $k_+$ is $N$ and the number of negative samples $k_-$ is $2\rho B - N$, with $B$ denoting the number of input videos. A contrastive loss function is defined using the cosine similarity:
$$\mathcal{L}_{con} = -\log \frac{\exp(\mathrm{sim}(q, k_+)/\tau)}{\sum_{k} \exp(\mathrm{sim}(q, k)/\tau)},$$
where $q$ denotes the segments of $Z^s$ and $Z^w$ whose similarity is to be computed, $\tau$ is the temperature hyper-parameter, and $\mathrm{sim}(q, k)$ denotes the cosine similarity between $q$ and $k$, with $k$ ranging over $k_+$ and $k_-$.
6. The method for detecting the action normativity based on the time consistency comparison learning as claimed in claim 5, wherein the sixth step is specifically:
retain the self-encoder feature extraction network and lock the pre-trained parameters of each layer of the network; append a classification network, consisting of a fully connected layer and a softmax layer, after the self-encoding feature extraction network to output the behavior categories and their confidences; then use the first quantity of labeled video data to complete the transfer learning of the network, back-propagating through the network with a cross-entropy loss function and continuously updating the network parameters by a batch gradient descent method for iterative training; finally, input the test set data and output the behavior category and confidence of the current frame image to judge the operation normativity.
7. An action normative detection device based on time consistency comparison learning, which is characterized by comprising one or more processors and is used for realizing the action normative detection method based on time consistency comparison learning of any one of claims 1 to 6.
8. A computer-readable storage medium, characterized in that a program is stored thereon, which when executed by a processor, implements the action normative detection method based on time-consistency comparison learning according to any one of claims 1 to 6.
Priority and Related Applications
- Application CN202210454687.0A, filed 2022-04-28 (priority date 2022-04-28), by which this publication claims priority.
- Published as CN114648723A on 2022-06-21; granted as CN114648723B on 2024-08-02 (status: Active).
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant