CN114648723A - Action normative detection method and device based on time consistency comparison learning - Google Patents

Action normative detection method and device based on time consistency comparison learning Download PDF

Info

Publication number
CN114648723A
CN114648723A (application CN202210454687.0A; granted publication CN114648723B)
Authority
CN
China
Prior art keywords
image
data
network
action
enhancement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210454687.0A
Other languages
Chinese (zh)
Other versions
CN114648723B (en)
Inventor
李玲 (Li Ling)
徐晓刚 (Xu Xiaogang)
王军 (Wang Jun)
祝敏航 (Zhu Minhang)
曹卫强 (Cao Weiqiang)
何鹏飞 (He Pengfei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Zhejiang Lab
Original Assignee
Zhejiang Gongshang University
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University and Zhejiang Lab
Priority to CN202210454687.0A
Publication of CN114648723A
Application granted
Publication of CN114648723B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the fields of intelligent video surveillance and deep learning, and in particular to an action normativity detection method and device based on time-consistency contrastive learning. The method comprises the following steps. First, a data set is constructed from camera-collected videos, of which a first quantity are labeled and a second quantity are unlabeled, the first quantity being smaller than the second quantity. Second, the unlabeled videos undergo strong and weak data enhancement and feature extraction; the features are input into a time-consistency behavior alignment network, which outputs feature maps and the set of start and end frames of similar actions across different samples; the set is mapped to the corresponding sub-feature maps on the feature maps, same-class and different-class sub-feature-map samples are constructed, and these are fed into a contrastive learning network to extract spatio-temporally discriminative features. Then, the first quantity of labeled videos are fed into the pre-trained network for transfer learning, which outputs behavior categories. Finally, behavior normativity is judged from the change of behavior categories between frames, and an early warning is issued if the behavior is not normative.

Description

Action normative detection method and device based on time consistency comparison learning
Technical Field
The invention relates to the fields of intelligent video surveillance and deep learning, and in particular to an action normativity detection method and device based on time-consistency contrastive learning.
Background
Medical workers stand on the front line of the fight against epidemics and protect the life and safety of the public. Protective equipment is an important protective barrier for medical workers and reduces the high infection rate caused by exposure. Donning and doffing protective clothing according to the standard procedure is an important measure for preventing infection; if protective clothing is not worn according to the standard, there is a high risk of infection. Standardizing the donning and doffing process therefore effectively prevents an entire team from being quarantined because an individual is infected, and reduces the loss of personnel.
Not only medical personnel must follow standard procedures; other operating fields with high infection risk must also follow standard disinfection and personal-protective-equipment procedures. Existing enforcement of action-flow normativity relies mostly on personnel training and individual attentiveness, which carries a high infection risk. An intelligent monitoring means that observes human actions in real time and judges whether the action flow conforms to the standard is therefore urgently needed.
Existing behavior recognition methods fall into two categories: supervised and unsupervised. Supervised methods need a large number of labeled samples to reach a high recognition accuracy, and the labeling cost is very high; unsupervised methods do not require labeled samples, but their recognition rates are lower than those of supervised methods. Contrastive-learning-based methods are unsupervised: they use a large amount of unlabeled data in a self-supervised pre-training mode to learn the prior distribution of the data, and then perform transfer learning on downstream tasks (image classification, object detection, etc.) to improve downstream performance.
Contrastive learning assumes that data derived from the same sample after data enhancement belong to the same class, while any data not derived from the same sample belong to different classes; features are learned with a contrastive loss function that maximizes the similarity within the same class and minimizes the similarity between different classes. Under the same amount of labeled data, contrastive learning plus transfer learning surpasses supervised methods in image classification and markedly reduces labeling cost. However, when this assumption is applied to the un-clipped video classification task, a major problem arises: when a video segment spans two or more actions, the algorithm has difficulty learning discriminative features, which degrades recognition accuracy on the downstream task. The algorithm therefore needs to be improved so that data with similar actions but from different samples are classified as the same class, and data with different actions are classified as different classes.
Disclosure of Invention
To solve the above technical problems in the prior art, the invention provides a low-cost, high-accuracy action normativity detection method and device based on time-consistency contrastive learning. The specific technical scheme is as follows:
An action normativity detection method based on time-consistency contrastive learning comprises the following steps:
Step one, constructing a data set from camera-collected videos of which a first quantity are labeled and a second quantity are unlabeled, the first quantity being smaller than the second quantity;
Step two, performing strong enhancement and weak enhancement on the second quantity of unlabeled video data to obtain strong enhancement data and weak enhancement data, respectively;
Step three, inputting the strong and weak enhancement data into a self-encoding feature extraction network to extract the features of the strong and weak enhancement data;
Step four, inputting the features of the strong and weak enhancement data into a time-consistency behavior alignment network to obtain strong and weak enhanced image features, and searching for and aligning the nearest-neighbor frames of similar actions between image-feature-sequence pairs composed of the strong and weak enhanced image features, to obtain the set of start frames and end frames of similar actions between the image-feature-sequence pairs;
Step five, inputting the strong and weak enhanced image features and the set of start and end frames of similar actions between the image-feature-sequence pairs into a spatio-temporal discriminative feature extraction network, and completing self-supervised pre-training on the second quantity of unlabeled video data in combination with a contrastive learning network;
Step six, after self-supervised pre-training, retaining the self-encoding feature extraction network parameters, appending a classification network after the self-encoding feature extraction network, completing network transfer learning with the first quantity of labeled video data, and finally judging the action normativity of the video images from the change of behavior categories between video frames.
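As a reading aid (not part of the claimed method), the following sketch gathers the main hyperparameters the steps above refer to in one Python dataclass; values marked as assumed are illustrative defaults, while the feature dimension, test window and fine-tuning iteration count are taken from the embodiment described later.

```python
from dataclasses import dataclass

@dataclass
class Config:
    sampling_frequency: int = 4        # f, video sampling frequency (assumed value)
    max_segments: int = 4              # M, upper bound on shuffled segments in strong enhancement (assumed)
    feature_dim: int = 1028            # c, self-encoder output dimension (from the embodiment)
    temperature: float = 0.07          # tau, contrastive temperature (assumed)
    align_weight: float = 1.0          # lambda_1, weight of the alignment loss (assumed)
    contrast_weight: float = 1.0       # lambda_2, weight of the contrastive loss (assumed)
    batch_videos: int = 8              # B, videos per pre-training batch (assumed)
    test_window: int = 16              # frames accumulated per classification at test time (from the embodiment)
    finetune_iters: int = 100000       # transfer-learning iterations (from the embodiment)
```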
Further, the second step is specifically as follows:
Let the un-clipped video data be X = {x_1, x_2, ..., x_T}, where x_i is the i-th video frame and T is the total number of video frames. An image sequence S = {s_1, s_2, ..., s_n} is sampled from the video X, where s_i ∈ R^{H×W×3} is an RGB image of height H and width W, n = T/f, and f is the sampling frequency. After data enhancement, the strongly enhanced image sequence is S^s, i.e., the strong enhancement data, and the weakly enhanced image sequence is S^w, i.e., the weak enhancement data.
The weak data enhancement mode is color enhancement combined with scale transformation, and the strong data enhancement mode is video-segment permutation combined with scale transformation. Video-segment permutation means: the video is divided into at most M segments, and the divided segments are randomly shuffled.
Further, the third step is specifically as follows:
A 3D ResNet50 is used as the self-encoder, i.e., the self-encoding feature extraction network. S^s and S^w are mapped to high-dimensional features by the 3D ResNet50 self-encoder: h = F(S'), where S' denotes S^s or S^w, F(·) is the 3D ResNet50 function, h ∈ R^{n×c}, and c is the dimension of the output feature vector; that is, after passing through the self-encoder, the features h^s of the strong enhancement data and the features h^w of the weak enhancement data are obtained respectively.
Further, the fourth step comprises the following substeps:
Step 4.1, the features h^s of the strong enhancement data and the features h^w of the weak enhancement data are passed through a spatio-temporal global average pooling layer, a fully connected layer and a convolution layer to output the image feature sequences U = {u_1, u_2, ..., u_n} and V = {v_1, v_2, ..., v_n}, where u_i is the strongly enhanced image feature of the i-th frame and v_i is the weakly enhanced image feature of the i-th frame;
Step 4.2, for the image feature sequences U and V output in step 4.1, the nearest-neighbor frames of similar actions between the image feature sequences are searched. First the nearest neighbor of the i-th frame strongly enhanced image feature u_i in V is computed as the soft nearest neighbor ṽ_i = Σ_j α_ij · v_j, where α_ij = exp(sim(u_i, v_j)) / Σ_k exp(sim(u_i, v_k)); after ṽ_i is obtained, its nearest neighbor u_k in U is computed in turn. If i = k, the alignment of the similar action between the image-feature-sequence pair is successful. To compute the loss function, the i-th frame of U is labeled 1 and the remaining frames are labeled 0, and the predicted value is ŷ_k = exp(sim(ṽ_i, u_k)) / Σ_j exp(sim(ṽ_i, u_j)), where u_k denotes the strongly enhanced image feature of the k-th frame in the image feature sequence U, v_j and v_k denote the weakly enhanced image features of the j-th and k-th frames in the image feature sequence V, and sim(·,·) denotes the similarity measure between two features. The loss between the predicted value and the real label is computed with the cross-entropy loss function L_align = -Σ_k y_k · log(ŷ_k), where y denotes the real label and ŷ denotes the predicted value;
Step 4.3, the feature positions where i = k in the image-feature-sequence pairs of step 4.2 are recorded; for the input image-feature-sequence pairs, recording the i = k positions forms the similar-action start-frame set A = {(a^u_1, a^v_1), ..., (a^u_N, a^v_N)} and the action end-frame set E = {(e^u_1, e^v_1), ..., (e^u_N, e^v_N)}, where N is the number of successfully aligned image-feature-sequence pairs, a^u_m and a^v_m respectively denote the alignment start positions of the image feature sequences, and e^u_m and e^v_m respectively denote the alignment end positions of the image feature sequences.
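As one concrete reading of step 4.2, the sketch below implements a soft nearest-neighbor cycle with a cross-entropy alignment loss in PyTorch; the dot-product similarity and the exact soft formulation are assumptions used to stand in for formulas that appear only as figures in the source.

```python
import torch
import torch.nn.functional as F

def alignment_loss(u, v):
    """Time-consistency alignment loss for one sequence pair.
    u: (n, d) strongly enhanced image features, v: (n, d) weakly enhanced image features."""
    sim_uv = u @ v.t()                          # sim(u_i, v_j), dot product as the assumed similarity
    alpha = F.softmax(sim_uv, dim=1)            # alpha_ij = exp(sim) / sum_k exp(sim)
    v_tilde = alpha @ v                         # soft nearest neighbor of each u_i in V
    sim_back = v_tilde @ u.t()                  # sim(v_tilde_i, u_k) back towards U
    log_y_hat = F.log_softmax(sim_back, dim=1)  # predicted distribution over the frames of U
    target = torch.arange(u.size(0), device=u.device)  # real label: the cycle should return to frame i
    return F.nll_loss(log_y_hat, target)        # cross entropy between prediction and one-hot labels
```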
Further, the fifth step comprises the following substeps:
Step 5.1, the image feature sequences U and V output in step 4.1 are passed through multilayer perceptron layers to output the perception feature sequences Z^s = {z^s_1, z^s_2, ..., z^s_n} and Z^w = {z^w_1, z^w_2, ..., z^w_n}, where z^s_i is the perception feature map of the i-th frame of the strongly enhanced image and z^w_i is the perception feature map of the i-th frame of the weakly enhanced image;
Step 5.2, according to the similar-action start-frame set A and end-frame set E output in step 4.3, the lengths of the similar-action sequences are unified to obtain the minimum sequence length ρ = min_m(e_m - a_m). The perception feature sequences Z^s and Z^w output in step 5.1 are sampled at the start and end positions according to the minimum sequence length to obtain sub-sequence feature-map pairs, which are marked as same-class positive samples (q, k+); the remaining un-aligned image sequences are different-class negative samples k-. The number of positive samples (q, k+) is N and the number of negative samples k- is 2ρB - N, where B denotes the number of input videos. The contrastive loss function is defined with cosine similarity: L_con = -log( exp(sim(q, k+)/τ) / Σ_k exp(sim(q, k)/τ) ), where q denotes a segment whose similarity with all segments of k+ and k- is to be computed, τ is the temperature hyperparameter, and sim(q, k) denotes the cosine similarity between q and k, with k ranging over k+ and k-.
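The contrastive loss of step 5.2 has the standard InfoNCE form described above; a minimal PyTorch sketch follows, with the temperature value 0.07 as an assumed default.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, k_pos, k_neg, tau=0.07):
    """InfoNCE-style contrastive loss with cosine similarity.
    q: (d,) query segment, k_pos: (d,) its same-class positive, k_neg: (m, d) negatives."""
    q = F.normalize(q, dim=0)
    k_pos = F.normalize(k_pos, dim=0)
    k_neg = F.normalize(k_neg, dim=1)
    pos = torch.exp(torch.dot(q, k_pos) / tau)     # similarity with the positive
    neg = torch.exp(k_neg @ q / tau).sum()         # similarities with all negatives
    return -torch.log(pos / (pos + neg))           # maximize same-class, minimize different-class similarity
```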
further, the sixth step is specifically:
and 5.1, reserving a self-encoder feature extraction network, locking pre-training parameters of each layer of the network, adding a classification network after the self-encoding feature extraction network, wherein the classification network comprises a full connection layer and a softmax layer, outputting behavior categories and confidence coefficients thereof, then adopting a first quantity of labeled video data to finish network migration learning, using a cross entropy loss function to perform back propagation on the network, continuously updating network parameters by a batch gradient descent method, performing iterative training, and finally outputting the behavior categories and the confidence coefficients of the current frame image by inputting test set data to judge the operation normalization.
An action normativity detection device based on time-consistency contrastive learning comprises one or more processors configured to implement the above action normativity detection method based on time-consistency contrastive learning.
A computer-readable storage medium has a program stored thereon which, when executed by a processor, implements the above action normativity detection method based on time-consistency contrastive learning.
The invention has the following advantages:
1. The invention realizes a low-cost, high-performance, intelligent action normativity detection method.
2. Aiming at the problem that the existing same-class/different-class sample division rule of contrastive learning yields low behavior recognition accuracy when applied to un-clipped video behavior classification, a time-consistency contrastive learning network is proposed: similar-action feature maps from different samples are aligned and classified as the same class, the other non-similar-action feature maps are classified as different classes, and the network loss function is improved accordingly, which effectively improves behavior recognition accuracy.
3. The invention can effectively recognize multiple classes of actions and judge whether the action flow is standard. It reaches an accuracy of 95.16% on a collected data set of sterilization personnel doffing personal protective equipment, effectively reduces the cost of manual supervision, prevents the infection risk caused by non-standard action flows, is suitable for normative-behavior detection environments, and has wide application value.
Drawings
FIG. 1 is a schematic flow chart of an action normative detection method based on time consistency comparison learning according to the present invention;
FIG. 2 is a network architecture diagram of an action normative detection method based on time consistency comparison learning according to an embodiment of the present invention;
FIG. 3 is an illustration of an example of behavior labeling of the personal protective equipment removal operation flow of the disinfection personnel in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of the time-consistent behavior alignment network of the present invention;
FIG. 5 is a schematic flow chart of the network training phase of the present invention;
FIG. 6 is a diagram of the multi-class behavior recognition confusion matrix effect of the present invention;
fig. 7 is a schematic structural diagram of an action normative detection apparatus based on time consistency comparison learning according to the present invention.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.
As shown in fig. 1 and fig. 2, the action normativity detection method based on time-consistency contrastive learning of the present invention proceeds as follows. First, a data set is constructed from camera-collected videos, of which a first quantity are labeled and a second quantity are unlabeled. Second, the unlabeled videos undergo strong and weak data enhancement and feature extraction; the features are input into the time-consistency behavior alignment network, which outputs feature maps and the set of start and end frames of similar actions across different samples; the set is mapped to the corresponding sub-feature maps on the feature maps, same-class and different-class sub-feature-map samples are constructed, and these are fed into the contrastive learning network to extract spatio-temporally discriminative features. Then, the first quantity of labeled videos are fed into the pre-trained network for transfer learning, and behavior categories are output. Finally, behavior normativity is judged from the change of behavior categories between frames, and an early warning is issued if the behavior is not normative. The method specifically comprises the following steps:
Step one, constructing a data set from camera-collected videos of which a first quantity are labeled and a second quantity are unlabeled, the first quantity being smaller than the second quantity.
The second quantity of unlabeled data is used for contrastive learning, and the first quantity of labeled data is used for transfer learning. A key-action category label is marked at the start frame of each video action, and a behavior-stage label is marked between two key-action start frames. The invention trains the feature extraction network in a self-supervised manner, so the large training set, i.e., the second quantity, does not need to be manually labeled; however, to verify the effectiveness of the algorithm and to perform transfer learning on a small number of labeled samples, a small training set and the test set, i.e., the first quantity, must be labeled. As shown in fig. 3, taking the personal-protective-equipment doffing operation flow of sterilization personnel as an example, 6 key behaviors at behavior start frames in the video are labeled, and a behavior stage is formed between two such behaviors, giving 8 behavior stages in total. In this embodiment, 1240 segments are collected for the training set and 480 segments for the validation set; only 300 segments of the training set are labeled, while all videos in the validation set are labeled.
Step two, performing strong enhancement and weak enhancement on the second quantity of unlabeled video data to obtain strong enhancement data and weak enhancement data, respectively.
Strong and weak data enhancement modes are set: the weak data enhancement mode is color enhancement combined with scale transformation, and the strong data enhancement mode is video-segment permutation combined with scale transformation. Video-segment permutation means: the video is divided into at most M segments, and the divided segments are randomly shuffled.
The un-clipped video data are set as X = {x_1, x_2, ..., x_T}, where x_i is the i-th video frame and T is the total number of video frames. The input of the invention is an image sequence S = {s_1, s_2, ..., s_n} sampled from the video X, where s_i ∈ R^{H×W×3} is an RGB image of height H and width W, n = T/f, and f is the sampling frequency. After data enhancement, the strongly enhanced image sequence is S^s, i.e., the strong enhancement data, and the weakly enhanced image sequence is S^w, i.e., the weak enhancement data.
Step three, inputting the strong and weak enhancement data into the self-encoding feature extraction network to extract the features of the strong and weak enhancement data.
The strong enhancement data and the weak enhancement data are respectively input into the self-encoding feature extraction network to extract the spatio-temporal features of the image sequences. Specifically, the invention uses a 3D ResNet50 as the self-encoder, i.e., the self-encoding feature extraction network. S^s and S^w are mapped to high-dimensional features by the 3D ResNet50 self-encoder: h = F(S'), where S' denotes S^s or S^w, F(·) is the 3D ResNet50 function, h ∈ R^{n×c}, and c = 1028 is the dimension of the output feature vector; that is, after passing through the self-encoder, the features h^s of the strong enhancement data and the features h^w of the weak enhancement data are obtained respectively.
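A minimal sketch of the self-encoding feature extraction, for illustration only: torchvision ships no 3D ResNet50, so its 18-layer r3d_18 video backbone stands in here, with the classifier removed and a linear projection to the 1028-dimensional output mentioned above; the patent pipeline additionally keeps per-frame features for the alignment head, which this clip-level sketch omits.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18  # stand-in for the 3D ResNet50 self-encoder

class SelfEncoder(nn.Module):
    """Maps an enhanced image sequence to a high-dimensional feature vector h = F(S')."""
    def __init__(self, out_dim=1028):            # the text reports a 1028-dimensional output
        super().__init__()
        backbone = r3d_18(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
        self.proj = nn.Linear(512, out_dim)       # r3d_18 ends in 512 channels

    def forward(self, clip):                      # clip: (batch, 3, frames, H, W)
        h = self.features(clip).flatten(1)        # (batch, 512)
        return self.proj(h)                       # (batch, out_dim)

# Usage: features for the strongly and weakly enhanced sequences of one small batch.
encoder = SelfEncoder()
strong_clip = torch.randn(2, 3, 16, 112, 112)
h_strong = encoder(strong_clip)                   # features of the strong enhancement data
```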
Step four, inputting the features of the strong and weak enhancement data into the time-consistency behavior alignment network to obtain strong and weak enhanced image features, and searching for and aligning the nearest-neighbor frames of similar actions between image-feature-sequence pairs composed of the strong and weak enhanced image features, to obtain the set of start frames and end frames of similar actions between the image-feature-sequence pairs.
The features h^s of the strong enhancement data and the features h^w of the weak enhancement data are input into the time-consistency behavior alignment network, and the self-supervised alignment loss function of the feature-sequence pairs is computed to obtain the set of start frames and end frames of similar actions between the feature-sequence pairs; the result is then sent to the spatio-temporal discriminative feature learning network of the next step, which further extracts behavior-discriminative features by contrastive learning.
Specifically, step four is realized by the following substeps:
Step 4.1, a time-consistency behavior alignment network head is constructed. The features h^s of the strong enhancement data and the features h^w of the weak enhancement data are passed through a spatio-temporal global average pooling layer, a fully connected layer and a convolution layer to output the image feature sequences U = {u_1, u_2, ..., u_n} and V = {v_1, v_2, ..., v_n}, where u_i is the strongly enhanced image feature of the i-th frame and v_i is the weakly enhanced image feature of the i-th frame;
Step 4.2, for the image feature sequences U and V output in step 4.1, the process of searching for the nearest-neighbor frames of similar actions between the image feature sequences is shown in fig. 4. First the nearest neighbor of the i-th frame strongly enhanced image feature u_i in V is computed as the soft nearest neighbor ṽ_i = Σ_j α_ij · v_j, where α_ij = exp(sim(u_i, v_j)) / Σ_k exp(sim(u_i, v_k)); after ṽ_i is obtained, its nearest neighbor u_k in U is computed in turn. If i = k, the alignment of the similar action between the image-feature-sequence pair is successful. To compute the loss function, the i-th frame of U is labeled 1 and the remaining frames are labeled 0, and the predicted value is ŷ_k = exp(sim(ṽ_i, u_k)) / Σ_j exp(sim(ṽ_i, u_j)), where u_k denotes the strongly enhanced image feature of the k-th frame in the image feature sequence U, v_j and v_k denote the weakly enhanced image features of the j-th and k-th frames in the image feature sequence V, and sim(·,·) denotes the similarity measure between two features. The loss between the predicted value and the real label is computed with the cross-entropy loss function L_align = -Σ_k y_k · log(ŷ_k), where y denotes the real label and ŷ denotes the predicted value.
Step 4.3, the feature positions where i = k in the image-feature-sequence pairs of step 4.2 are recorded; for the input image-feature-sequence pairs, recording the i = k positions forms the similar-action start-frame set A = {(a^u_1, a^v_1), ..., (a^u_N, a^v_N)} and the action end-frame set E = {(e^u_1, e^v_1), ..., (e^u_N, e^v_N)}, where N is the number of successfully aligned image-feature-sequence pairs, a^u_m and a^v_m respectively denote the alignment start positions of the image feature sequences, and e^u_m and e^v_m respectively denote the alignment end positions of the image feature sequences.
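The bookkeeping of step 4.3 can be made concrete as follows: a frame is aligned when the nearest-neighbor cycle returns to its own index (i = k), and maximal runs of aligned frames give the start and end positions collected into the sets A and E; the run extraction and the dot-product similarity are assumptions of this sketch.

```python
import torch

def aligned_start_end(u, v):
    """Return (start, end) index pairs of maximal runs of cycle-consistent frames
    for one image-feature-sequence pair (u: (n, d) strong, v: (n, d) weak)."""
    j = (u @ v.t()).argmax(dim=1)                 # nearest neighbor of u_i in V
    k = (v[j] @ u.t()).argmax(dim=1)              # nearest neighbor of that frame back in U
    aligned = (torch.arange(u.size(0)) == k)      # i == k marks a successful alignment
    runs, start = [], None
    for i, ok in enumerate(aligned.tolist()):
        if ok and start is None:
            start = i                             # a run of aligned frames begins
        elif not ok and start is not None:
            runs.append((start, i - 1))           # close the run: (start frame, end frame)
            start = None
    if start is not None:
        runs.append((start, len(aligned) - 1))
    return runs                                   # contributes to the sets A (starts) and E (ends)
```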
Step five, inputting the strong and weak enhanced image features and the image-feature-sequence pairs into the spatio-temporal discriminative feature extraction network, and completing self-supervised pre-training on the second quantity of unlabeled video data in combination with the contrastive learning network.
The strong and weak enhanced image features output in step 4.1 are sent into the spatio-temporal discriminative feature extraction network. Same-class and different-class sub-feature-map samples are constructed from the set of similar-action start frames and end frames between the image-feature-sequence pairs output in step 4.3: the set is mapped to the sub-feature maps corresponding to the strong and weak enhanced image features, the sub-feature-map pairs inside the set are treated as the same class, and the other image-sequence feature-map pairs not in the set are classified as different classes. The contrastive loss function is then computed with the contrastive learning network, maximizing the similarity within the same class and minimizing the similarity between different classes, so that action-discriminative features are effectively extracted and the self-supervised pre-training on the second quantity of unlabeled video data is completed.
Specifically, step five is realized by the following substeps:
Step 5.1, a spatio-temporal discriminative feature extraction network head is constructed. The image feature sequences U and V output in step 4.1 are passed through multilayer perceptron layers to output the perception feature sequences Z^s = {z^s_1, z^s_2, ..., z^s_n} and Z^w = {z^w_1, z^w_2, ..., z^w_n}, where z^s_i is the perception feature map of the i-th frame of the strongly enhanced image and z^w_i is the perception feature map of the i-th frame of the weakly enhanced image;
Step 5.2, the number of videos input each time is set to B, so 2B image sequences are obtained after strong and weak data enhancement processing. According to the similar-action start-frame set A and end-frame set E output in step 4.3, the lengths of the similar-action sequences are unified to obtain the minimum sequence length ρ = min_m(e_m - a_m). The perception feature sequences Z^s and Z^w output in step 5.1 are sampled at the start and end positions according to the minimum sequence length to obtain sub-sequence feature-map pairs, which are marked as same-class positive samples (q, k+); the remaining un-aligned image sequences are different-class negative samples k-. The number of positive samples (q, k+) is N and the number of negative samples k- is 2ρB - N. The contrastive loss function is defined with cosine similarity: L_con = -log( exp(sim(q, k+)/τ) / Σ_k exp(sim(q, k)/τ) ), where q denotes a segment whose similarity with all segments of k+ and k- is to be computed, τ is the temperature hyperparameter, and sim(q, k) denotes the cosine similarity between q and k, with k ranging over k+ and k-.
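A sketch of how step 5.2 can turn the aligned runs into positive pairs and the remaining frames into negatives before applying the contrastive loss; sampling each run to length ρ by linear indexing and mean-pooling the sub-sequences are assumptions, as is the temperature default.

```python
import torch
import torch.nn.functional as F

def build_pairs_and_loss(z_s, z_w, runs, rho, tau=0.07):
    """z_s, z_w: (n, d) perception feature sequences of one strong/weak pair.
    runs: list of (start, end) aligned positions from step 4.3, rho: minimum sequence length."""
    losses = []
    for a, e in runs:
        idx = torch.linspace(a, e, rho).long()            # unify each aligned run to length rho
        q = F.normalize(z_s[idx].mean(dim=0), dim=0)      # query: pooled strongly enhanced sub-sequence
        k_pos = F.normalize(z_w[idx].mean(dim=0), dim=0)  # positive: the aligned weak sub-sequence
        mask = torch.ones(z_w.size(0), dtype=torch.bool)
        mask[idx] = False
        k_neg = F.normalize(z_w[mask], dim=1)             # negatives: the remaining, un-aligned frames
        pos = torch.exp(torch.dot(q, k_pos) / tau)
        neg = torch.exp(k_neg @ q / tau).sum()
        losses.append(-torch.log(pos / (pos + neg)))
    return torch.stack(losses).mean() if losses else torch.tensor(0.0)
```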
in summary, as shown in fig. 5, to reduce the sample labeling cost, the self-supervised pre-training is performed on the unlabeled data set, and then the migration learning is performed on a small amount of sample data sets, i.e. the first amount of sample data sets; in the pre-training stage, the network is reversely propagated based on two loss functions, network parameters are continuously updated through a batch gradient descent method until the change of the total loss function value is smaller than a set threshold value, the training is finished, and the training is stopped, wherein the calculation formula of the total loss function is as follows:
Figure DEST_PATH_IMAGE141
wherein
Figure DEST_PATH_IMAGE142
Figure DEST_PATH_IMAGE143
Are the weight values of two loss functions.
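The pre-training objective and stopping rule reduce to a few lines; the weights and threshold below are assumed values, since the text does not give them.

```python
def total_loss(l_align, l_con, lam1=1.0, lam2=1.0):
    """L_total = lam1 * L_align + lam2 * L_con (weights are assumed, not given in the text)."""
    return lam1 * l_align + lam2 * l_con

def should_stop(prev_total, curr_total, threshold=1e-4):
    """Stop pre-training once the change of the total loss value is below the set threshold."""
    return prev_total is not None and abs(prev_total - curr_total) < threshold
```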
Step six, after self-supervised pre-training, the self-encoding feature extraction network parameters are retained, a classification network is appended after the self-encoding feature extraction network, network transfer learning is completed with the first quantity of labeled video data, and the action normativity of the video images is finally judged from the change of behavior categories between video frames.
Specifically, the self-encoder feature extraction network is retained and the pre-trained parameters of each layer of the network are frozen. A classification network comprising a fully connected layer and a softmax layer is appended after the self-encoding feature extraction network and outputs the behavior category and its confidence. The network is fine-tuned on the training set in which 300 video segments are labeled: it is back-propagated with a cross-entropy loss function, the network parameters are continuously updated by batch gradient descent, and training stops after 100000 iterations.
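A sketch of the transfer-learning stage: the pre-trained encoder is frozen and only the appended fully connected layer is trained with cross entropy; the feature dimension follows the 1028 mentioned earlier, the 8 classes follow the behavior stages of fig. 3, and the optimizer settings are left to the caller as assumptions.

```python
import torch
import torch.nn as nn

class BehaviorClassifier(nn.Module):
    """Frozen pre-trained encoder plus a classification head (fully connected layer + softmax)."""
    def __init__(self, encoder, feat_dim=1028, num_classes=8):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False               # lock the pre-trained parameters of every layer
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, clip):
        with torch.no_grad():
            h = self.encoder(clip)
        return self.fc(h)                         # logits; softmax/argmax gives class and confidence

def finetune_step(model, clips, labels, optimizer):
    logits = model(clips)
    loss = nn.functional.cross_entropy(logits, labels)   # cross-entropy loss for transfer learning
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                              # batch gradient descent update
    return loss.item()
```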
In the testing stage, a video is input and 16 frames of images are first accumulated; thereafter the 16-frame image queue is updated with the latest frame each time. After data enhancement, the frames pass through the self-encoder feature extraction network and the classification network head, the behavior category and confidence of the current frame image are output, and action normativity is judged.
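The 16-frame sliding-window inference just described could look as follows; the preprocessing hook and the streaming interface are assumptions.

```python
from collections import deque
import torch

def run_inference(frame_stream, model, preprocess, window=16):
    """Accumulate 16 frames, then classify the current window each time a new frame arrives.
    preprocess is assumed to return a (3, H, W) tensor matching the training transform."""
    queue = deque(maxlen=window)
    for frame in frame_stream:
        queue.append(preprocess(frame))
        if len(queue) < window:
            continue                                        # still accumulating the first 16 frames
        clip = torch.stack(list(queue), dim=1).unsqueeze(0)  # (1, 3, 16, H, W)
        probs = torch.softmax(model(clip), dim=1)[0]
        conf, label = probs.max(dim=0)
        yield int(label), float(conf)                       # behavior category and confidence per frame
```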
In the data labeling stage, behavior category labels are labeled according to the order of the behavior actions. Taking the personal-protective-equipment doffing operation of sterilization personnel as an example, the label order corresponding to the behavior flow shown in fig. 3 is: (0) remove gloves -> (1) disinfect hands -> (2) remove goggles -> (1) disinfect hands -> (4) remove protective clothing -> (5) remove outer shoe covers -> (6) discard the protective clothing -> (0) remove gloves -> (1) disinfect hands. The inter-frame behavior label values are recorded, and if the change of the output action labels does not conform to the action-flow specification, early-warning information is issued.
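A sketch of the normativity check: the per-frame labels are collapsed to their change points and compared against the expected flow listed above (the flow in the source text skips category 3); how a real deployment segments and resets the sequence is an assumption left out here.

```python
# Expected doffing flow as listed in the text above.
EXPECTED_FLOW = [0, 1, 2, 1, 4, 5, 6, 0, 1]

def is_flow_normative(frame_labels, expected=EXPECTED_FLOW):
    collapsed = []
    for lab in frame_labels:
        if not collapsed or collapsed[-1] != lab:
            collapsed.append(lab)                 # keep only the inter-frame label changes
    # the observed change sequence must follow the expected flow from its beginning
    return collapsed == expected[:len(collapsed)]

def check_and_warn(frame_labels):
    if not is_flow_normative(frame_labels):
        print("Warning: action flow does not conform to the specification")  # early-warning hook
```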
On the data set of sterilization personnel doffing personal protective equipment, 300 labeled video segments are used as the training set. On the test set, the recognition accuracy of the SlowFast algorithm is 85.36%. To verify the effect of the time-consistency behavior alignment network head, a network without this head is also trained and tested, reaching a recognition accuracy of 90.15%. Fig. 6 shows the multi-class behavior recognition confusion matrix of the time-consistency contrastive learning network; its average multi-class recognition accuracy is 95.16%. Under the same labeling cost, the accuracy is thus markedly improved.
Corresponding to the embodiment of the motion normalization detection method based on time consistency comparison learning, the invention also provides an embodiment of a motion normalization detection device based on time consistency comparison learning.
Referring to fig. 7, the apparatus for detecting motion normativity based on time consistency comparison learning according to the embodiment of the present invention includes one or more processors, and is configured to implement the method for detecting motion normativity based on time consistency comparison learning according to the foregoing embodiment.
The embodiment of the action normativity detection device based on time-consistency contrastive learning can be applied to any equipment with data processing capability, such as a computer. The device embodiment may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the device in the logical sense is formed by the processor of the equipment with data processing capability reading the corresponding computer program instructions from the non-volatile memory into memory and running them. From a hardware perspective, fig. 7 shows a hardware structure diagram of the equipment with data processing capability where the action normativity detection device based on time-consistency contrastive learning is located; in addition to the processor, memory, network interface and non-volatile memory shown in fig. 7, the equipment where the device of the embodiment is located may also include other hardware according to its actual functions, which is not described again here.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
An embodiment of the present invention further provides a computer-readable storage medium, on which a program is stored, where the program, when executed by a processor, implements the method for detecting the action normativity based on the time consistency comparison learning in the foregoing embodiments.
The computer-readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any device with data processing capability described in any of the foregoing embodiments. The computer-readable storage medium may also be an external storage device of that device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a Flash memory card (Flash Card) provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of any device with data processing capability. The computer-readable storage medium is used for storing the computer program and the other programs and data required by the device, and may also be used for temporarily storing data that has been output or is to be output.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Although the foregoing has described the practice of the present invention in detail, it will be apparent to those skilled in the art that modifications may be made to the practice of the invention as described in the foregoing examples, or that certain features may be substituted in the practice of the invention. All changes, equivalents and the like which come within the spirit and principles of the invention are desired to be protected.

Claims (8)

1. An action normativity detection method based on time-consistency contrastive learning, characterized by comprising the following steps:
step one, constructing a data set from camera-collected videos of which a first quantity are labeled and a second quantity are unlabeled, the first quantity being smaller than the second quantity;
step two, performing strong enhancement processing and weak enhancement processing on the second quantity of unlabeled video data to obtain strong enhancement data and weak enhancement data, respectively;
step three, inputting the strong and weak enhancement data into a self-encoding feature extraction network to extract the features of the strong and weak enhancement data;
step four, inputting the features of the strong and weak enhancement data into a time-consistency behavior alignment network to obtain strong and weak enhanced image features, and searching for and aligning the nearest-neighbor frames of similar actions between image-feature-sequence pairs composed of the strong and weak enhanced image features, to obtain the set of start frames and end frames of similar actions between the image-feature-sequence pairs;
step five, inputting the strong and weak enhanced image features and the set of start and end frames of similar actions between the image-feature-sequence pairs into a spatio-temporal discriminative feature extraction network, and completing self-supervised pre-training on the second quantity of unlabeled video data in combination with a contrastive learning network;
step six, after self-supervised pre-training, retaining the self-encoding feature extraction network parameters, appending a classification network after the self-encoding feature extraction network, completing network transfer learning with the first quantity of labeled video data, and finally judging the action normativity of the video images from the change of behavior categories between video frames.
2. The action normativity detection method based on time-consistency contrastive learning as claimed in claim 1, wherein the second step is specifically as follows:
the un-clipped video data are set as X = {x_1, x_2, ..., x_T}, where x_i is the i-th video frame and T is the total number of video frames; an image sequence S = {s_1, s_2, ..., s_n} is sampled from the video X, where s_i ∈ R^{H×W×3} is an RGB image of height H and width W, n = T/f, and f is the sampling frequency; after data enhancement, the strongly enhanced image sequence is S^s, i.e., the strong enhancement data, and the weakly enhanced image sequence is S^w, i.e., the weak enhancement data;
the weak data enhancement mode is color enhancement combined with scale transformation, and the strong data enhancement mode is video-segment permutation combined with scale transformation; video-segment permutation means: the video is divided into at most M segments, and the divided segments are randomly shuffled.
3. The action normativity detection method based on time-consistency contrastive learning as claimed in claim 2, wherein the third step is specifically as follows:
a 3D ResNet50 is used as the self-encoder, i.e., the self-encoding feature extraction network; S^s and S^w are mapped to high-dimensional features by the 3D ResNet50 self-encoder: h = F(S'), where S' denotes S^s or S^w, F(·) is the 3D ResNet50 function, h ∈ R^{n×c}, and c is the dimension of the output feature vector; that is, after passing through the self-encoder, the features h^s of the strong enhancement data and the features h^w of the weak enhancement data are obtained respectively.
4. The action normativity detection method based on time-consistency contrastive learning as claimed in claim 3, wherein the fourth step comprises the following substeps:
step 4.1, the features h^s of the strong enhancement data and the features h^w of the weak enhancement data are passed through a spatio-temporal global average pooling layer, a fully connected layer and a convolution layer to output the image feature sequences U = {u_1, u_2, ..., u_n} and V = {v_1, v_2, ..., v_n}, where u_i is the strongly enhanced image feature of the i-th frame and v_i is the weakly enhanced image feature of the i-th frame;
step 4.2, for the image feature sequences U and V output in step 4.1, the nearest-neighbor frames of similar actions between the image feature sequences are searched; first the nearest neighbor of the i-th frame strongly enhanced image feature u_i in V is computed as the soft nearest neighbor ṽ_i = Σ_j α_ij · v_j, where α_ij = exp(sim(u_i, v_j)) / Σ_k exp(sim(u_i, v_k)); after ṽ_i is obtained, its nearest neighbor u_k in U is computed in turn; if i = k, the alignment of the similar action between the image-feature-sequence pair is successful; to compute the loss function, the i-th frame of U is labeled 1 and the remaining frames are labeled 0, and the predicted value is ŷ_k = exp(sim(ṽ_i, u_k)) / Σ_j exp(sim(ṽ_i, u_j)), where u_k denotes the strongly enhanced image feature of the k-th frame in the image feature sequence U, v_j and v_k denote the weakly enhanced image features of the j-th and k-th frames in the image feature sequence V, and sim(·,·) denotes the similarity measure between two features; the loss between the predicted value and the real label is computed with the cross-entropy loss function L_align = -Σ_k y_k · log(ŷ_k), where y denotes the real label and ŷ denotes the predicted value;
step 4.3, the feature positions where i = k in the image-feature-sequence pairs of step 4.2 are recorded; for the input image-feature-sequence pairs, recording the i = k positions forms the similar-action start-frame set A = {(a^u_1, a^v_1), ..., (a^u_N, a^v_N)} and the action end-frame set E = {(e^u_1, e^v_1), ..., (e^u_N, e^v_N)}, where N is the number of successfully aligned image-feature-sequence pairs, a^u_m and a^v_m respectively denote the alignment start positions of the image feature sequences, and e^u_m and e^v_m respectively denote the alignment end positions of the image feature sequences.
5. The action normativity detection method based on time-consistency contrastive learning as claimed in claim 4, wherein the fifth step comprises the following substeps:
step 5.1, the image feature sequences U and V output in step 4.1 are passed through multilayer perceptron layers to output the perception feature sequences Z^s = {z^s_1, z^s_2, ..., z^s_n} and Z^w = {z^w_1, z^w_2, ..., z^w_n}, where z^s_i is the perception feature map of the i-th frame of the strongly enhanced image and z^w_i is the perception feature map of the i-th frame of the weakly enhanced image;
step 5.2, according to the similar-action start-frame set A and end-frame set E output in step 4.3, the lengths of the similar-action sequences are unified to obtain the minimum sequence length ρ = min_m(e_m - a_m); the perception feature sequences Z^s and Z^w output in step 5.1 are sampled at the start and end positions according to the minimum sequence length to obtain sub-sequence feature-map pairs, which are marked as same-class positive samples (q, k+); the remaining un-aligned image sequences are different-class negative samples k-; the number of positive samples (q, k+) is N and the number of negative samples k- is 2ρB - N, where B denotes the number of input videos; the contrastive loss function is defined with cosine similarity: L_con = -log( exp(sim(q, k+)/τ) / Σ_k exp(sim(q, k)/τ) ), where q denotes a segment whose similarity with all segments of k+ and k- is to be computed, τ is the temperature hyperparameter, and sim(q, k) denotes the cosine similarity between q and k, with k ranging over k+ and k-.
6. The action normativity detection method based on time-consistency contrastive learning as claimed in claim 5, wherein the sixth step is specifically as follows:
the self-encoder feature extraction network is retained and the pre-trained parameters of each layer of the network are frozen; a classification network comprising a fully connected layer and a softmax layer is appended after the self-encoding feature extraction network and outputs the behavior category and its confidence; network transfer learning is then completed with the first quantity of labeled video data: the network is back-propagated with a cross-entropy loss function, and the network parameters are continuously updated by batch gradient descent for iterative training; finally, test-set data are input, the behavior category and confidence of the current frame image are output, and operation normativity is judged.
7. An action normativity detection device based on time-consistency contrastive learning, characterized by comprising one or more processors configured to implement the action normativity detection method based on time-consistency contrastive learning as claimed in any one of claims 1 to 6.
8. A computer-readable storage medium, characterized in that a program is stored thereon which, when executed by a processor, implements the action normativity detection method based on time-consistency contrastive learning as claimed in any one of claims 1 to 6.
CN202210454687.0A 2022-04-28 2022-04-28 Action normalization detection method and device based on time consistency comparison learning Active CN114648723B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210454687.0A CN114648723B (en) 2022-04-28 2022-04-28 Action normalization detection method and device based on time consistency comparison learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210454687.0A CN114648723B (en) 2022-04-28 2022-04-28 Action normalization detection method and device based on time consistency comparison learning

Publications (2)

Publication Number Publication Date
CN114648723A true CN114648723A (en) 2022-06-21
CN114648723B CN114648723B (en) 2024-08-02

Family

ID=81997635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210454687.0A Active CN114648723B (en) 2022-04-28 2022-04-28 Action normalization detection method and device based on time consistency comparison learning

Country Status (1)

Country Link
CN (1) CN114648723B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030538A (en) * 2023-03-30 2023-04-28 中国科学技术大学 Weak supervision action detection method, system, equipment and storage medium
CN116758562A (en) * 2023-08-22 2023-09-15 杭州实在智能科技有限公司 Universal text verification code identification method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180204111A1 (en) * 2013-02-28 2018-07-19 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
WO2019136591A1 (en) * 2018-01-09 2019-07-18 深圳大学 Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
CA3104951A1 (en) * 2019-03-21 2020-09-24 Illumina, Inc. Artificial intelligence-based sequencing
CN113011562A (en) * 2021-03-18 2021-06-22 华为技术有限公司 Model training method and device
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer
WO2022001489A1 (en) * 2020-06-28 2022-01-06 北京交通大学 Unsupervised domain adaptation target re-identification method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180204111A1 (en) * 2013-02-28 2018-07-19 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
WO2019136591A1 (en) * 2018-01-09 2019-07-18 深圳大学 Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
CA3104951A1 (en) * 2019-03-21 2020-09-24 Illumina, Inc. Artificial intelligence-based sequencing
WO2022001489A1 (en) * 2020-06-28 2022-01-06 北京交通大学 Unsupervised domain adaptation target re-identification method
CN113011562A (en) * 2021-03-18 2021-06-22 华为技术有限公司 Model training method and device
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李玺 (Li Xi); 查宇飞 (Zha Yufei); 张天柱 (Zhang Tianzhu); 崔振 (Cui Zhen); 左旺孟 (Zuo Wangmeng); 侯志强 (Hou Zhiqiang); 卢湖川 (Lu Huchuan); 王菡子 (Wang Hanzi): "A survey of object tracking algorithms based on deep learning" (深度学习的目标跟踪算法综述), Journal of Image and Graphics (中国图象图形学报), no. 12, 16 December 2019 (2019-12-16) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030538A (en) * 2023-03-30 2023-04-28 中国科学技术大学 Weak supervision action detection method, system, equipment and storage medium
CN116030538B (en) * 2023-03-30 2023-06-16 中国科学技术大学 Weak supervision action detection method, system, equipment and storage medium
CN116758562A (en) * 2023-08-22 2023-09-15 杭州实在智能科技有限公司 Universal text verification code identification method and system
CN116758562B (en) * 2023-08-22 2023-12-08 杭州实在智能科技有限公司 Universal text verification code identification method and system

Also Published As

Publication number Publication date
CN114648723B (en) 2024-08-02

Similar Documents

Publication Publication Date Title
CN114648723A (en) Action normative detection method and device based on time consistency comparison learning
CN111814661B (en) Human body behavior recognition method based on residual error-circulating neural network
US9070041B2 (en) Image processing apparatus and image processing method with calculation of variance for composited partial features
WO2022218396A1 (en) Image processing method and apparatus, and computer readable storage medium
CN110689043A (en) Vehicle fine granularity identification method and device based on multiple attention mechanism
CN111612100B (en) Object re-identification method, device, storage medium and computer equipment
CN113283282A (en) Weak supervision time sequence action detection method based on time domain semantic features
CN114842343A (en) ViT-based aerial image identification method
CN116977725A (en) Abnormal behavior identification method and device based on improved convolutional neural network
CN116580243A (en) Cross-domain remote sensing scene classification method for mask image modeling guide domain adaptation
Ataş Performance Evaluation of Jaccard-Dice Coefficient on Building Segmentation from High Resolution Satellite Images
CN110795599A (en) Video emergency monitoring method and system based on multi-scale graph
CN116561649B (en) Diver motion state identification method and system based on multi-source sensor data
CN116434150B (en) Multi-target detection tracking method, system and storage medium for congestion scene
CN117152504A (en) Space correlation guided prototype distillation small sample classification method
CN117333766A (en) Intelligent interactive remote sensing information extraction method and system combined with large visual model
CN115359484A (en) Image processing method, device, equipment and storage medium
CN112989869B (en) Optimization method, device, equipment and storage medium of face quality detection model
CN114581769A (en) Method for identifying houses under construction based on unsupervised clustering
CN111860441A (en) Video target identification method based on unbiased depth migration learning
CN111382712A (en) Palm image recognition method, system and equipment
CN117035802B (en) Consensus method for predicting animal health based on capacity demonstration double test
CN118470608B (en) Weak supervision video anomaly detection method and system based on feature enhancement and fusion
CN117935030B (en) Multi-label confidence calibration method and system for double-view-angle correlation perception regularization
CN114239753B (en) Migratable image identification method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant