CN114494314A - Timing boundary detection method and timing sensor - Google Patents
Timing boundary detection method and timing sensor Download PDFInfo
- Publication number
- CN114494314A CN114494314A CN202111615241.3A CN202111615241A CN114494314A CN 114494314 A CN114494314 A CN 114494314A CN 202111615241 A CN202111615241 A CN 202111615241A CN 114494314 A CN114494314 A CN 114494314A
- Authority
- CN
- China
- Prior art keywords
- boundary
- video
- layer
- attention
- query
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 64
- 238000007906 compression Methods 0.000 claims abstract description 51
- 230000006835 compression Effects 0.000 claims abstract description 48
- 238000000034 method Methods 0.000 claims abstract description 31
- 230000006870 function Effects 0.000 claims abstract description 25
- 230000008569 process Effects 0.000 claims abstract description 16
- 230000009466 transformation Effects 0.000 claims abstract description 10
- 230000009471 action Effects 0.000 claims abstract description 7
- 238000012549 training Methods 0.000 claims description 42
- 239000011159 matrix material Substances 0.000 claims description 21
- 238000013507 mapping Methods 0.000 claims description 19
- 230000002123 temporal effect Effects 0.000 claims description 18
- 238000010606 normalization Methods 0.000 claims description 10
- 238000005070 sampling Methods 0.000 claims description 9
- 238000004422 calculation algorithm Methods 0.000 claims description 7
- 238000004458 analytical method Methods 0.000 claims description 6
- 238000000605 extraction Methods 0.000 claims description 6
- 238000005259 measurement Methods 0.000 claims description 6
- 238000004590 computer program Methods 0.000 claims description 5
- 239000000284 extract Substances 0.000 claims description 5
- 230000004913 activation Effects 0.000 claims description 4
- 238000011176 pooling Methods 0.000 claims description 4
- 230000001131 transforming effect Effects 0.000 claims description 4
- 238000012216 screening Methods 0.000 claims description 3
- 238000012545 processing Methods 0.000 claims description 2
- 238000003860 storage Methods 0.000 claims description 2
- 238000004364 calculation method Methods 0.000 abstract description 10
- 230000007246 mechanism Effects 0.000 abstract description 8
- 238000012805 post-processing Methods 0.000 abstract description 6
- 230000001427 coherent effect Effects 0.000 abstract 1
- 230000000875 corresponding effect Effects 0.000 description 19
- 238000012360 testing method Methods 0.000 description 9
- 238000010586 diagram Methods 0.000 description 6
- 244000103676 Elateriospermum tapos Species 0.000 description 4
- 235000019580 granularity Nutrition 0.000 description 4
- 238000005457 optimization Methods 0.000 description 4
- 230000000694 effects Effects 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 239000000654 additive Substances 0.000 description 2
- 230000000996 additive effect Effects 0.000 description 2
- 230000008901 benefit Effects 0.000 description 2
- 230000008859 change Effects 0.000 description 2
- 230000011218 segmentation Effects 0.000 description 2
- 230000007704 transition Effects 0.000 description 2
- 238000012800 visualization Methods 0.000 description 2
- 238000009825 accumulation Methods 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 230000002596 correlated effect Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 230000002688 persistence Effects 0.000 description 1
- 238000002360 preparation method Methods 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 239000004576 sand Substances 0.000 description 1
- 238000004904 shortening Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/13—Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/75—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A time sequence boundary detection method and a time sequence perceptron are based on a transformation decoder structure and an attention mechanism, a universal non-category time sequence action detection model is established, a small amount of hidden feature query quantity is introduced into a coder of the detection model, input features are compressed to fixed dimensionality through a cross attention mechanism, and the features are decoded by the transformation decoder, so that the sparse detection of the universal non-category time sequence boundary is realized. The invention effectively solves the time sequence redundancy problem of the long video through the feature compression, and reduces the complexity of the quadratic model to the linear level; constructing two implicit characteristic query quantities, namely a boundary query quantity and a context query quantity, so as to correspondingly process a non-semantic-coherent boundary region and a coherent context region in the video and fully utilize the semantic structure of the video; an alignment loss function based on cross attention calculation is provided, so that the network can be converged quickly and stably; and the boundary position is sparsely coded by using a transform decoder, so that complex post-processing is avoided, and the generalization performance of the model is improved.
Description
Technical Field
The invention belongs to the technical field of computer software, relates to video time sequence boundary detection, and provides a time sequence boundary detection method and a time sequence sensor.
Background
As video data on the internet grows explosively, video content understanding becomes an important issue in the field of computer vision. In the past literature, the search for understanding long videos is still insufficient. Class-free temporal boundary detection is an effective technique to bridge the gap between understanding of long and short videos, and aims to segment long videos into a series of video segments. Categoriess timing boundaries are timing boundaries that occur naturally due to semantic discontinuities and are not constrained by any predefined semantic categories, including categorical timing boundaries of different granularity at the sub-action level, event level, and scene level, among others, in existing datasets. For the detection of the non-category time sequence boundary with different granularities, the time sequence structure and the context relationship under different scales are acquired by information of different levels.
Currently, due to differences in timing boundary semantics and granularity, the research of classless timing boundary detection is divided into a number of different tasks. The goal of the temporal action segmentation task is to detect sub-action level classless temporal boundaries that segment an action instance into multiple different sub-action segments. Generic time boundary detection aims at locating the classless timing boundary at the event level, i.e. the moment of action/topic/environmental change. Movie scene segmentation detects category-free timing boundaries at the scene level, i.e., transitions between movie scenes, that mark transitions of high-level episodes. The target videos of these tasks have the same semantic structure, and their boundary detection paradigm exhibits similar features. Previous work on these tasks has focused primarily on feature coding that is carefully designed for specific boundaries and resolves the boundary detection problem into a dense prediction problem. In the prediction process, the work adopts a complex post-processing technology to eliminate false positive examples of repeated prediction of the same true value, which exist in a great number of results. Such complex design and post-processing modules are highly correlated with specific boundary types, and therefore cannot be well generalized to different types of non-category boundary detection, and lack of generalization.
Disclosure of Invention
The invention aims to solve the problems that: the existing paradigm of boundary-free timing boundary detection has similar properties, but is scattered to be studied in different tasks due to the difference of boundary semantics and granularity. The existing related work mainly focuses on feature coding which is well designed for specific boundaries, and because a dense prediction paradigm adopts a complex post-processing technology to eliminate false positive examples, the method cannot be well popularized to different types of non-category boundary detection.
The technical scheme of the invention is as follows: the time sequence boundary detection method comprises the following steps of constructing a non-category time sequence boundary detection network to carry out time sequence boundary detection on a video, wherein the detection network comprises a backbone network and a detection model, and the implementation mode is as follows:
1) generating a detection sample by a backbone network: sampling video interval to obtain video image sequenceGenerating a video segment from each frame, wherein the ith video segment is formed by the ith frame image fiBy generating video features for an incoming video segment by a backbone networkAnd continuous scoringFiAnd SiRespectively scoring the RGB characteristics and continuity of the video segment i;
2) performing non-category time sequence action detection by a detection model based on the video characteristics F and the continuity score S, wherein the detection model comprises the following configurations:
2.1) encoder: an encoder: the encoder E comprises NeThe system comprises a plurality of layers of transformation decoding layers which are connected in series, wherein each layer comprises a multi-head self-attention layer, a multi-head cross-attention layer and a linear mapping layer, the self-attention layer, the cross-attention layer and the linear mapping layer are respectively provided with a residual error structure, and M hidden characteristic query quantities Q are introduced into a codereThe video features F are sorted in a descending order based on the continuity scoring S and then input into an encoder, and the encoder compresses the sorted video features into compression features H of M frames and initial compression features H0Is 0, change at the j-th layerDecoding layer, hidden feature query quantity QeAnd the compression characteristic H of the current layerjAdding, interacting with reordered video features in the cross attention layer through the self attention layer and the residual error structure thereof, and transforming through the residual error structure-linear mapping layer-residual error structure to obtain a compression feature Hj+1,j∈[1,(Ne-1)]By stacking NeAfter coding layer, realizing the compression and coding of input characteristics to obtain the compression characteristics
The generation of the hidden feature query quantity comprises the following steps: hidden feature query quantity QeIs divided into MbNumber of boundary queries and McRandomly initializing the context query quantity, and generating the context query quantity along with the learning of a training sample in the process of training a detection model; the boundary query quantity corresponds to the boundary region feature in the video feature, the context query quantity corresponds to the context region feature in the video feature, and M is before and after reordering in the video featurebOne feature is a boundary region feature, and the other feature is a context feature;
2.2) decoder: the decoder D comprises NdThe decoding layers are connected in series, each layer comprises a multi-head self-attention layer, a multi-head cross attention layer and a linear mapping layer, and the self-attention layer, the cross attention layer and the linear mapping layer are respectively provided with a residual error structure; for the compression characteristic H obtained by the encoder, the decoder carries out time sequence boundary point analysis by transforming the decoder structure, and the decoder defines NpIndividual nomination query quantity QdNumber of nomination queries QdLike the hidden feature query quantity, training and learning generation are carried out after random initialization, and a boundary nomination B is initialized00, in the j-th layer, the nomination query quantity QdAnd boundary nomination BjAdding, interacting with the compression characteristic H at the cross attention layer through the self-attention layer and the primary residual error structure, and obtaining an updated boundary nomination B through the transformation of the residual error structure-linear mapping layer-residual error structurej+1(ii) a By stacking of NdAfter decoding layer, realizing analysis of compression characteristics to obtain timing sequence boundary nomination representation
2.3) generation and scoring of time-series class-free boundaries: for the obtained timing boundary nomination representation B, two different fully-connected layer branches are fed: the system comprises a positioning branch and a classification branch, wherein the two branches are respectively used for outputting time sequence class boundary-free time and confidence score;
2.4) assigning training labels: a strict one-to-one training label matching strategy is adopted: obtaining a group of optimal one-to-one matching by using a Hungarian algorithm according to a defined matching cost C, wherein each prediction which is allocated to a non-category boundary truth value obtains a positive sample label, and the corresponding boundary truth value is a training target; the matching cost C consists of a position cost and a classification cost, the position cost is defined on the basis of the absolute value of the distance between the prediction time and the boundary true value time, and the classification cost is defined on the basis of the prediction confidence coefficient;
2.5) time-sequential class-boundary-free submission: after a series of time sequence class-free boundaries are generated, screening out the most credible time sequence class-free boundary time through a confidence score threshold gamma, and submitting the time sequence class-free boundary time for subsequent performance measurement;
3) a training stage: training the configured model by adopting a training sample, using cross entropy, L1 distance and log function as loss functions, using an AdamW optimizer, updating network parameters by a back propagation algorithm, and continuously repeating the step 1) and the step 2) until the iteration times are reached;
4) and (3) detection: inputting the video characteristic sequence and continuity score of the data to be tested into the trained detection model, generating time sequence class-free boundary time and score, and obtaining the time sequence class-free boundary time sequence for performance measurement by the method of 2.3).
The present invention also provides a timing perceptron having a computer storage medium having a computer program embodied therein for implementing the above-described classless timing boundary detection network, the computer program when executed implementing the above-described timing boundary detection method.
The invention provides a universal and unified framework for processing different types of non-category time sequence boundary detection, and can compress redundant video input into stable and reliable characteristic representation based on an attention mechanism and a video semantic structure, thereby reducing the complexity of a model; from the perspective of global context, the position and confidence score of any non-category time sequence boundary are given sparsely, efficiently and accurately.
Compared with the prior art, the invention has the following advantages
The invention provides a general class-free time sequence boundary detection paradigm, which provides a high-efficiency unified method for any class-free time sequence boundary detection based on a transformation structure and an attention mechanism.
The invention introduces a small group of learnable hidden feature query quantity as an anchor frame to compress redundant video input. The hidden feature query quantity compresses input to a hidden feature space with a fixed size while ensuring important boundary information through a cross attention mechanism, and reduces the space-time complexity of a model with an input length square level to linear complexity.
The invention constructs an effective hidden feature query quantity structure for time sequence boundary detection, comprising a boundary query quantity and a context query quantity. In order to better utilize the semantic structure of the boundary and context components in the video, the video features are also divided into boundary region features and context region features. Carrying out targeted extraction on the boundary region characteristics by using the boundary query quantity; and clustering the context areas by the context query quantity, and compressing redundant context content into a plurality of context clustering centers.
The invention utilizes the alignment loss function to align the boundary query quantity and the boundary region characteristics one to one, thereby effectively reducing the training difficulty, shortening the convergence time, forming stable compression characteristics and improving the positioning prediction performance.
The invention utilizes the transformation decoder and the training label matching strategy corresponding to one, effectively utilizes the global context information to carry out boundary prediction, and generates and positions the non-category time sequence boundary position with more accurate sparse and high-efficiency without complex post-processing technology.
The invention has the characteristics of universality, high efficiency, accuracy and the like on the time sequence boundary detection task. Compared with the existing method, the invention achieves better prediction precision and faster reasoning speed on the subclass-free time sequence boundary data sets of the sub-action level, the event level and the scene level, and embodies the generalization performance of the model.
Drawings
FIG. 1 is a system framework diagram used by the present invention.
Fig. 2 is a schematic diagram of the frame extraction process of the video according to the present invention.
Fig. 3 is a schematic diagram of an encoder of the present invention.
Fig. 4 is a schematic diagram of a decoder of the present invention.
FIG. 5 is a schematic diagram of boundary nomination generation and submission of the present invention.
Fig. 6 shows the results of model efficiency comparisons of the present invention on a sample MovieNet dataset.
Fig. 7 shows the results of a comparison of the invention on a sample MovieNet dataset and previous work.
FIG. 8 shows the results of the present invention comparing the previous work with the Kinetics-GEBD and TAPOS data set samples.
Fig. 9 shows the visualization of the results of the present invention on a sample of the MovieNet dataset.
FIG. 10 is a schematic overview of the process of the present invention.
Detailed Description
The invention constructs a non-category time sequence boundary detection network to detect the time sequence boundary of the video, wherein the detection network comprises a backbone network and a detection model. The detection model is a time sequence perception model (Temporal driver) and is a universal non-category time sequence boundary detection framework, a small group of hidden feature query quantity is introduced as an anchor frame, and redundancy is compressed and input to a fixed dimension through a cross attention mechanism, so that a non-category time sequence boundary detection task is realized. The method of the present invention includes a sample generation phase, a network configuration phase, a training phase and a testing phase, as shown in fig. 10, which are specifically described as follows.
1) And a sample generation stage: using a backbone network based on ResNet50 and timing convolutional layers,and generating samples of the training and testing videos, wherein the samples comprise video characteristics F and continuity marks S. For each video, sampling all picture frames corresponding to the video at intervals of tau frames to obtain a video image sequenceIn a video image sequence LfUpper acquisition NfA video segment with the length of 2k frames, wherein the ith video segment is an image sequence formed by each continuous k frames before and after the ith frame image NfIs the length of the sequence of video images and is also the number of video segments. Each video segment picture sequence Ls,iSent to a backbone network, and output D-dimensional RGB characteristic F corresponding to the ith frame through a pre-training and parameter fine-tuning convolution layer, a pooling layer and a full-connection layeriAnd a continuity score Si. The characteristics and the scores of different video segments are spliced according to the time sequence respectively to obtain the characteristics of the whole videoAnd continuous scoringWherein the sampling interval τ represents a fine degree of time division in the global direction; the size of the video segment length 2k represents the local field of view range of the feature. In order to reduce the time complexity and simultaneously keep more local information, in the embodiment of the invention, preferably τ is 3, k is 5, and D is 2048. The specific implementation is as follows:
extracting picture frames from the original video by using denseflow, sampling all the picture frames of the video at intervals of 3 frames to obtain the video with the length of NfOf the video image sequence Lf. And calling a transform packet of the torchvision library, and scaling each picture frame to obtain a corresponding image with the dimension of 224 x 224. On videoImage sequence LfUpper acquisition NfA video segment of length 2 k-10 frames, wherein the ith video segment Ls,iConsisting of successive 5 frames preceding and following the image of the ith frame of video, i.e.A sequence of video segments comprising all video segments is recordedVideo segment Ls,i∈R10×224×224Sending the data into a backbone network, and obtaining an intermediate characteristic F through a ResNet50 network with pre-training and fine-tuning parametersmid,i∈R10×2048,Fmid,iThen, the RGB characteristic F of the video band is obtained through a time sequence convolution layer and a pooling layeri∈R2048Finally F is addediInputting the full connection layer to obtain a continuous score SiE.g. R. Features F of different video segmentsiAnd score SiRespectively spliced according to time sequence to obtain the characteristics of the whole video And continuous scoringVideo feature and continuity scoring is processed by forming a series of segment input models through a sliding window method, wherein the window length NwsAnd the window is 100, and no overlapping frame is taken when the window is slid. The method comprises the following specific steps:
1. the overall video segment sequence obtained after frame extraction and sampling is as follows:
Ls,i={fi-5,fi-4,fi-3,fi-2,fi-1,fi+1,fi+2,fi+3,fi+4,fi+5,}
wherein VfRepresenting a sequence of video segments consisting of NfA segment L of an image sequences,iEach image sequence segment comprises 10 images (2 k).
2. The process by which the backbone network processes the sequence of input video images is as follows:
Fmid,i=Resnet50(Ls,i)
Fi=MaxPooling(Tconv(Fmid,i))
Si=FC(Fi)
wherein Fmid,iRepresenting intermediate features of incoming video segments processed through the Resnet50 network, FiIs Fmid,iVideo segment characteristics, S, obtained after time series convolutioniAnd F, a continuity scoring result is obtained, wherein F is a characteristic sequence formed by splicing different video segments according to a time sequence, and S is a continuity scoring sequence formed by splicing different video segments according to the time sequence.
2) In the network configuration stage, a universal non-category time sequence action detection model Temporal driver is established based on a transformation decoder structure and an attention mechanism, and the model comprises the following configurations:
2.1) encoder: based on the continuity score S generated in 1), carrying out projection and descending sorting on the video features F generated in 1) to obtain FrerankIntroducing M learnable hidden feature query quantities QeAnd a compression characteristic H initialized to 00The reordering features are compressed into compression features H of M frames using a transform decoder structure. The encoder E comprises NeThe layers are connected in series to form a conversion decoding layer, and each layer is marked as an EncoderjJ represents a coding layer number, j ∈ [0 ], (N)e-1)]。EncoderjIs HjThe output is Hj+1. Each EncoderjComprising a multi-headed self-attention layer MSAjA multi-head cross attention layer MCAjA linear mapping layer FFNjAnd three residual structures formed by the addition and layer normalization operations.
Multi-head self-attention layer MSAjThe inputs of (1) are key parameter K, query parameter Q and value parameter V, and the key and query of the self-attention mechanism are the same input. The key parameters and the query parameters are multiplied by themselves and then normalized by a Softmax function to obtain a weight matrix AsAnd multiplying the value parameter by the weight matrix to obtain an output. The multi-head structure divides the input into multiple parts along the parameters, inputs the multiple parts into the self-attention layer respectively, and finally splices the results along the channel. Multi-head cross attention layer MCAjThe input comprises key, inquiry and value parameters, the key parameters and the inquiry parameters are multiplied and subjected to Softmax normalization to obtain a cross attention weight AcValue parameter and cross attention weight AcMultiplying to obtain an output. The multi-head structure divides the input into a plurality of parts along the parameters respectively, inputs the cross attention layers respectively, and finally splices the results along the channels.
Preferred N in the inventioneWhen M is 60, the number of branches of the multi-head self-attention layer is 8, and the number of branches of the multi-head cross-attention layer is 8. MSA in encoderjKey and query parameter of (2) is a hidden feature query quantity QeAnd compression characteristic HjAdding the result, the value parameter being the compression characteristic HjWeight matrix As∈RM×MRecording MSAjThe output after residual structure is H'j。MCAjKey parameter ofrerankSum of video features and corresponding position codes Frerank+PrerankValue parameter is FrerankQuery parameter is MSAjOutput of H'jAnd hidden feature query quantity QeSum, weight matrix Ac∈RN×MRecording MCAjThe output after the residual structure is H ″)j. Encoder E by Stack N e6 coding layers, compression and coding of input features, the last layerEncoding layer Encoder5The output of (1) is the compression characteristic H ═ H output by the encoder E6. The specific calculation is as follows:
1. reordering transform of video feature and position coding:
Frerank=sort(fcproj(F),S)
Prerank=sort(P,S)
p is a position code formed after a sin function is taken according to the time sequence relative position corresponding to F, and the position code corresponds to F one by one; fcprojChanging from input dimension D2048 to model dimension D for video featuresmodelThe subscript proj is an abbreviation for project for the linear projection layer used 512.
2. Encoder for coding layer of certain layer in coderjThe encoding process of (2):
H′j=LayerNorm(Hj+MSAj(Hj,Qe))
H″j=LayerNorm(MCAj(H′j,Qe,Frerank,Prerank)+H′j)
Hj+1=LayerNorm(FFNj(H″j)+H″j)
3. encoding process of multi-headed self-attention layer MSA:
SAh(xh,qh)=xhWv,hAs,h
As,h=Softmax((xh+qh)Wk,h·(xh+qh)Wq,h)
wherein N ishRepresenting the number of head branches in a multi-headed self-attentive layer, Wk,h、Wq,h、Wv,hAnd WoThe h subscripts refer to different branches, and the projection matrix parameters of each layer and each branch are not shared. SAhIndicating single-headedAttention layer, xh,qhFor slices of feature x and its position code q along the channel dimension, respectively, the slices have a total of NhPart (A)s,hThe self-attention matrix of the h-th branch is represented.
4. Coding process of multi-head cross attention layer MCA:
CAh(xh,qx,h,yh,qy,h)=yhWv,hAc,h
Ac,h=Softmax((yh+qy,h)Wk,h·(xh+qx,h)Wq,h)
wherein, Wk,h、Wq,h、Wv,hAnd WoThe projection matrix parameters of each layer and each branch are not shared, and the projection matrix of the cross attention layer is not shared with the self attention layer. CAhRepresenting a single cross attention layer, xh,yh,qx,h,qy,hFor x, y and corresponding position codes qx,qySlicing along the channel dimension, Ac,hThe cross attention matrix of the h-th branch is represented.
5. The encoding process of the encoder:
Hj+1=Encoderj(Hj;Qe,Frerank,Prerank)
H=H6=Encoder5(Encoder4(Encoder3(Encoder2(Encoder1(Encoder0(H0))))))
2.2) the invention introduces a hidden characteristic query quantity Q to the encodereThe generation is specifically as follows: in order to better utilize the semantic structure of the video, the query quantity Q with hidden characteristicseIs divided into MbNumber of boundary queries and McAmount of individual context queries, Mb=48,Mc=12(ii) a The reordering features are divided into boundary region features and context region features, top MbEach feature is a boundary region feature. The definition of the boundary region feature and the context region feature is based on a continuity score S, and after the video features are reordered according to the descending order of S, the top M with higher score is obtainedbEach feature forms a boundary region feature, leaving a context region feature. Hidden boundary query quantity Q of the inventioneFor learnable parameters, the hidden boundary query quantity Q is based on the above descriptioneAnd (4) random initialization, which is generated through learning in the model training process. Front MbDefining the query quantity as boundary query quantity, and combining the boundary region characteristics in a compression process in a one-to-one mode; the rest of McThe individual query quantity is context query quantity, and the context region characteristics are clustered in the compression process to obtain MbEach cluster center characterizes context information.
2.3) further, in order to increase the hidden boundary query quantity QeAnd aligning the hidden feature query quantity with the video features in training according to the encoding effect of the video features: in order to accelerate the model convergence and obtain more stable compression characteristics, the method introduces additional supervision constraint and calculates the alignment loss function based on the last layer of cross attention moment array. The computation of the loss function is the first M of the pair matrixb×MbThe sum of the diagonal weights of the regions takes a negative logarithm, the diagonal weights are maximized by minimizing a loss function, and the boundary query quantity and the features are aligned one by one. The specific calculation is as follows:
alignment loss function LalignThe calculation process of (2):
wherein alpha isalign1 is a parameter of the alignment loss function, which is only calculated for the last cross attention layer of the encoder.
2.4) decoder: for the compression characteristic H obtained in 2.1), N is usedpLearnable nomination query quantity QdAnd boundary nomination B initialized to 00Performing time sequence boundary point analysis by changing decoder structure, and nominating query quantity QdThe method is the same as the hidden feature query quantity, and is generated by learning in training after random initialization; the decoder D comprises NdLayer transform decoding layer, layer j decoding layer is denoted as DecoderjInputting as boundary nomination representation BjThe output is Bj+1. Each decoding layer comprises a multi-head self-attention layer, a multi-head cross-attention layer, a linear mapping layer and a residual error structure formed by adding and layer normalizing operations in three times. At the j-th layer, the nomination query quantity QdAnd boundary nomination BjAdding, interacting with the compression characteristic H at the cross attention layer through the self-attention layer and the primary residual error structure, and obtaining an updated boundary nomination B through the transformation of the residual error structure-linear mapping layer-residual error structurej+1. By stacking of NdAnd after decoding the layers, realizing analysis of compression characteristics to obtain a timing sequence boundary nomination representation B.
Preferred N in the inventiond=6,NpThe number of branches of the multi-head self-attention layer is 8, and the number of branches of the multi-head cross-attention layer is 8. MSAjThe key and the query parameter are nomination query quantity QdAnd boundary nomination representation BjAdding result, value parameter for boundary nomination representing BjWeight matrixMemory access Module (MSA)jThe output after residual structure is B'j。MCAjThe key parameters of (1) are compression characteristic H and hidden characteristic query quantity Q as compression position codeeSum of H + QeValue parameter is H, query parameter is MSAjOutput of B'jAnd nomination query quantity QdSum, weight matrixMemo MCAjThe output after the residual structure is B ″)j. Decoder D and encoder E are symmetric by stacking NdDecoding of boundary nomination is realized, and the last layer of decoding layer Decoder5Output B of6B, the final output of the decoder is put into the fully-connected branch for boundary position and confidence prediction. The specific calculation is as follows:
1. decoder for decoding layer of DecoderjThe decoding process of (2):
B′j=LayerNorm(Bj+MSAj(Bj,Qd))
B″j=LayerNorm(MCAj(B′j,Qd,H,Qe)+B′j)
Bj+1=LayerNorm(FFNj(B″j)+B″j)
2. decoding process of the decoder:
Bj+1=Decoderj(Bj;Qd,H,Qe)
B=B6=Decoder5(Decoder4(Decoder3(Decoder2(Decoder1(Decoder0(B0))))))
2.5) generation and scoring of time-series class-free boundaries: for the timing boundary nomination representation B obtained in 2.4), two different fully-connected layer branches are fed: positioning branch HeadlocAnd a classification branch HeadclsThe two branches are used for outputting time and confidence scores of time sequence class-free boundaries respectively. The predicted boundary time is a decimal between 0 and 1, which represents the relative position in the current segment; the confidence score is two scores, including a positive class confidence and a negative class confidence, with a higher score representing a greater probability of being a corresponding class. The classification branch is composed of a layer of full connection layer, and the input and output feature dimensions are 512 and 2 respectively; the positioning branch consists of a multilayer perceptron consisting of three fully-connected layers and a Sigmoid activation function, and the input and output dimensions are 512, 512 and 1 respectively. The specific calculation is as follows:
1. prediction of location time t for category-free motion:
t=sigmoid(fc2(fc1(fc0(B))))
whereinRecording three full connection layers of the positioning branch as fc2,fc1,fc0The input and output dimensions are 512, 512 and 512, 1 respectively.
2. Two classification confidence score pposGeneration of (1):
ppos,pneg=fc(B)
wherein, the full connection layer of the confidence branch is fc, and the positive case fraction p is takenposAs a confidence score, the input dimension is 512 and the output dimension is 2.
2.6) assigning training labels: a strict one-to-one training label matching strategy is adopted: obtaining a group of optimal one-to-one matching by using a Hungarian algorithm according to a defined matching cost C, wherein each prediction which is allocated to a non-category boundary true value obtains a positive sample label, and the corresponding boundary true value is a training target; the matching cost C is composed of a position cost and a classification cost, the position cost is defined on the basis of the absolute value of the distance between the predicted time and the boundary true value, and the classification cost is defined on the basis of the prediction confidence coefficient. The specific calculation is as follows:
1. optimization indexes of the Hungarian algorithm are as follows:
recording an optimization index as C, wherein the optimization index has two components of position cost and classification cost, each component has a corresponding weight, and the weights are respectively recorded as alphalocAnd alphacls. Preferred alpha in the inventionloc=5,αcls=1。
2. Definition of the optimization component:
Lcls,n=-ppos,n
among the optimized components, the nth predicted location component Lloc,nFrom predicted boundary times tnPosition corresponding to boundary truth valueMeasured by the absolute value of the distance of (1); nth predicted classification component Lcls,nConfidence p with the predicted time as boundarypos,nThe metric, since C is the minimization target, is negative to confidence; σ () is a mapping from predictions to boundary truth values, and σ (n) is the action truth value for the nth prediction.
2.7) time-sequential class-boundary-free submission: and after a series of time sequence class-free boundaries are generated, screening out the most credible time sequence class-free boundary moment through a confidence score threshold gamma, and submitting the time sequence class-free boundary moment for subsequent performance measurement. When p ispos,nIf the predicted position is greater than or equal to gamma, then the predicted position is submitted as a result, when ppos,n<γ, the predicted location is discarded. In the present invention, γ is preferably 0.9.
3) A training stage: training the configured model by adopting a training sample, using cross entropy and L1 distance as a loss function based on a final result, using a log function as a loss function based on an intermediate result, using an AdamW optimizer, updating network parameters by a back propagation algorithm, and continuously repeating the step 1) and the step 2) until the iteration times are reached;
4) and (3) a testing stage: inputting the video characteristic sequence and the continuity score of the data to be tested into a trained Temporal driver model, generating time sequence class-free boundary time and score, and obtaining the time sequence class-free boundary time sequence for performance measurement by a method of 2.5).
The invention provides a time sequence sensor model, and relates to a universal non-category time sequence boundary detection framework. The following is a further description by way of specific examples. High inference speed and high accuracy are achieved through training and testing on a TAPOS dataset, a Kinetics-GEBD dataset and a MovieNet/MovieSecenes dataset, and the method is preferably implemented by using a Python 3.8.8 programming language and a Pythroch 1.7.0 deep learning framework.
Fig. 1 shows a system framework diagram used in the present invention, and the specific implementation steps are as follows:
1) in the preparation phase for generating the sample, as shown in fig. 2, both the training data and the test data are processed in the same manner. The video is extracted into picture frames by using denseflow, the picture frame sequence is sampled at intervals of tau-3, the picture frames are scaled and transformed into a scale of 224 x 224 by using a transform packet of torchvision, and finally the picture frames are transformed into a tensor form and normalized. And taking each frame in the picture frame sequence as a center, inputting each front and rear k-5 frames into a backbone network to obtain a characteristic and continuity score corresponding to the frame, and splicing along a time sequence dimension to obtain a video characteristic and video continuity score. The feature and continuity score of the video level is segmented into a series of video window segments of equal length into the model.
2) In the configuration stage of the model, firstly, based on continuity scoring, as shown in fig. 3, for the extracted video features, the program projects the video features to a lower feature dimension first, and performs descending order sorting with the corresponding feature codes according to the continuity scoring, and inputs the video features to the encoder. The encoder includes alternating multi-headed self-attention modules, multi-headed cross-attention modules, linear mapping layers, and additive and layer normalized residual structures. A set of learnable hidden feature query quantities, including a boundary query quantity and a context query quantity, is input to an encoder. The compression features are initialized to 0, participate in the computation of the coding layers and accumulate the features extracted from the input features by each coding layer. The hidden feature query quantity can be regarded as a position code of the compression feature. And the compression characteristics and the position codes are added to obtain position information to participate in attention calculation.
Firstly, adding the compression characteristic and the hidden characteristic query quantity into a key and a query input self-attention layer, independently inputting the key and the query as value parameters, and extracting mutual information for characteristic updating through modeling the relationship between the compression characteristics in the self-attention layer through global self-attention. And adding and carrying out layer normalization after updating and before updating the compression characteristics to obtain an intermediate representation. Then, the intermediate representation and the hidden feature query quantity as the position code are added again to be used as a query quantity to input the cross attention module, the video feature is input to be used as a value, and the corresponding position code is added to be used as a key to input the cross attention layer. The cross attention layer extracts boundary features and clusters context features from the sorted video features, extracts useful video features by calculating point multiplication activation on an input video sequence through compression features and hidden feature query quantity, updates the compression features, and achieves the effects of feature extraction and complexity reduction. And finally, the result of the cross attention layer passes through a residual structure, the projection of a linear mapping layer and another residual structure to obtain the compression characteristic after accumulation and updating.
The step of decoding the encoded features to obtain the final result representation, i.e. the aforementioned step 2.4), is shown in fig. 4. The input nomination query quantity and boundary nomination are input into a multi-head self-attention layer to strengthen nomination representation, and then input into a cross attention layer. And the learned hidden feature query quantity is used as a position code of a compression feature and participates in cross attention calculation. And the cross attention layer extracts boundary position information of the compression features, finds the weight of each time dimension position of the compression features through a cross attention moment matrix multiplied by the compression features and the boundary nomination, and correspondingly extracts corresponding features. And accumulating the extracted boundary representation in each layer by boundary nomination to finally obtain a decoding result. There is an additive normalized residual transform between the self-attention layer and the cross-attention layer, followed by a second residual transform, a linear projection layer projection and a third residual transform.
Decoding and submission of timing boundary nomination as shown in fig. 5. And respectively inputting the boundary nomination representation into the positioning and classifying branches represented by the full connection layer to obtain position and confidence degree scores. The positioning branch comprises three full connection layers (fc) and a sigmoid activation function, and the position is finally obtained. The classification branch is divided into two classes, including a layer of fc, and a confidence score corresponding to the current prediction position is obtained. The confidence score is screened and if above the threshold γ 0.9, the prediction is submitted as the final prediction.
3) In the training stage, cross entropy, L1 distance and a negative logarithm function are used as loss functions in the example, an AdamW optimizer is used, the batch size is set to be 64, namely 64 window samples are taken from a training set for training each time, the total number of training rounds is set to be 100, the initial learning rate is 2e-4, the learning rate has no attenuation strategy, and the model is trained on a NVIDIA RTX 2080ti GPU. The process from an original video to a time sequence boundary result is divided into two stages, and in the first stage, a backbone network is finely adjusted on a data set based on pre-training parameters to obtain video characteristics and a continuity score; in the second stage, the Temporal driver is trained and tested. In the stage of distributing positive and negative samples, the model adopts a strict one-to-one corresponding matching strategy, so that the occurrence of false positive cases is reduced, and the model can predict the non-category time sequence boundary in a sparse and efficient manner.
4) Testing phase
The test set is preprocessed, like training data, after being framed, the test set is compressed and transformed into a size of 224 × 224, and RGB feature extraction is performed by using a backbone network based on ResNet 50. The test metrics used differ across the data sets of different tasks: f1 scores of different relative distances (Rel. Dis.) between the sub-action non-category timing boundary and the event non-category timing boundary are used as indexes, and AP and M are adopted as the non-category timing boundary of the scene leveliouAs an index. The F1 score is calculated based on the recall (recall) and precision (precision),the recall rate refers to the proportion of the correctly predicted sample number to the total true value number, the precision rate refers to the proportion of the correctly predicted sample number to all the samples predicted to be true values, and the samples with prediction errors within the relative distance can be calculated as correctly predicted. The AP (average precision) is calculated based on the average precision value corresponding to the recall value from 0 to 1. MiouIs the cross-ratio weighted sum of the distance of the predicted boundary and the true scene length. AP and MiouThe criteria require that the prediction result must be consistent with the true position to calculate a correct prediction.
On three randomly selected videos from the MovieNet data set, as shown in FIG. 6, the Temporal driver has 7 times faster scene reasoning speed per second and nearly 200 times smaller floating point number calculation times than that of the LGSS in the classical work, and embodies the advantages of model sparse prediction and no need of a post-processing module; compared with a Transformer variant, the Temporal driver also has higher scene reasoning speed per second and fewer floating point number operation times, and shows that the feature compression is in a modulo modeThe effect of light weight is more proof of the high efficiency of the Temporal driver. In the aspect of prediction precision, the Temporal driver is compared with the previous work, great improvement is obtained on all indexes of all data sets, and the universality and the generalization of the model are reflected: as shown in FIG. 7, on the MovieSenes dataset, Temporal drivers are at AP and MiouThe indexes of the method exceed that of LGSS in classical work by nearly 3 percent; as shown in FIG. 8, in the Kinetics-GEBD and TAPOS datasets, the performance of the Temporal driver is higher than that of the previous-of-the-art working PC based on all indexes of f1@ Rel.Dis., wherein in the Kinetics-GEBD dataset, the performance of the Temporal driver is 12.6% higher than that of the previous-of-the-art working PC based on the index of f1@ 0.05; on the TAPOS data set, based on the average f1 score, the performance of the Temporal driver is 9% higher than that of the PC in the former state-of-the-art work, and the characteristic that the prediction of the Temporal driver is more flexible and accurate is shown. More specific prediction visualization on the MovieNet dataset as shown in fig. 9, the prediction of Temporal persistence avoids false positives near the true values, accurately predicting the true value of each scene.
Claims (8)
1. The time sequence boundary detection method is characterized in that a non-category time sequence boundary detection network is constructed to carry out time sequence boundary detection on a video, the detection network comprises a backbone network and a detection model, and the implementation mode is as follows:
1) generating a detection sample by a backbone network: sampling video interval to obtain video image sequenceGenerating a video segment from each frame, wherein the ith video segment is formed by the ith frame image fiBy generating video features for an incoming video segment by a backbone networkAnd continuous scoringFiAnd SiRespectively scoring the RGB characteristics and continuity of the video segment i;
2) performing non-category time sequence action detection by a detection model based on the video characteristics F and the continuity score S, wherein the detection model comprises the following configurations:
2.1) encoder: the encoder E comprises NeThe system comprises a plurality of layers of transformation decoding layers which are connected in series, wherein each layer comprises a multi-head self-attention layer, a multi-head cross-attention layer and a linear mapping layer, the self-attention layer, the cross-attention layer and the linear mapping layer are respectively provided with a residual error structure, and M hidden characteristic query quantities Q are introduced into a codereThe video features F are sorted in a descending order based on the continuity scoring S and then input into an encoder, and the encoder compresses the sorted video features into compression features H of M frames and initial compression features H00, transform decoding layer at j, hidden feature query quantity QeAnd the compression characteristic H of the current layerjAdding, interacting with video features of cross attention layer and reordering through self attention layer and residual structure thereof, and transforming through residual structure-linear mapping layer-residual structure to obtain compression feature Hj+1,j∈[0,(Ne-1)]By stacking NeAfter each coding layer, realizing the compression and coding of the input characteristics to obtain the compression characteristics
The generation of the hidden feature query quantity comprises the following steps: hidden feature query quantity QeIs divided into MbNumber of boundary queries and McThe context query quantity is initialized randomly and is generated along with the learning of a training sample in the process of training a detection model; the boundary query quantity corresponds to the boundary region feature in the video feature, the context query quantity corresponds to the context region feature in the video feature, and M is before and after reordering in the video featurebOne feature is a boundary region feature, and the other feature is a context feature;
2.2) decoder: the decoder D comprises NdLayers of serial decoding, each layer including a multi-head self-attention layer, a multi-head cross-attention layer and a linear mapping layerThe intention layer, the cross attention layer and the linear mapping layer are respectively provided with a residual error structure; for the compression characteristic H obtained by the encoder, the decoder carries out time sequence boundary point analysis by transforming the decoder structure, and the decoder defines NpIndividual nomination query quantity QdNumber of nomination queries QdLike the hidden feature query quantity, training and learning generation are carried out after random initialization, and a boundary nomination B is initialized00, in the j-th layer, the nomination query quantity QdAnd boundary nomination BjAdding, interacting with the compression characteristic H at the cross attention layer through the self-attention layer and the primary residual error structure, and obtaining an updated boundary nomination B through the transformation of the residual error structure-linear mapping layer-residual error structurej+1(ii) a By stacking of NdAfter decoding layer, realizing analysis of compression characteristics to obtain timing sequence boundary nomination representation
2.3) generation and scoring of time-series class-free boundaries: for the obtained timing boundary nomination representation B, two different fully-connected layer branches are fed: the system comprises a positioning branch and a classification branch, wherein the two branches are respectively used for outputting time sequence class boundary-free time and confidence score;
2.4) assigning training labels: a strict one-to-one training label matching strategy is adopted: obtaining a group of optimal one-to-one matching by using a Hungarian algorithm according to a defined matching cost C, wherein each prediction which is allocated to a non-category boundary true value obtains a positive sample label, and the corresponding boundary true value is a training target; the matching cost C consists of a position cost and a classification cost, the position cost is defined on the basis of the absolute value of the distance between the prediction time and the boundary true value time, and the classification cost is defined on the basis of the prediction confidence coefficient;
2.5) time-sequential class-boundary-free submission: after a series of time sequence class-free boundaries are generated, screening out the most credible time sequence class-free boundary time through a confidence score threshold gamma, and submitting the time sequence class-free boundary time for subsequent performance measurement;
3) a training stage: training the configured model by adopting a training sample, using cross entropy, L1 distance and log function as loss functions, using an AdamW optimizer, updating network parameters by a back propagation algorithm, and continuously repeating the step 1) and the step 2) until the iteration times are reached;
4) and (3) detection: inputting the video characteristic sequence and continuity score of the data to be tested into the trained detection model, generating time sequence class-free boundary time and score, and obtaining the time sequence class-free boundary time sequence for performance measurement by the method of 2.3).
2. The timing boundary detection method of claim 1, wherein the hidden feature query quantity is aligned with the video features when training the detection model: the boundary query quantity is aligned with the boundary region features of the video features through an alignment loss function, the alignment loss function is calculated based on the last layer of cross attention map, the diagonal attention weight formed by the corresponding relation between the boundary query quantity and the boundary region features is taken and the negative logarithm is taken by utilizing the condition that the boundary query quantity is consistent with the boundary region features, the value of the alignment loss function is obtained, the aim of maximizing the diagonal attention weight is achieved by minimizing the alignment loss function, and the boundary query quantity is guaranteed to correspondingly extract the boundary region features in the cross attention.
3. The timing boundary detection method according to claim 1 or 2, wherein the backbone network performs sampling generation of the video based on the ResNet50 and the timing convolutional layer, samples all picture frames corresponding to each video at intervals of τ frames to form a video image sequence, and samples the video image sequenceIs divided into NfA video segment of length 2k frames, NfThe length of the video image sequence and the number of the video segments, the ith video segment is the image sequence formed by the k frames before and after the ith frame imageImage sequence Ls,iFeeding in the backboneThe network outputs the pre-trained and fine-tuned parameters of the convolution layer, the pooling layer and the full-connection layer to obtain RGB characteristics FiAnd a continuity score SiThe characteristics and the scores of different video segments are spliced according to the time sequence respectively to obtain the characteristics of the whole videoAnd continuous scoringWhere the sampling interval τ represents the level of granularity of the temporal division performed globally and the size of the video segment length 2k represents the local receptive field extent of the feature.
4. The method as claimed in claim 3, wherein a densiflow library is used to perform frame extraction on all videos to obtain picture frames, sampling is performed at intervals of τ frames, τ is taken as 3, and the following processing is performed: invoking a transform packet of the torchvision library, zooming each picture frame to 224 × 224, converting the picture frame into a tensor form, and finally normalizing to obtain a video image sequence Lf(ii) a Traversing the image sequence of the whole video, taking k frames before and after each frame, and taking k as 5 to obtain NfExtracting the characteristics of each video segment by using the stacked space convolution layer, the maximum pooling layer and the time sequence convolution layer; and obtaining a continuity score through the full connection layer, and splicing the characteristics and the score according to a time sequence respectively to obtain a video-level characteristic F and a continuity score S.
5. The timing boundary detection method according to claim 1 or 2, wherein in the configuration of step 2), the convolutional layer of the backbone network is composed of a convolution operation, a batch normalization operation, and a ReLU activation function, and the encoder is a transform decoder structure used as an encoder and the decoder is also a transform decoder structure.
6. The timing boundary detection method according to claim 1 or 2, wherein the number of encoder layers N iseTake 6, at layer j, compress feature HjAs a value parameter, a hidden feature query quantity QeAnd compression characteristic HjAdding the values of the parameters to be input into a self-attention module with 8 branches as keys and query parameters, forming an attention matrix through the self multiplication and Softmax normalization of the keys and the query parameters, multiplying the attention matrix by the value parameters to obtain output, and adding the output to H in a residual error structurejAdding and normalizing to obtain Hj′;Hj' Sum implicit feature query quantity QeAdding the video characteristics F as query parameters, reordering the video characteristics F in a descending order based on the continuity score S to obtain value parameters, adding the video characteristics F and the reordered position codes thereof as key parameters, inputting the key parameters into a cross attention module, forming an attention matrix in the module by cross multiplication of keys and the query parameters and Softmax normalization, multiplying the attention matrix by the value parameters, and carrying out residual structure and H'jAdding the normalization, linear mapping and residual error structure again to obtain updated compression characteristics Hj+1。
7. The timing boundary detection method according to claim 1 or 2, wherein the number of decoder layers NdGet 6, at the j-th layer, the boundary nomination BjAs a value parameter, nomination query quantity QdAnd boundary nomination BjAdding the values of the parameters to form a key and query parameters, inputting 8 branches of self-attention modules as the key and the query parameters, forming an attention matrix through the self multiplication and Softmax normalization of the key and the query parameters, multiplying the attention matrix by the value parameters to obtain output, and adding B in a residual error structurejAdding and normalizing to obtain Bj′;Bj' and nomination query quantity QdAdding the compressed characteristic H obtained in 2.1) as a value parameter as a query parameter and the hidden characteristic query quantity Q as a position code after compressioneAdding the values as key parameters, inputting the key parameters into a cross attention module, forming an attention matrix in the module through cross multiplication of the key and query parameters and Softmax normalization, multiplying the attention matrix by value parameters, and performing residual structure and Bj' addition normalization to get updated boundary nomination Bj+1。
8. Timing perceptron characterized by having a computer storage medium having a computer program embodied therein, said computer program for implementing the classless timing boundary detection network of claims 1-7, said computer program when executed implementing the timing boundary detection method of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111615241.3A CN114494314A (en) | 2021-12-27 | 2021-12-27 | Timing boundary detection method and timing sensor |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111615241.3A CN114494314A (en) | 2021-12-27 | 2021-12-27 | Timing boundary detection method and timing sensor |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114494314A true CN114494314A (en) | 2022-05-13 |
Family
ID=81496206
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111615241.3A Pending CN114494314A (en) | 2021-12-27 | 2021-12-27 | Timing boundary detection method and timing sensor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114494314A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115062603A (en) * | 2022-05-20 | 2022-09-16 | 中国科学院自动化研究所 | Alignment enhancement semantic parsing method, alignment enhancement semantic parsing device and computer program product |
CN115965944A (en) * | 2023-03-09 | 2023-04-14 | 安徽蔚来智驾科技有限公司 | Target information detection method, device, driving device, and medium |
CN116128043A (en) * | 2023-04-17 | 2023-05-16 | 中国科学技术大学 | Training method of video scene boundary detection model and scene boundary detection method |
CN117349610A (en) * | 2023-12-04 | 2024-01-05 | 西南石油大学 | Fracturing operation multi-time-step pressure prediction method based on time sequence model |
-
2021
- 2021-12-27 CN CN202111615241.3A patent/CN114494314A/en active Pending
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115062603A (en) * | 2022-05-20 | 2022-09-16 | 中国科学院自动化研究所 | Alignment enhancement semantic parsing method, alignment enhancement semantic parsing device and computer program product |
CN115965944A (en) * | 2023-03-09 | 2023-04-14 | 安徽蔚来智驾科技有限公司 | Target information detection method, device, driving device, and medium |
CN115965944B (en) * | 2023-03-09 | 2023-05-09 | 安徽蔚来智驾科技有限公司 | Target information detection method, device, driving device and medium |
CN116128043A (en) * | 2023-04-17 | 2023-05-16 | 中国科学技术大学 | Training method of video scene boundary detection model and scene boundary detection method |
CN117349610A (en) * | 2023-12-04 | 2024-01-05 | 西南石油大学 | Fracturing operation multi-time-step pressure prediction method based on time sequence model |
CN117349610B (en) * | 2023-12-04 | 2024-02-09 | 西南石油大学 | Fracturing operation multi-time-step pressure prediction method based on time sequence model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114494314A (en) | Timing boundary detection method and timing sensor | |
CN111985369B (en) | Course field multi-modal document classification method based on cross-modal attention convolution neural network | |
CN110322446B (en) | Domain self-adaptive semantic segmentation method based on similarity space alignment | |
CN108280187B (en) | Hierarchical image retrieval method based on depth features of convolutional neural network | |
Richard et al. | A bag-of-words equivalent recurrent neural network for action recognition | |
CN110909673A (en) | Pedestrian re-identification method based on natural language description | |
CN114743020B (en) | Food identification method combining label semantic embedding and attention fusion | |
CN110188827A (en) | A kind of scene recognition method based on convolutional neural networks and recurrence autocoder model | |
CN114911958B (en) | Semantic preference-based rapid image retrieval method | |
CN114329031B (en) | Fine-granularity bird image retrieval method based on graph neural network and deep hash | |
CN114780767A (en) | Large-scale image retrieval method and system based on deep convolutional neural network | |
CN114491115B (en) | Multi-model fusion integrated image retrieval method based on deep hash | |
CN116341558A (en) | Multi-modal emotion recognition method and model based on multi-level graph neural network | |
CN114821379B (en) | Direct time sequence action detection method based on relaxation transformation decoder | |
CN110516640B (en) | Vehicle re-identification method based on feature pyramid joint representation | |
CN117390506A (en) | Ship path classification method based on grid coding and textRCNN | |
CN115527064A (en) | Toxic mushroom fine-grained image classification method based on multi-stage ViT and contrast learning | |
CN116089646A (en) | Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism | |
Le et al. | Btel: A binary tree encoding approach for visual localization | |
Seddik et al. | Lightweight neural networks from pca & lda based distilled dense neural networks | |
CN115050032A (en) | Domain-adaptive text image recognition method based on feature alignment and entropy regularization | |
Kumar et al. | Analysis and fast feature selection technique for real-time face detection materials using modified region optimized convolutional neural network | |
Zhao et al. | Lightweight quality evaluation of generated samples and generative models | |
CN118467768B (en) | Rapid image retrieval method and system based on large-model advanced semantic graph embedding | |
CN113902930B (en) | Image classification method for optimizing bag-of-words model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |