CN113255570A - Sequential action detection method for sensing video clip relation
- Publication number: CN113255570A
- Application number: CN202110659154.1A (also published as CN113255570B)
- Authority: CN (China)
- Prior art keywords: video, global, video segments, candidate, representing
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V20/40—Scenes; Scene-specific elements in video content
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06N3/045—Combinations of networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/08—Learning methods
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
Abstract
The invention relates to the field of video understanding, and in particular to a temporal action detection method that perceives the relations between video segments, comprising the following steps: step S1: sampling a video; step S2: performing initial feature extraction on the video; step S3: enhancing the extracted features, generating the boundary predictions of the temporal nodes, and extracting the features of all candidate video segments; step S4: capturing the relations between the candidate video segment features; step S5: combining the prediction results of step S3 and step S4 to generate the final confidence score; step S6: removing redundant candidate video segments; step S7: classifying the candidate video segments to obtain their class information. By capturing the global and local relations between candidate video segments, more effective segment features are generated, which in turn yield more accurate prediction results.
Description
Technical Field
The invention relates to the field of video understanding, and in particular to a temporal action detection method that perceives the relations between video segments.
Background
In recent years, with the continuous development of streaming media, the number of videos on all kinds of website platforms has grown explosively. Compared with traditional image data, videos carry much richer information and have attracted increasing attention from researchers, so video understanding has gradually become a popular research field in both industry and academia. The temporal action detection task is an important branch of video understanding: its goal is to detect the temporal boundary of every action instance in a long video and to determine the category of the action, which helps users locate the action information in videos more conveniently and quickly.
The temporal action detection task can help people quickly identify the key content of a long video, can serve as a preprocessing step for problems such as video understanding and human-computer interaction, and is widely applied in practice. (1) Video surveillance: with the rapid development of the Internet of Things, cameras are already deployed in roads, schools and many other public places and play an important security role, but they also produce an extremely large amount of video data; analysing it purely by hand is clearly unrealistic, whereas temporal action detection can quickly extract the useful information from surveillance video and free up a large amount of human labour. (2) Video retrieval: with the popularity of video on social platforms, ordinary users upload and share all kinds of video data; to recommend videos of interest to users, every uploaded video must be classified and labelled, and doing this manually is very costly, whereas temporal action detection can detect the actions in a video, perform a preliminary classification, sorting and labelling according to predefined categories, and feed the results into subsequent video retrieval and video recommendation algorithms.
Because of the wide application of the temporal action detection task in industry and academia, many effective algorithms have been proposed. They can currently be divided roughly into two types, one-stage and two-stage:
1) one-stage: video segments of different lengths are detected directly, for example with hierarchical or cascaded structures, and the category of each segment is predicted in the same pass;
2) two-stage: all possible candidate video segments are extracted first, and then a classical classifier (such as Unet) is used to classify the candidates and obtain their class information.
Although the two-stage methods have achieved good results, they generate each candidate video segment individually and ignore the relationships between video segments.
Disclosure of Invention
To address the above problem, the present invention provides a temporal action detection method that perceives the relations between video segments: by capturing the global and local relations between candidate video segments, it generates more effective segment features and therefore more accurate prediction results.
In order to solve the above technical problem, the invention adopts the following technical solution:
a temporal action detection method for perceiving the relations between video segments, comprising the following steps:
step S1: sampling a video;
step S2: performing initial feature extraction on the video with a TSN (Temporal Segment Network) model to obtain the video features;
step S3: using a BaseNet model to enhance the extracted features and generate the boundary predictions of the temporal nodes; in addition, extracting the features of all candidate video segments, each element of the resulting candidate feature map representing the feature of one candidate video segment;
step S4: capturing the relations between the candidate video segment features with a global perception module and a feature enhancement module;
step S5: combining the prediction results of step S3 and step S4 to generate the final confidence score;
step S6: removing redundant candidate video segments with a Soft-NMS model;
step S7: classifying the candidate video segments with a Unet classifier to obtain their class information.
Further, the step S2 specifically includes the following steps:
step S21: firstly, sampling the long video at a fixed time interval to obtain a certain number of video snippets;
step S22: feeding the snippets into the TSN model to obtain the visual features and the motion features respectively, and concatenating the two, as sketched below.
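As an illustration of this step, the following sketch extracts per-snippet appearance and motion features with two pretrained streams and concatenates them; `rgb_model` and `flow_model` are placeholder callables standing in for the two TSN streams, and the shapes are assumptions rather than values fixed by the invention.

```python
import numpy as np

def extract_snippet_features(rgb_snippets, flow_snippets, rgb_model, flow_model):
    """Return a (T, 2C) array: per-snippet appearance and motion features, concatenated."""
    appearance = np.stack([rgb_model(s) for s in rgb_snippets])   # (T, C) visual features
    motion = np.stack([flow_model(s) for s in flow_snippets])     # (T, C) motion features
    return np.concatenate([appearance, motion], axis=1)           # (T, 2C) fused snippet features
```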
Further, the step S3 specifically includes the following steps:
step S31: establishing the relations among all video frames with graph convolution and dynamically fusing multi-scale context semantic information into the video features;
step S32: predicting each temporal position with graph convolution, outputting the probability that the position is a start node or an end node, i.e. generating the boundary predictions of the temporal nodes; in addition, extracting and outputting the features of all candidate video segments (see the sketch below).
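A minimal sketch of these two outputs is given below; it omits the graph-convolution enhancement itself and assumes that each candidate segment's feature is obtained by mean-pooling the snippet features it covers. The class name and layout are illustrative, not the patent's BaseNet.

```python
import torch
import torch.nn as nn

class BoundaryAndProposalHeads(nn.Module):
    def __init__(self, channels: int, max_duration: int):
        super().__init__()
        self.max_duration = max_duration
        self.boundary = nn.Conv1d(channels, 2, kernel_size=3, padding=1)  # start & end logits

    def forward(self, feats):                      # feats: (B, C, T) enhanced snippet features
        b, c, t = feats.shape
        p_start, p_end = torch.sigmoid(self.boundary(feats)).unbind(dim=1)  # (B, T) each
        # candidate feature map: entry (d, s) holds the pooled feature of the segment
        # starting at snippet s with duration d+1 (left at zero where it would run past the end)
        cand = feats.new_zeros(b, c, self.max_duration, t)
        for d in range(self.max_duration):
            for s in range(t - d):
                cand[:, :, d, s] = feats[:, :, s:s + d + 1].mean(dim=2)
        return p_start, p_end, cand
```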
Further, the global perception module is specifically as follows:
a global perception unit is designed to establish the relations between candidate video segments lying in the same row or the same column of the candidate feature map; the input feature map is fed into two parallel paths, one containing a horizontal pooling layer and the other a vertical pooling layer;
in the pooling formulas, the feature map is indexed by channel, start time and duration, so that each entry gives, for a given channel, the value of the candidate video segment with that start time and duration; the horizontal pooling aggregates over all possible start times in the video, while the vertical pooling aggregates over all possible durations of the video segments;
then a one-dimensional convolution with kernel size 3 aggregates the information of each position and its neighbours, and the outputs of the two paths are fused into a single fusion result;
the output of the global perception unit is obtained from the input of the unit and the fusion result through a convolution operation and an activation function;
the global perception unit is repeated twice, together with a convolution operation, to obtain the global perception module.
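The following sketch shows one way such a global perception unit could be realised, assuming average pooling along each axis, a sigmoid gate applied to the fused result, and an illustrative assignment of the duration and start-time axes; all class and variable names are placeholders.

```python
import torch
import torch.nn as nn

class GlobalAwareUnit(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1-D convolutions (kernel 3) that aggregate each position with its neighbours
        self.conv_h = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv_v = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                        # x: (B, C, D, T) candidate feature map
        h = x.mean(dim=2)                        # horizontal path: pool over the duration axis -> (B, C, T)
        v = x.mean(dim=3)                        # vertical path: pool over the start-time axis -> (B, C, D)
        h = self.conv_h(h).unsqueeze(2)          # (B, C, 1, T)
        v = self.conv_v(v).unsqueeze(3)          # (B, C, D, 1)
        gate = torch.sigmoid(self.fuse(h + v))   # fuse the two paths, then apply the activation
        return x * gate                          # re-weight the input feature map

class GlobalAwareModule(nn.Module):
    """Two stacked global perception units followed by a convolution, as described above."""
    def __init__(self, channels: int):
        super().__init__()
        self.unit1 = GlobalAwareUnit(channels)
        self.unit2 = GlobalAwareUnit(channels)
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return self.out_conv(self.unit2(self.unit1(x)))
```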
Further, the feature enhancement module is specifically as follows:
a hierarchical structure is used to capture local information between candidate video segments; for the input candidate feature map, downsampling (average pooling) is first used to aggregate the features of adjacent candidate segments, the bottom layer of the hierarchy being the input itself;
because average pooling can only aggregate the relations between adjacent candidate segments and cannot capture long-range dependencies, the global perception module is embedded into the feature enhancement module, and all layers of the feature enhancement module share the same global perception module; the output of each layer is obtained by applying this shared global perception module to that layer's downsampled features;
then an upsampling operation is used to aggregate features between different layers: the fused output of a layer combines that layer's global-perception output with the upsampled output of the layer above it.
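A compact sketch of this hierarchical scheme is given below; it assumes average-pool downsampling by a factor of two per level, nearest-neighbour upsampling for the cross-level fusion, and three levels, and it takes the shared global perception module (for example the `GlobalAwareModule` sketched above) as a constructor argument.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureEnhanceModule(nn.Module):
    """Hierarchical local aggregation with a shared global perception module."""
    def __init__(self, shared_ga: nn.Module, num_levels: int = 3):
        super().__init__()
        self.ga = shared_ga            # the same module is reused at every level
        self.num_levels = num_levels

    def forward(self, x):              # x: (B, C, D, T) candidate feature map
        # downsample with average pooling so that adjacent candidates are aggregated
        feats = [x]
        for _ in range(1, self.num_levels):
            feats.append(F.avg_pool2d(feats[-1], kernel_size=2))
        # every level goes through the same (shared) global perception module
        feats = [self.ga(f) for f in feats]
        # upsample each deeper level and fuse it with the level below it
        for lvl in range(self.num_levels - 2, -1, -1):
            up = F.interpolate(feats[lvl + 1], size=feats[lvl].shape[-2:], mode="nearest")
            feats[lvl] = feats[lvl] + up
        return feats                    # list of per-level enhanced candidate feature maps
```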
Further, the steps S4 to S5 specifically include the following steps:
step S41: using average pooling to aggregate the relations between adjacent candidate video segments;
step S42: feeding each level of the hierarchy into the shared global perception module;
step S43: aggregating the features output by the global perception module at adjacent levels;
step S44: preliminarily obtaining the prediction score of each candidate video segment through a shared convolution operation.
Further, the step S6 specifically includes the following steps:
step S61: generating the scores of all candidate video segments and sorting them by magnitude;
step S62: selecting the candidate with the highest score, finding the candidates that overlap with it heavily, attenuating their scores, and repeating this process until a specified number of candidate video segments is retained.
Compared with the prior art, the invention has the following beneficial effects. The relations between video segments are captured from two different angles (local information and global information): for the global information, pooled features are used to establish relations between distant candidate segments from a global viewpoint; for the local information, owing to the distribution characteristics of the candidate segments, a hierarchical structure is used to aggregate the information between adjacent candidate segments. Because the local and global information complement each other, the whole model achieves better detection results.
If candidate video segments are predicted separately, the constraints between them are ignored; by exploring the relations between candidate video segments, more complete and accurate results can be generated. Different candidate segments within the same video are often highly correlated, so establishing the relations between all candidate segments allows the background information to enhance the features of the action instances; adjacent candidate segments overlap heavily, so aggregating the useful information between them with average pooling produces more accurate results.
Drawings
Fig. 1 is a flowchart of Embodiment 1.
Detailed Description
The invention will be further described with reference to the accompanying drawings. Embodiments of the present invention include, but are not limited to, the following examples.
Example 1
In this embodiment, the system mainly includes a global perception module and a feature enhancement module, where:
in the global perception module, GA (global perception) units are designed to establish the relations between candidate video segments in the same row and the same column of the candidate feature map. The input is fed into two parallel paths, one containing a horizontal pooling layer and the other a vertical pooling layer; in the pooling formulas the feature map is indexed by channel, start time and duration, so each entry gives, for a given channel, the value of the candidate video segment with that start time and duration, the horizontal pooling aggregating over all possible start times in the video and the vertical pooling aggregating over all possible durations. A one-dimensional convolution with kernel size 3 then aggregates the information of each position and its neighbours, and the outputs of the two paths are fused into a single fusion result. The output of the GA unit is obtained from the input of the unit and the fusion result through a convolution operation and an activation function, and repeating the GA unit twice, together with a convolution operation, yields the global perception module;
in addition, in the feature enhancement module, a hierarchical structure is used to capture local information between candidate video segments. For the input candidate feature map, downsampling (average pooling) is first used to aggregate the features of adjacent candidate segments, the bottom layer of the hierarchy being the input itself.
Because average pooling can only aggregate the relations between adjacent candidate segments and cannot capture long-range dependencies, the global perception module is embedded into the feature enhancement module, all layers of the feature enhancement module share the same global perception module, and the output of each layer is obtained by applying this shared module to that layer's downsampled features.
In this way, the relations between all candidate video segments are established at two complementary levels (local and global). An upsampling operation is then used to aggregate features across layers; since every layer is supervised by label information, fusing the information of different layers keeps the introduced noise to a minimum.
The relations between different candidate video segments are thus captured at different levels and different scales.
Based on the above, the temporal action detection method for perceiving the relations between video segments, as shown in Fig. 1, comprises the following steps:
step S1: sampling a video;
A suitable training and test dataset is selected; training and testing are mainly carried out on the public datasets ActivityNet-1.3 and THUMOS-14.
The ActivityNet-1.3 dataset is a public dataset for video segment generation and detection. It contains 19,994 videos covering 200 action classes, mainly crawled from YouTube, with varying resolutions and durations; it served as the competition dataset of the ActivityNet Challenge in 2016 and 2017 and divides all videos into training, validation and test sets at a ratio of 2:1:1.
The THUMOS-14 dataset contains 413 videos covering 20 classes; the test set contains 212 videos and the validation set contains 200 videos used for the temporal action detection task. The whole model is trained on the validation set and its performance is evaluated on the test set.
Step S2: performing initial feature extraction on the video with the TSN (Temporal Segment Network) model to obtain the video features.
First, for an untrimmed long video, the corresponding video frames are extracted, and the video is represented as a frame sequence whose length is the total number of video frames. The annotation set of the video consists of the action instances it contains, each annotation giving the start time, end time and class information of one action instance; these annotations are used only during training. The TSN model is then used to extract the features of each video.
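As a concrete illustration of this annotation format, the sketch below defines one possible data structure for a video and its labelled action instances; the field names and the example values are illustrative and are not taken from the datasets.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ActionInstance:
    start: float   # start time of the action (seconds)
    end: float     # end time of the action (seconds)
    label: str     # action class

@dataclass
class VideoAnnotation:
    video_id: str
    duration: float
    instances: List[ActionInstance]   # the labelled action segments of this video

ann = VideoAnnotation(
    video_id="v_example",             # placeholder id
    duration=120.0,
    instances=[ActionInstance(start=12.4, end=30.1, label="LongJump")],
)
```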
Step S3: enhancing the extracted features with the BaseNet model to obtain the boundary predictions of the temporal nodes; in addition, the features of all candidate video segments are extracted and output.
Graph convolution (GCN) is used here to enhance the features and obtain features carrying richer semantic information.
The enhanced features are then shared by two branch networks: one branch determines, for every position along the video feature sequence, whether it is a start node or an end node, and the other branch outputs the features of all candidate video segments.
Step S4: capturing the relations between the candidate video segment features with the global perception module and the feature enhancement module.
For the output of step S3, each position of the candidate feature map represents the feature of one candidate video segment. Feeding this map into the feature enhancement module captures the relations between candidate video segments from both the local and the global view, enhances the features of the action instances and suppresses the background information, so that more accurate and more complete results are generated;
the module outputs two types of prediction results, obtained under regression supervision and classification supervision respectively, each computed from the enhanced features through a convolution operation followed by an activation function.
Step S5: combining the prediction results of step S3 and step S4 to generate a final judgment score;
in order to fully utilize the output result of the whole model, each layer of output in the feature enhancement module is fused into the final score of the candidate video segment, and in addition, the boundary information of each video segment is considered, so that the output of each layer of output in the feature enhancement module is combined into the final score of the candidate video segmentToVideo segment of (1), score thereofThe calculation formula of (a) is as follows:
wherein,,andall the ingredients are super-ginseng,is shown asThe output of the layer(s) is,the total number of layers of the hierarchical result is indicated.
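A hedged sketch of this score fusion is shown below; the multiplicative combination of the boundary probabilities with a weighted sum of per-level scores is an assumption consistent with the description above, and the weights are illustrative hyperparameters.

```python
def fuse_candidate_score(p_start, p_end, level_scores, level_weights):
    """Final confidence of one candidate: boundary probabilities times the weighted
    combination of the per-level scores from the feature enhancement module."""
    fused = sum(w * s for w, s in zip(level_weights, level_scores))
    return p_start * p_end * fused

# example usage with illustrative values
score = fuse_candidate_score(0.9, 0.8, level_scores=[0.7, 0.6, 0.5], level_weights=[0.5, 0.3, 0.2])
```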
Step S6: removing redundant candidate video segments with the Soft-NMS model.
After all possible candidate segments have been obtained, most of them overlap heavily, so Soft-NMS is used to suppress the redundant ones: the scores of the candidate video segments are sorted by magnitude, the candidate with the largest score is selected, its IoU with every other segment is computed, and the scores of the segments with high overlap are attenuated by a Gaussian decay,
where the decay is controlled by a parameter of the Gaussian function and is applied only when the IoU between the currently selected candidate segment and another segment exceeds a predefined threshold.
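The following is a minimal Soft-NMS sketch with Gaussian decay along the lines described above; the Gaussian parameter `sigma`, the IoU threshold and the retained number `top_k` are illustrative placeholders rather than the values used in the embodiment.

```python
import numpy as np

def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(segments, scores, sigma=0.5, iou_thresh=0.65, top_k=100):
    segments, scores = list(segments), list(scores)
    keep_segments, keep_scores = [], []
    while segments and len(keep_segments) < top_k:
        best = int(np.argmax(scores))                     # highest-scoring remaining candidate
        seg, sc = segments.pop(best), scores.pop(best)
        keep_segments.append(seg)
        keep_scores.append(sc)
        for i, other in enumerate(segments):
            iou = temporal_iou(seg, other)
            if iou > iou_thresh:                          # only decay highly-overlapping segments
                scores[i] *= np.exp(-(iou ** 2) / sigma)  # Gaussian decay of the score
    return keep_segments, keep_scores
```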
Step S7: and classifying the candidate video segments by using a Unet classifier to obtain the class information of the candidate video segments.
After all possible candidate video segments have been obtained, the Unet classifier is used to classify them and obtain the final class information; the result can be expressed as a set of predicted action instances, each consisting of its class information and the corresponding score, the size of the set being the number of predicted action instances.
Example 2
In this embodiment, the overall model of Embodiment 1 is trained. The overall loss function is the sum of a boundary loss and a hierarchy loss, balanced by a hyperparameter.
The boundary loss determines whether each temporal node is a start node or an end node: it is the sum of two weighted cross-entropy losses, one between the start-node predictions and their labels and one between the end-node predictions and their labels.
To train the feature enhancement module, a label map is first generated for layer 0; since the feature enhancement module is hierarchical and every layer is supervised, a label map is then generated for each higher layer by downsampling the labels of the layer below.
The loss function of each layer therefore combines two terms: a squared-error (regression) loss between the regression prediction of that layer and its label, and a weighted cross-entropy (classification) loss between the classification prediction of that layer and its label. The hierarchy loss is the weighted sum of these per-layer losses over all layers, the weights being hyperparameters, and the total loss is optimized in an end-to-end fashion.
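A hedged sketch of such a training objective is given below; the exact weighting scheme of the patent is not reproduced here, so the positive/negative weighting in the cross-entropy term and the balance factor `lam` are assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_bce(pred, label):
    """Weighted binary cross-entropy; labels are float probabilities in [0, 1]."""
    label = label.float()
    pos = label.mean().clamp(1e-6, 1.0 - 1e-6)          # positive ratio, kept away from 0 and 1
    weight = torch.where(label > 0.5, 1.0 / pos, 1.0 / (1.0 - pos))
    return F.binary_cross_entropy(pred, label, weight=weight)

def total_loss(p_start, p_end, g_start, g_end, level_preds, level_labels, lam=1.0):
    # boundary loss: weighted cross-entropy on start and end predictions
    boundary = weighted_bce(p_start, g_start) + weighted_bce(p_end, g_end)
    # hierarchy loss: per-level regression (MSE) plus classification (weighted BCE) terms
    hierarchy = 0.0
    for (reg_pred, cls_pred), (reg_gt, cls_gt) in zip(level_preds, level_labels):
        hierarchy = hierarchy + F.mse_loss(reg_pred, reg_gt) + weighted_bce(cls_pred, cls_gt)
    return boundary + lam * hierarchy
```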
Example 3
In this embodiment, the effectiveness of the method is verified on the selected datasets, specifically as follows:
the method is evaluated on the selected datasets. To assess the effectiveness of this embodiment, mean average precision (mAP) is chosen as the main evaluation metric: on the THUMOS-14 dataset, mAP is computed at the tIoU thresholds {0.3, 0.4, 0.5, 0.6, 0.7}; on the ActivityNet-1.3 dataset, mAP is computed at the tIoU thresholds {0.5, 0.75, 0.95}, and in addition the average mAP over ten different tIoU thresholds is reported on ActivityNet-1.3.
This embodiment is validated on the currently mainstream ActivityNet-1.3 dataset; the final verification results are given in the following table (comparison of model performance on the ActivityNet-1.3 dataset, %).
Table 1. Comparison of model performance on the ActivityNet-1.3 dataset
This embodiment is also validated on the currently mainstream THUMOS-14 dataset; the final verification results are given in the following table (comparison of model performance on the THUMOS-14 dataset, %).
Table 2. Comparison of model performance on the THUMOS-14 dataset
The above are embodiments of the present invention. The specific parameters in the above embodiments and examples are only intended to clearly illustrate the inventors' verification process and are not intended to limit the scope of the invention, which is defined by the claims; all equivalent structural changes made using the contents of the specification and drawings of the present invention shall likewise fall within the scope of protection of the present invention.
Claims (7)
1. A temporal action detection method for perceiving the relations between video segments, characterized by comprising the following steps:
step S1: sampling a video;
step S2: performing initial feature extraction on the video with a TSN (Temporal Segment Network) model to obtain the video features;
step S3: using a BaseNet model to enhance the extracted features and generate the boundary predictions of the temporal nodes; in addition, extracting the features of all candidate video segments, each element of the resulting candidate feature map representing the feature of one candidate video segment;
step S4: capturing the relations between the candidate video segment features with a global perception module and a feature enhancement module;
step S5: combining the prediction results of step S3 and step S4 to generate the final confidence score;
step S6: removing redundant candidate video segments with a Soft-NMS model;
step S7: classifying the candidate video segments with a Unet classifier to obtain their class information.
2. The method according to claim 1, characterized in that step S2 specifically comprises the following steps:
step S21: firstly, sampling the long video at a fixed time interval to obtain a certain number of video snippets;
step S22: feeding the snippets into the TSN model to obtain the visual features and the motion features respectively, and concatenating the two.
3. The method according to claim 1, characterized in that step S3 specifically comprises the following steps:
step S31: establishing the relations among all video frames with graph convolution and dynamically fusing multi-scale context semantic information into the video features;
step S32: predicting each temporal position with graph convolution, outputting the probability that the position is a start node or an end node, i.e. generating the boundary predictions of the temporal nodes; in addition, extracting and outputting the features of all candidate video segments.
4. The method according to claim 1, characterized in that the global perception module is specifically as follows:
a global perception unit is designed to establish the relations between candidate video segments in the same row and the same column of the candidate feature map; the input is fed into two parallel paths, one containing a horizontal pooling layer and the other a vertical pooling layer, where each entry of the feature map gives, for a given channel, the value of the candidate video segment with a given start time and duration, the horizontal pooling aggregating over all possible start times in the video and the vertical pooling aggregating over all possible durations of the video segments;
then a one-dimensional convolution with kernel size 3 aggregates the information of each position and its neighbours, and the outputs of the two paths are fused into a single fusion result;
the output of the global perception unit is obtained from the input of the unit and the fusion result through a convolution operation and an activation function;
the global perception unit is repeated twice, together with a convolution operation, to obtain the global perception module.
5. The method according to claim 4, characterized in that the feature enhancement module is specifically as follows:
a hierarchical structure is used to capture local information between candidate video segments; for the input candidate feature map, downsampling is first used to aggregate the features of adjacent candidate segments, the bottom layer of the hierarchy being the input itself; the global perception module is embedded into the feature enhancement module, all layers of the feature enhancement module share the same global perception module, and the output of each layer is obtained by applying this shared global perception module to that layer's downsampled features;
then an upsampling operation is used to aggregate features between different layers, the fused output of each layer combining that layer's global perception output with the upsampled output of the layer above it.
6. The method according to claim 5, characterized in that steps S4 to S5 specifically comprise the following steps:
step S41: using average pooling to aggregate the relations between adjacent candidate video segments;
step S42: feeding each level of the hierarchy into the shared global perception module;
step S43: aggregating the features output by the global perception module at adjacent levels;
step S44: preliminarily obtaining the prediction score of each candidate video segment through a shared convolution operation.
7. The method according to claim 1, characterized in that step S6 specifically comprises the following steps:
step S61: generating the scores of all candidate video segments and sorting them by magnitude;
step S62: selecting the candidate with the highest score, finding the candidates that overlap with it heavily, attenuating their scores, and repeating this process until a specified number of candidate video segments is retained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110659154.1A CN113255570B (en) | 2021-06-15 | 2021-06-15 | Sequential action detection method for sensing video clip relation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110659154.1A CN113255570B (en) | 2021-06-15 | 2021-06-15 | Sequential action detection method for sensing video clip relation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113255570A (en) | 2021-08-13
CN113255570B (en) | 2021-09-24
Family ID: 77187848
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110659154.1A Active CN113255570B (en) | 2021-06-15 | 2021-06-15 | Sequential action detection method for sensing video clip relation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113255570B (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109241829A (en) * | 2018-07-25 | 2019-01-18 | 中国科学院自动化研究所 | The Activity recognition method and device of convolutional neural networks is paid attention to based on space-time |
US20180350144A1 (en) * | 2018-07-27 | 2018-12-06 | Yogesh Rathod | Generating, recording, simulating, displaying and sharing user related real world activities, actions, events, participations, transactions, status, experience, expressions, scenes, sharing, interactions with entities and associated plurality types of data in virtual world |
CN110705339A (en) * | 2019-04-15 | 2020-01-17 | 中国石油大学(华东) | C-C3D-based sign language identification method |
CN110765854A (en) * | 2019-09-12 | 2020-02-07 | 昆明理工大学 | Video motion recognition method |
CN111079594A (en) * | 2019-12-04 | 2020-04-28 | 成都考拉悠然科技有限公司 | Video action classification and identification method based on double-current cooperative network |
CN111372123A (en) * | 2020-03-03 | 2020-07-03 | 南京信息工程大学 | Video time sequence segment extraction method based on local to global |
CN111931602A (en) * | 2020-07-22 | 2020-11-13 | 北方工业大学 | Multi-stream segmented network human body action identification method and system based on attention mechanism |
CN112131943A (en) * | 2020-08-20 | 2020-12-25 | 深圳大学 | Video behavior identification method and system based on dual attention model |
CN112650886A (en) * | 2020-12-28 | 2021-04-13 | 电子科技大学 | Cross-modal video time retrieval method based on cross-modal dynamic convolution network |
CN112364852A (en) * | 2021-01-13 | 2021-02-12 | 成都考拉悠然科技有限公司 | Action video segment extraction method fusing global information |
Non-Patent Citations (4)
Title |
---|
HAISHENG SU et al.: "BSN++: Complementary Boundary Regressor with Scale-Balanced Relation Modeling for Temporal Action Proposal Generation", arXiv *
JINGRAN ZHANG et al.: "Cooperative Cross-Stream Network for Discriminative Action Representation", arXiv *
ZHIWU QING et al.: "Temporal Context Aggregation Network for Temporal Action Proposal Refinement", arXiv *
LI Yiying: "Research on video human action recognition based on deep learning", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113920467A (en) * | 2021-12-13 | 2022-01-11 | 成都考拉悠然科技有限公司 | Tourist and commercial detection method and system combining booth detection and scene segmentation |
CN113920467B (en) * | 2021-12-13 | 2022-03-15 | 成都考拉悠然科技有限公司 | Tourist and commercial detection method and system combining booth detection and scene segmentation |
Also Published As
Publication number | Publication date |
---|---|
CN113255570B (en) | 2021-09-24 |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |
 | GR01 | Patent grant |