CN113255570B - Sequential action detection method for sensing video clip relation - Google Patents
Sequential action detection method for sensing video clip relation
- Publication number
- CN113255570B · Application CN202110659154.1A
- Authority
- CN
- China
- Prior art keywords
- video
- global
- video segments
- candidate
- representing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
Abstract
The invention relates to the field of video understanding, and in particular to a temporal action detection method that perceives relations between video segments. The method comprises the following steps: step S1: sample a video; step S2: perform preliminary feature extraction on the video; step S3: perform feature enhancement on the extracted features to generate boundary predictions for the temporal nodes, and extract the features of all candidate video segments; step S4: capture the relations between the candidate video segment features; step S5: combine the prediction results of steps S3 and S4 to generate final evaluation scores; step S6: remove duplicate candidate video segments; step S7: classify the candidate video segments to obtain their category information. By capturing the global and local relations between candidate video segments, more effective video segment features are generated, and therefore more accurate prediction results are produced.
Description
Technical Field
The invention relates to the field of video understanding, and in particular to a temporal action detection method that perceives relations between video segments.
Background
In recent years, with the continuous development of streaming media, the number of videos on various website platforms has grown explosively. Compared with traditional image information, videos contain richer information and have attracted increasing attention from researchers, and video understanding has gradually become a popular research field in both industry and academia. Temporal action detection is an important branch of this field: its task is to detect the temporal boundary of every action instance in a long video and to determine the category of the action, which helps users locate action information in videos more conveniently and quickly.
The temporal action detection task helps people quickly identify the key content in long videos and can serve as a preprocessing step for problems such as video understanding and human-computer interaction; it is widely applied in real life. (1) Video surveillance: with the rapid development of the Internet of Things, cameras have spread across roads, schools and other public places and play an important security role, but they also generate extremely large amounts of video data. Analysing this data purely by hand is clearly unrealistic, whereas temporal action detection can quickly extract the useful information in surveillance video and free up a large amount of human labour. (2) Video retrieval: with the popularity of video on social software, ordinary users upload and share all kinds of video data. To recommend videos of interest to users, every uploaded video must be classified and labelled, and doing this manually is very costly. Temporal action detection can detect the actions in the videos, perform preliminary classification, sorting and labelling according to preset definitions, and then be integrated into subsequent video retrieval and video recommendation algorithms.
Because the temporal action detection task is widely used in industry and academia, many effective algorithms have been proposed. At present they can be roughly divided into two types, one-stage and two-stage:
1) one-stage: video segments of different lengths are detected directly using hierarchical structures, cascade structures and the like, and the category information of each video segment is given at the same time during prediction;
2) two-stage: all possible candidate video segments are first extracted, and then classical classifiers (such as Unet) are used to classify the candidate video segments and obtain their category information;
although two-stage methods have achieved good results, they generate each candidate video segment individually and ignore the relations between video segments.
Disclosure of Invention
Based on the above problems, the present invention provides a temporal action detection method that perceives relations between video segments; it generates more effective video segment features by capturing the global and local relations between them, thereby producing more accurate prediction results.
In order to solve the above technical problems, the technical solution adopted by the invention is as follows:
a temporal action detection method that perceives relations between video segments, comprising the following steps:
step S1: sampling a video;
step S2: perform preliminary feature extraction on the video using a TSN (Temporal Segment Network) model to obtain features F;
step S3: use a BaseNet model to perform feature enhancement on the extracted features F and generate the boundary predictions P_S and P_E of the temporal nodes; in addition, extract the features of all candidate video segments, denoted X_ACFM, where each element of X_ACFM represents the feature of one candidate video segment;
step S4: capture the relations between the candidate video segment features X_ACFM using a global perception module and a feature enhancement module;
step S5: combining the prediction results of step S3 and step S4 to generate a final judgment score;
step S6: removing repeated candidate video segments by using a Soft-NMS model;
step S7: and classifying the candidate video segments by using a Unet classifier to obtain the class information of the candidate video segments.
Further, the step S2 specifically includes the following steps:
step S21: first, sample the long video at a certain time interval to obtain a certain number of video snippets;
step S22: input the video snippets into the TSN model, obtain the visual features and the motion features respectively, and concatenate them.
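As a non-limiting illustrative sketch of steps S21–S22, the following Python code samples frames at a fixed interval and concatenates appearance and motion features from a two-stream TSN-style extractor. The extractor callables and channel sizes are hypothetical placeholders, not the patented implementation:

```python
# Illustrative sketch (assumption): sample a long video at a fixed temporal interval
# and concatenate appearance (RGB) and motion (optical-flow) features from a
# two-stream TSN-style extractor. `rgb_model` / `flow_model` are hypothetical
# feature extractors returning one feature vector per sampled snippet.
import numpy as np

def sample_snippets(num_frames: int, interval: int) -> np.ndarray:
    """Indices of frames sampled every `interval` frames from a long video."""
    return np.arange(0, num_frames, interval)

def extract_snippet_features(rgb_model, flow_model, rgb_frames, flow_frames):
    """Concatenate visual (RGB) and motion (flow) features per sampled snippet."""
    visual = rgb_model(rgb_frames)    # (T, C_rgb) appearance features
    motion = flow_model(flow_frames)  # (T, C_flow) motion features
    # the concatenated sequence plays the role of the feature F in the text
    return np.concatenate([visual, motion], axis=1)  # (T, C_rgb + C_flow)
```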
Further, the step S3 specifically includes the following steps:
step S31: establish relations among all video frames using graph convolution, and dynamically fuse multi-scale contextual semantic information into the video features;
step S32: predict each temporal position by graph convolution and output the probability that the position is a start node or an end node, i.e. generate the boundary predictions P_S and P_E of the temporal nodes; in addition, extract the features of all candidate video segments and output X_ACFM.
Further, the global perception module is specifically as follows:
a global perception unit is designed to establish relations between candidate video segments lying in the same row and the same column; for an input X, X is first fed into two parallel paths, one containing a horizontal pooling layer and the other a vertical pooling layer; the horizontal pooling is computed as
H_i^c = (1/D) Σ_{j=1}^{D} X_{i,j}^c
where H_i^c denotes the result of feeding X into the horizontal pooling, T denotes all possible start times in the video, c denotes the c-th channel, i denotes that the start time of the video segment is i, j denotes that the duration of the video segment is j, and X_{i,j}^c denotes the value of the c-th channel of the video segment whose start time is i and whose duration is j;
the vertical pooling is computed as
V_j^c = (1/T) Σ_{i=1}^{T} X_{i,j}^c
where V_j^c denotes the result of feeding X into the vertical pooling, D denotes all possible durations of the video segments, and the remaining notation is as above;
then a one-dimensional convolution with kernel size 3 is used to aggregate the information of each position and its neighbours, and the outputs of the two paths are fused to obtain the fusion result Y;
the output of the global perception unit is then
X_unit = Sig(Conv(Y)) × X × X
where X_unit denotes the result of the global perception unit, X denotes the input of the unit, Sig(·) denotes the activation function, and Conv(·) denotes a convolution operation;
the global perception unit is applied twice to obtain the global perception module, computed as
X_ga = Conv(GA(GA(X)))
where X_ga denotes the output of the global perception module, GA(·) denotes the global perception unit operation, and Conv(·) denotes a convolution operation.
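As a non-limiting illustrative sketch, the global perception unit described above can be written roughly as follows in PyTorch. The translated text leaves several details ambiguous, so the average strip pooling, the broadcast-sum fusion of the two paths, the 1×1 convolution before the sigmoid, and the residual gated output are assumptions:

```python
# Minimal PyTorch sketch of the global perception (GA) unit, under assumptions:
# average pooling along each strip, fusion by broadcast addition, and a residual
# gated output. X has shape (batch, C, D, T): channels, durations, start times.
import torch
import torch.nn as nn

class GlobalPerceptionUnit(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # kernel-size-3 1-D convolutions aggregate each position with its neighbours
        self.conv_h = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv_v = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv_out = nn.Conv2d(channels, channels, kernel_size=1)  # assumed 1x1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x.mean(dim=2)                  # horizontal pooling over durations -> (B, C, T)
        v = x.mean(dim=3)                  # vertical pooling over start times -> (B, C, D)
        h = self.conv_h(h).unsqueeze(2)    # (B, C, 1, T)
        v = self.conv_v(v).unsqueeze(3)    # (B, C, D, 1)
        y = h + v                          # fuse the two paths (assumed broadcast sum)
        gate = torch.sigmoid(self.conv_out(y))
        return gate * x + x                # assumed residual gating of the input
```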
Further, the feature enhancement module is specifically as follows:
a hierarchical structure is used to capture local information between candidate video segments; for an input X, downsampling (average pooling) is first used to aggregate the features of adjacent candidate boxes layer by layer;
when l = 0, X_0 is the input X; the global perception module is embedded in the feature enhancement module and all layers of the feature enhancement module share the same global perception module, so the output of each layer X_l^ga can be calculated as
X_l^ga = GAM(X_l)
where X_l^ga denotes the output of the global perception module at each layer and GAM(·) denotes the aforementioned global perception module operation;
then an upsampling operation is used to aggregate the features between different layers, computed as
F_l = X_l^ga + Up(X_{l+1}^ga)
where F_l denotes the output of the l-th layer after fusion with the other layers, Up(·) denotes the upsampling operation, and X_{l+1}^ga denotes the output of the (l+1)-th layer of the aforementioned global perception module.
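A non-limiting sketch of the feature enhancement module follows, reusing the GlobalPerceptionUnit sketched above. The pooling kernel/stride, the interpolation mode, and the number of layers are assumptions not fixed by the text:

```python
# Illustrative sketch: a pyramid built by 2x2 average pooling over the candidate map,
# a single global perception module shared by all layers, and top-down fusion by
# upsampling (F_l = X_l^ga + Up(X_{l+1}^ga)). Reuses GlobalPerceptionUnit from the
# previous sketch; hyper-parameters here are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class FeatureEnhancementModule(nn.Module):
    def __init__(self, channels: int, num_layers: int = 3):
        super().__init__()
        self.num_layers = num_layers
        # one global perception module (GA unit applied twice, then a conv) shared by every layer
        self.ga_module = nn.Sequential(
            GlobalPerceptionUnit(channels),
            GlobalPerceptionUnit(channels),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor):
        # downsampling path: aggregate features of adjacent candidate boxes
        levels, cur = [x], x
        for _ in range(1, self.num_layers):
            cur = F_nn.avg_pool2d(cur, kernel_size=2)
            levels.append(cur)
        # shared global perception module applied at every layer
        ga_outs = [self.ga_module(lv) for lv in levels]
        # upsampling path: fuse each layer with the layer above it
        fused = [None] * self.num_layers
        fused[-1] = ga_outs[-1]
        for l in range(self.num_layers - 2, -1, -1):
            up = F_nn.interpolate(ga_outs[l + 1], size=ga_outs[l].shape[-2:], mode="nearest")
            fused[l] = ga_outs[l] + up   # F_l = X_l^ga + Up(X_{l+1}^ga)
        return fused                     # per-layer enhanced candidate features
```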
Further, the steps S4 to S5 specifically include the following steps:
step S41: use average pooling to aggregate the relations between adjacent video segments in X_ACFM;
step S42: inputting the multi-level structure into a shared global perception module;
step S43: aggregating features output by global perception modules in adjacent levels;
step S44: the prediction score of each candidate video segment is obtained preliminarily through a shared convolution operation.
Further, the step S6 specifically includes the following steps:
step S61: generate the scores of all video segments and sort them by magnitude;
step S62: select the video segment with the highest score, attenuate the scores of the video segments that overlap it heavily, and repeat this process in turn until a specific number of candidate video segments is retained.
Compared with the prior art, the invention has the following beneficial effects: the relations between video segments are captured from two different angles (local information and global information). For the global information, the average-pooled features are used to establish relations between distant candidate video segments from a global perspective; for the local information, owing to the distribution characteristics of the candidate video segments, a hierarchical structure is used to aggregate information between adjacent candidate video segments. Because the local and global information complement each other, the whole model achieves better detection results;
for the prediction of candidate video segments, processing them separately ignores the constraint relations between them; by exploring the relations between candidate video segments, more complete and accurate results can be generated. Different candidate video segments in the same video are often highly correlated, so establishing the relations among all candidate video segments allows the features of action instances to be enhanced using background information; adjacent candidate video segments, moreover, overlap heavily, and aggregating the effective information between them with average pooling produces more accurate results.
Drawings
Fig. 1 is a flowchart of Embodiment 1.
Detailed Description
The invention will be further described with reference to the accompanying drawings. Embodiments of the present invention include, but are not limited to, the following examples.
Example 1
In this embodiment, the system mainly includes a global perception module and a feature enhancement module, where:
in the global perception module, a GA (global perception) unit is designed to establish relations between candidate video segments lying in the same row and the same column; for an input X, X is first fed into two parallel paths, one containing a horizontal pooling layer and the other a vertical pooling layer; the horizontal pooling is computed as
H_i^c = (1/D) Σ_{j=1}^{D} X_{i,j}^c
where H_i^c denotes the result of feeding X into the horizontal pooling, T denotes all possible start times in the video, c denotes the c-th channel, i denotes that the start time of the video segment is i, j denotes that the duration of the video segment is j, and X_{i,j}^c denotes the value of the c-th channel of the video segment whose start time is i and whose duration is j;
the vertical pooling is computed as
V_j^c = (1/T) Σ_{i=1}^{T} X_{i,j}^c
where V_j^c denotes the result of feeding X into the vertical pooling, D denotes all possible durations of the video segments, and the remaining notation is as above;
then a one-dimensional convolution with kernel size 3 is used to aggregate the information of each position and its neighbours, and the outputs of the two paths are fused to obtain the fusion result Y;
the output of the global perception unit is then
X_unit = Sig(Conv(Y)) × X × X
where X_unit denotes the result of the global perception unit, X denotes the input of the unit, Sig(·) denotes the activation function, and Conv(·) denotes a convolution operation;
the global perception unit is applied twice to obtain the global perception module, computed as
X_ga = Conv(GA(GA(X)))
where X_ga denotes the output of the global perception module, GA(·) denotes the global perception unit operation, and Conv(·) denotes a convolution operation;
in addition, in the feature enhancement module, a hierarchical structure is used to capture local information between candidate video segments. For an input X, downsampling (average pooling) is first used to aggregate the features of adjacent candidate boxes, with X_0 at layer l = 0 being the input X;
owing to the limitation of average pooling, only the relations between neighbouring candidate boxes can be aggregated and long-range dependencies cannot be captured; the global perception module is therefore embedded into the feature enhancement module, all layers of the feature enhancement module share the same global perception module, and the output of each layer X_l^ga is calculated as
X_l^ga = GAM(X_l)
in this way the relations between all candidate video segments are established from two complementary levels (local and global). The upsampling operation is then used to aggregate features between different layers; since every layer is supervised by label information, fusing the information of different layers reduces the noise introduced as far as possible, and the calculation is
F_l = X_l^ga + Up(X_{l+1}^ga)
thus the relations between different candidate video segments can be captured at different levels and different scales.
Based on the above, the temporal action detection method that perceives relations between video segments, as shown in Fig. 1, comprises the following steps:
step S1: sampling a video;
an appropriate training and testing data set is selected; training and testing are mainly performed on the public data sets ActivityNet-1.3 and THUMOS-14;
the ActivityNet-1.3 data set is a public data set for video segment generation and detection; it contains 19,994 videos covering 200 action classes. These videos were mainly crawled from the YouTube website and vary in resolution and duration. It served as the competition data set of the ActivityNet Challenge 2016 and 2017, and it divides all videos into training, validation and test sets at a ratio of 2:1:1;
the THUMOS-14 data set contains 413 videos covering 20 categories, of which the test set contains 212 videos and the validation set contains 200 videos used for the temporal action detection task; the whole model is trained on the validation set and its performance is evaluated on the test set.
Step S2: perform preliminary feature extraction on the video using a TSN (Temporal Segment Network) model to obtain features F;
First, for an unprocessed long video, the corresponding video frames are extracted and denoted as {s_n}, n = 1, …, N, where N denotes the total number of video frames and s_n denotes the n-th frame of the video. The set of labels of the video can be denoted as {(t_s,m, t_e,m, c_m)}, m = 1, …, M, where M denotes the number of action video segments contained in the video and t_s,m, t_e,m and c_m denote the start time, end time and category information of the m-th label, respectively; the labels are used only during training. The TSN model is then used to extract the features of each video, which are denoted F.
Step S3: perform feature enhancement on the extracted features F using the BaseNet model to obtain the boundary predictions P_S and P_E of the temporal nodes; in addition, extract the features of all candidate video segments and output X_ACFM;
The features F are enhanced by graph convolution (GCN) to obtain features that carry richer semantic information, computed as
F_gcn = GCN(F)
where F_gcn denotes the features enhanced by the graph convolution;
The feature F_gcn is then shared by two branch networks. One branch determines whether each temporal position is a start node or an end node, and its outputs can be denoted P_S and P_E, where T represents the length of the video feature sequence; the other branch outputs the features of all candidate video segments, X_ACFM.
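A non-limiting sketch of the boundary-prediction branch follows. The graph convolution is abstracted by a stand-in temporal convolution and the layer widths are assumptions, so this only illustrates how P_S and P_E are emitted per temporal position:

```python
# Hedged sketch: after feature enhancement, two 1-D convolution heads emit, for every
# temporal position, the probability of being a start node (P_S) or an end node (P_E).
# `enhance` is a stand-in for the graph-convolution enhancement described in the text.
import torch
import torch.nn as nn

class BoundaryPredictor(nn.Module):
    def __init__(self, in_channels: int, hidden: int = 256):
        super().__init__()
        self.enhance = nn.Conv1d(in_channels, hidden, kernel_size=3, padding=1)
        self.start_head = nn.Conv1d(hidden, 1, kernel_size=1)
        self.end_head = nn.Conv1d(hidden, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor):
        # feats: (batch, C, T) video feature sequence of length T
        f = torch.relu(self.enhance(feats))
        p_start = torch.sigmoid(self.start_head(f)).squeeze(1)  # (batch, T) -> P_S
        p_end = torch.sigmoid(self.end_head(f)).squeeze(1)      # (batch, T) -> P_E
        return p_start, p_end
```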
Step S4: capture the relations between the candidate video segment features X_ACFM using the global perception module and the feature enhancement module;
For the output X_ACFM of step S3, each position represents the feature of one candidate video segment. By feeding these features into the feature enhancement module, the relations between candidate video segments can be captured from both the local view and the global view, the features of action instances can be enhanced and the background information suppressed, so more accurate and complete results are generated;
the outputs are the prediction maps P_r and P_c, where P_r and P_c denote the two prediction results obtained under regression and classification supervision respectively, each produced by applying a convolution operation Conv(·) followed by the activation function Sig(·) to the fused features.
Step S5: combine the prediction results of step S3 and step S4 to generate the final evaluation score;
In order to make full use of the outputs of the whole model, the output of every layer of the feature enhancement module is fused into the final score of each candidate video segment, and the boundary information of each video segment is also taken into account. For a candidate video segment from t_s to t_e, its score is obtained by combining the boundary predictions P_S(t_s) and P_E(t_e) with a weighted combination of the per-layer outputs,
where the combination weights are hyper-parameters, F_l denotes the output of the l-th layer, and L denotes the total number of layers of the hierarchy.
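As a non-limiting sketch of step S5, the fusion below multiplies the boundary probabilities by a weighted sum of per-layer confidences; the exact fusion formula and hyper-parameter values are not reproduced in this text, so this product form is an assumption:

```python
# Hedged sketch of score fusion (step S5). Assumes the final score is the product of
# the boundary probabilities and a weighted sum of per-layer confidences, with the
# per-layer maps already resized to the full-resolution (duration, start-time) grid.
def fuse_score(p_start, p_end, layer_maps, t_s, t_e, layer_weights):
    """Final score of the candidate segment [t_s, t_e].

    p_start, p_end: (T,) boundary probabilities P_S and P_E.
    layer_maps: per-layer confidence maps of shape (D, T), layer 0 first.
    layer_weights: per-layer hyper-parameters (assumed, one per map).
    """
    dur = t_e - t_s
    conf = sum(w * pm[dur, t_s] for w, pm in zip(layer_weights, layer_maps))
    return p_start[t_s] * p_end[t_e] * conf
```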
Step S6: remove duplicate candidate video segments using the Soft-NMS model;
After all possible candidate boxes have been obtained, most of them overlap heavily, so the Soft-NMS model is used to remove redundant candidates. The scores of the candidate video segments are sorted by magnitude, the candidate video segment with the highest score is selected, and its IoU with every remaining video segment is computed; video segments with a high degree of overlap have their scores attenuated according to
s_i ← s_i · exp(−iou(m, b_i)² / σ), when iou(m, b_i) > θ
where σ denotes the parameter of the Gaussian function, θ denotes a predefined threshold, and m and b_i denote the currently selected candidate video segment and another video segment, respectively.
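A non-limiting NumPy sketch of the Gaussian Soft-NMS step follows; the sigma, threshold and retained-count values are placeholder examples, and the decay-only-above-threshold behaviour is taken from the description above:

```python
# Sketch of Gaussian Soft-NMS over candidate segments (step S6): scores of segments
# that overlap the currently selected one beyond a threshold are decayed by a
# Gaussian of their IoU. Parameter values below are examples, not the patented ones.
import numpy as np

def temporal_iou(seg_a, seg_b):
    """IoU of two segments given as (start, end)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(segments, scores, sigma=0.4, iou_threshold=0.65, keep_top=100):
    segments, scores = list(segments), np.asarray(scores, dtype=float)
    keep, remaining = [], list(range(len(segments)))
    while remaining and len(keep) < keep_top:
        best = max(remaining, key=lambda i: scores[i])   # highest-scoring candidate
        keep.append(best)
        remaining.remove(best)
        for i in remaining:                              # decay heavily overlapping segments
            iou = temporal_iou(segments[best], segments[i])
            if iou > iou_threshold:
                scores[i] *= np.exp(-(iou ** 2) / sigma)
    return keep, scores
```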
Step S7: classify the candidate video segments using the Unet classifier to obtain their category information.
After all possible candidate video segments have been obtained, the Unet classifier is used to classify them and obtain the final category information of the video segments; the result can be expressed as a set of (category, score) pairs, one per predicted action instance, where the category denotes the class information, the score is the corresponding confidence, and the number of pairs is the number of predicted action instances.
Example 2
In this embodiment, the whole model of Embodiment 1 needs to be trained; the overall loss function is expressed as
L = L_TEM + λ · L_FEM
where λ is a hyper-parameter and L_TEM is used to determine whether each temporal node is a start node or an end node, which can be expressed as
L_TEM = L_wce(P_S, G_S) + L_wce(P_E, G_E)
where L_wce(·,·) denotes a weighted cross-entropy loss function, P_S and G_S denote the prediction result of the start node and its corresponding label respectively, and P_E and G_E denote the prediction result of the end node and its corresponding label respectively. To train the feature enhancement module, a label is generated for the 0-th layer of the feature enhancement module, denoted G_0; since the feature enhancement module is a hierarchical structure and every layer is supervised, a label is generated for each layer in a corresponding manner.
The loss function of each layer can therefore be defined as
L_l = L_reg(P_r^l, G_r^l) + L_cls(P_c^l, G_c^l)
where L_l denotes the loss function of the l-th layer, P_r^l and G_r^l denote the prediction result obtained by that layer under regression supervision and the corresponding label, and P_c^l and G_c^l denote the prediction result obtained by that layer under classification supervision and the corresponding label; L_reg and L_cls denote the squared-error loss and the weighted cross-entropy loss, respectively. The loss of the hierarchy is then the weighted sum of the per-layer losses over all L layers, with the weights being hyper-parameters. Finally, the loss function L can be optimized in an end-to-end fashion.
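A non-limiting sketch of this training objective follows in PyTorch. The positive/negative re-balancing scheme of the weighted cross-entropy and the additive combination of the two parts are assumptions where the text does not fully specify them:

```python
# Hedged sketch of the training objective in Embodiment 2: weighted binary
# cross-entropy on the start/end predictions plus per-layer regression (squared
# error) and classification (weighted cross-entropy) terms for the feature
# enhancement module. Labels are expected as float tensors in [0, 1].
import torch

def weighted_bce(pred, label, eps=1e-6):
    """Weighted binary cross-entropy: positives and negatives are re-balanced (assumed 0.5/0.5 split)."""
    pos_ratio = (label > 0.5).float().mean().clamp(min=eps)
    w_pos = 0.5 / pos_ratio
    w_neg = 0.5 / (1.0 - pos_ratio).clamp(min=eps)
    loss = -(w_pos * label * torch.log(pred + eps)
             + w_neg * (1.0 - label) * torch.log(1.0 - pred + eps))
    return loss.mean()

def total_loss(p_start, g_start, p_end, g_end, layer_preds, layer_labels, lam=1.0):
    """L = L_TEM + lam * L_FEM (additive combination assumed)."""
    l_tem = weighted_bce(p_start, g_start) + weighted_bce(p_end, g_end)
    l_fem = 0.0
    for (p_reg, p_cls), (g_reg, g_cls) in zip(layer_preds, layer_labels):
        # squared-error term for regression, weighted cross-entropy for classification
        l_fem = l_fem + torch.mean((p_reg - g_reg) ** 2) + weighted_bce(p_cls, g_cls)
    return l_tem + lam * l_fem
```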
Example 3
In this embodiment, the validity of the method is verified by using the selected data set, which specifically includes:
the method is verified on the selected data set, in order to well evaluate the effectiveness of the embodiment, the average accuracy mAP is selected as a main evaluation index, the mAP is respectively calculated on an iou set {0.3,0.4,0.5,0.6 and 0.7} on the THUMOS-14 data set, the mAP on the iou set {0.5,0.75 and 0.95} is calculated on an activityNet1.3 data set, and in addition, the average mAP of ten different ious is calculated on the activityNet 1.3.
This embodiment is verified on the currently mainstream data set ActivityNet-1.3, and the final verification results are shown in the following table (model performance comparison on the ActivityNet-1.3 data set, %):
TABLE 1 Comparison of model performance on the ActivityNet-1.3 data set
This embodiment is also verified on the currently mainstream data set THUMOS-14, and the final verification results are shown in the following table (model performance comparison on the THUMOS-14 data set, %).
TABLE 2 Comparison of model performance on the THUMOS-14 data set
The above are embodiments of the present invention. The specific parameters in the above embodiments and examples are only intended to clearly illustrate the inventors' verification process and are not intended to limit the scope of protection of the invention, which is defined by the claims; all equivalent structural changes made using the contents of the specification and the drawings of the present invention shall likewise be covered by the scope of protection of the present invention.
Claims (5)
1. A temporal action detection method that perceives relations between video segments, characterized by comprising the following steps:
step S1: sampling a video;
step S2: performing preliminary feature extraction on the video using a TSN (Temporal Segment Network) model to obtain features F;
step S3: using a BaseNet model to perform feature enhancement on the extracted features F and generate the boundary predictions P_S and P_E of the temporal nodes; in addition, extracting the features of all candidate video segments, denoted X_ACFM, wherein each element of X_ACFM represents the feature of one candidate video segment;
step S4: capturing the relations between the candidate video segment features X_ACFM using a global perception module and a feature enhancement module;
step S5: combining the prediction results of step S3 and step S4 to generate a final judgment score;
step S6: removing repeated candidate video segments by using a Soft-NMS model;
step S7: classifying the candidate video segments by using a Unet classifier to obtain class information of the candidate video segments;
the global perception module is specifically as follows:
designing a global perception unit to establish relations between candidate video segments lying in the same row and the same column; for an input X, first feeding X into two parallel paths, one path containing a horizontal pooling layer and the other a vertical pooling layer; the horizontal pooling is computed as
H_i^c = (1/D) Σ_{j=1}^{D} X_{i,j}^c
wherein H_i^c represents the result of feeding X into the horizontal pooling, T represents all possible start times in the video, c represents the c-th channel, i represents that the start time of the video segment is i, j represents that the duration of the video segment is j, and X_{i,j}^c represents the value of the c-th channel of the video segment whose start time is i and whose duration is j;
the vertical pooling is computed as
V_j^c = (1/T) Σ_{i=1}^{T} X_{i,j}^c
wherein V_j^c represents the result of feeding X into the vertical pooling, D represents all possible durations of the video segments, and the remaining notation is as above;
then aggregating the information of each position and its neighbours using a one-dimensional convolution with kernel size 3, and fusing the outputs of the two paths to obtain the fusion result Y;
the output of the global perception unit is then
X_unit = Sig(Conv(Y)) × X × X
wherein X_unit represents the result of the global perception unit, X represents the input of the unit, Sig(·) represents the activation function, and Conv(·) represents a convolution operation;
repeating the global perception unit twice to obtain the global perception module, computed as
X_ga = Conv(GA(GA(X)))
wherein X_ga represents the output of the global perception module, GA(·) represents the global perception unit operation, and Conv(·) represents a convolution operation;
the feature enhancement module is specifically as follows:
a hierarchical structure is used to capture local information between candidate video segments; for an input X, downsampling is first used to aggregate the features of adjacent candidate boxes, wherein when l = 0, X_0 is the input X; the global perception module is embedded in the feature enhancement module and all layers in the feature enhancement module share the same global perception module, so the output of each layer X_l^ga can be calculated as
X_l^ga = GAM(X_l)
wherein X_l^ga represents the output result of each layer of the global perception module and GAM(·) represents the aforementioned global perception module operation;
then, the upsampling operation is used to aggregate features between different layers, calculated as
F_l = X_l^ga + Up(X_{l+1}^ga)
wherein F_l represents the output of the l-th layer after fusion with the other layers, Up(·) represents the upsampling operation, and X_{l+1}^ga represents the output of the (l+1)-th layer of the aforementioned global perception module.
2. The method according to claim 1, characterized in that step S2 specifically comprises the following steps:
step S21: first, sampling the long video at a certain time interval to obtain a certain number of video snippets;
step S22: inputting the video snippets into the TSN model, obtaining the visual features and the motion features respectively, and concatenating them.
3. The method according to claim 1, characterized in that step S3 specifically comprises the following steps:
step S31: establishing relations among all video frames using graph convolution, and dynamically fusing multi-scale contextual semantic information into the video features;
step S32: predicting each temporal position by graph convolution and outputting the probability that the position is a start node or an end node, i.e. generating the boundary predictions P_S and P_E of the temporal nodes; in addition, extracting the features of all candidate video segments and outputting X_ACFM.
4. The method according to claim 1, characterized in that steps S4 to S5 specifically comprise the following steps:
step S41: using average pooling to aggregate the relations between adjacent video segments in X_ACFM;
step S42: inputting the multi-level structure into a shared global perception module;
step S43: aggregating features output by global perception modules in adjacent levels;
step S44: the prediction score of each candidate video segment is obtained preliminarily through a shared convolution operation.
5. The method according to claim 1, characterized in that step S6 specifically comprises the following steps:
step S61: generating the scores of all video segments and sorting them by magnitude;
step S62: selecting the video segment with the highest score, attenuating the scores of the video segments that overlap it heavily, and repeating this process in turn until a specific number of candidate video segments is retained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110659154.1A CN113255570B (en) | 2021-06-15 | 2021-06-15 | Sequential action detection method for sensing video clip relation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110659154.1A CN113255570B (en) | 2021-06-15 | 2021-06-15 | Sequential action detection method for sensing video clip relation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113255570A CN113255570A (en) | 2021-08-13 |
CN113255570B true CN113255570B (en) | 2021-09-24 |
Family
ID=77187848
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110659154.1A Active CN113255570B (en) | 2021-06-15 | 2021-06-15 | Sequential action detection method for sensing video clip relation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113255570B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113920467B (en) * | 2021-12-13 | 2022-03-15 | 成都考拉悠然科技有限公司 | Tourist and commercial detection method and system combining booth detection and scene segmentation |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109241829A (en) * | 2018-07-25 | 2019-01-18 | 中国科学院自动化研究所 | The Activity recognition method and device of convolutional neural networks is paid attention to based on space-time |
CN110705339A (en) * | 2019-04-15 | 2020-01-17 | 中国石油大学(华东) | C-C3D-based sign language identification method |
CN110765854A (en) * | 2019-09-12 | 2020-02-07 | 昆明理工大学 | Video motion recognition method |
CN111079594A (en) * | 2019-12-04 | 2020-04-28 | 成都考拉悠然科技有限公司 | Video action classification and identification method based on double-current cooperative network |
CN111931602A (en) * | 2020-07-22 | 2020-11-13 | 北方工业大学 | Multi-stream segmented network human body action identification method and system based on attention mechanism |
CN112131943A (en) * | 2020-08-20 | 2020-12-25 | 深圳大学 | Video behavior identification method and system based on dual attention model |
CN112364852A (en) * | 2021-01-13 | 2021-02-12 | 成都考拉悠然科技有限公司 | Action video segment extraction method fusing global information |
CN112650886A (en) * | 2020-12-28 | 2021-04-13 | 电子科技大学 | Cross-modal video time retrieval method based on cross-modal dynamic convolution network |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11103773B2 (en) * | 2018-07-27 | 2021-08-31 | Yogesh Rathod | Displaying virtual objects based on recognition of real world object and identification of real world object associated location or geofence |
CN111372123B (en) * | 2020-03-03 | 2022-08-09 | 南京信息工程大学 | Video time sequence segment extraction method based on local to global |
- 2021-06-15: CN application CN202110659154.1A filed; granted as patent CN113255570B (status: Active)
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109241829A (en) * | 2018-07-25 | 2019-01-18 | 中国科学院自动化研究所 | The Activity recognition method and device of convolutional neural networks is paid attention to based on space-time |
CN110705339A (en) * | 2019-04-15 | 2020-01-17 | 中国石油大学(华东) | C-C3D-based sign language identification method |
CN110765854A (en) * | 2019-09-12 | 2020-02-07 | 昆明理工大学 | Video motion recognition method |
CN111079594A (en) * | 2019-12-04 | 2020-04-28 | 成都考拉悠然科技有限公司 | Video action classification and identification method based on double-current cooperative network |
CN111931602A (en) * | 2020-07-22 | 2020-11-13 | 北方工业大学 | Multi-stream segmented network human body action identification method and system based on attention mechanism |
CN112131943A (en) * | 2020-08-20 | 2020-12-25 | 深圳大学 | Video behavior identification method and system based on dual attention model |
CN112650886A (en) * | 2020-12-28 | 2021-04-13 | 电子科技大学 | Cross-modal video time retrieval method based on cross-modal dynamic convolution network |
CN112364852A (en) * | 2021-01-13 | 2021-02-12 | 成都考拉悠然科技有限公司 | Action video segment extraction method fusing global information |
Non-Patent Citations (4)
Title |
---|
BSN++: Complementary Boundary Regressor with Scale-Balanced Relation Modeling for Temporal Action Proposal Generation;Haisheng Su等;《arXiv》;20210301;1-9 * |
Cooperative Cross-Stream Network for Discriminative Action Representation;Jingran Zhang等;《arXiv》;20190827;1-10 * |
Temporal Context Aggregation Network for Temporal Action Proposal Refinement;Zhiwu Qing等;《arXiv》;20210324;1-10 * |
Research on video-based human action recognition with deep learning (基于深度学习的视频人体动作识别研究); Li Yiying (李怡颖); China Masters' Theses Full-text Database, Information Science and Technology Series; 20210115 (No. 01); I138-1923 *
Also Published As
Publication number | Publication date |
---|---|
CN113255570A (en) | 2021-08-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110175580B (en) | Video behavior identification method based on time sequence causal convolutional network | |
CN107330362B (en) | Video classification method based on space-time attention | |
CN111008337B (en) | Deep attention rumor identification method and device based on ternary characteristics | |
CN112150450B (en) | Image tampering detection method and device based on dual-channel U-Net model | |
CN110851176B (en) | Clone code detection method capable of automatically constructing and utilizing pseudo-clone corpus | |
Wang et al. | Spatial–temporal pooling for action recognition in videos | |
CN107341508B (en) | Fast food picture identification method and system | |
CN110532911B (en) | Covariance measurement driven small sample GIF short video emotion recognition method and system | |
CN111783712A (en) | Video processing method, device, equipment and medium | |
CN114187311A (en) | Image semantic segmentation method, device, equipment and storage medium | |
CN107220902A (en) | The cascade scale forecast method of online community network | |
CN110533053B (en) | Event detection method and device and electronic equipment | |
CN113255570B (en) | Sequential action detection method for sensing video clip relation | |
CN112364852B (en) | Action video segment extraction method fusing global information | |
CN115659966A (en) | Rumor detection method and system based on dynamic heteromorphic graph and multi-level attention | |
CN110263638B (en) | Video classification method based on significant information | |
Zou et al. | STA3D: Spatiotemporally attentive 3D network for video saliency prediction | |
CN114741599A (en) | News recommendation method and system based on knowledge enhancement and attention mechanism | |
Sharma et al. | Construction of large-scale misinformation labeled datasets from social media discourse using label refinement | |
CN113298015A (en) | Video character social relationship graph generation method based on graph convolution network | |
CN113010705A (en) | Label prediction method, device, equipment and storage medium | |
CN114092819B (en) | Image classification method and device | |
CN113033500B (en) | Motion segment detection method, model training method and device | |
Li et al. | Human perception evaluation system for urban streetscapes based on computer vision algorithms with attention mechanisms | |
Lin et al. | Temporal action localization with two-stream segment-based RNN |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |